tcp write error broken pipe - Kubernetes

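Today one of our services backed by frankenphp (Caddy + PHP) became slow to respond, then recovered on its own after a short while. The service runs in Kubernetes. Its application logs showed no errors, but the Pod logs contained the following: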

# kubectl logs -f --tail 10 -n quickon-system zentaopaas-frankenphp-5545d648c6-rvlnq
{"level":"warn","ts":1759973384.78088,"logger":"frankenphp","msg":"write error","error":"write tcp 10.42.8.201:80->10.42.2.0:48146: write: broken pipe"}
{"level":"warn","ts":1759973517.6707053,"logger":"frankenphp","msg":"write error","error":"write tcp 10.42.8.201:80->10.42.2.0:25396: write: broken pipe"}
{"level":"info","ts":1759973625.34224,"logger":"tls","msg":"storage cleaning happened too recently; skipping for now","storage":"FileStorage:/data/caddy","instance":"25237dcd-61d2-4ae3-b9d9-5c968a7ef5b6","try_again":1760060025.342238,"try_again_in":86399.999999672}
{"level":"info","ts":1759973625.3431087,"logger":"tls","msg":"finished cleaning storage units"}
{"level":"warn","ts":1759973644.7477026,"logger":"frankenphp","msg":"write error","error":"write tcp 10.42.8.201:80->10.42.1.0:50412: write: broken pipe"}
{"level":"warn","ts":1759973702.706567,"logger":"frankenphp","msg":"write error","error":"write tcp 10.42.8.201:80->10.42.2.0:32784: write: broken pipe"}
{"level":"warn","ts":1759973781.1995344,"logger":"frankenphp","msg":"write error","error":"write tcp 10.42.8.201:80->10.42.2.0:61128: write: broken pipe"}
{"level":"warn","ts":1759973810.600418,"logger":"frankenphp","msg":"write error","error":"write tcp 10.42.8.201:80->10.42.2.0:28986: write: broken pipe"}
{"level":"warn","ts":1759973870.5739777,"logger":"frankenphp","msg":"write error","error":"write tcp 10.42.8.201:80->10.42.2.0:48154: write: broken pipe"}
{"level":"warn","ts":1759973870.5742023,"logger":"frankenphp","msg":"write error","error":"write tcp 10.42.8.201:80->10.42.2.0:64440: write: broken pipe"}

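In the logs above, 10.42.8.201 is the Pod's IP, while 10.42.2.0 and 10.42.1.0 are the IPs of the flannel.1 interface on two cluster nodes. The Nginx Ingress Pods also run on those two nodes. All of these addresses fall within the Pod CIDR.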
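
To see which peers the warnings point at, the log lines can be tallied by remote address. The following is a minimal Python sketch and was not part of the original troubleshooting. The Pod CIDR 10.42.0.0/16 is an assumption inferred from the addresses in the log, and the names POD_CIDR, PEER_RE and broken_pipe_peers are made up for illustration.

```python
import ipaddress
import json
import re
from collections import Counter

# Assumption: the cluster's Pod CIDR. 10.42.0.0/16 is only inferred from the
# addresses seen in the log; check the real value in your cluster settings.
POD_CIDR = ipaddress.ip_network("10.42.0.0/16")

# Matches "write tcp LOCAL:PORT->REMOTE:PORT: write: broken pipe" (IPv4 only).
PEER_RE = re.compile(
    r"write tcp ([\d.]+):(\d+)->([\d.]+):(\d+): write: broken pipe"
)


def broken_pipe_peers(lines):
    """Count broken-pipe write errors per remote IP in JSON log lines."""
    peers = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        match = PEER_RE.search(str(entry.get("error", "")))
        if match:
            peers[match.group(3)] += 1  # group 3 is the remote IP
    return peers


if __name__ == "__main__":
    sample = [
        '{"level":"warn","ts":1759973384.78088,"logger":"frankenphp","msg":"write error","error":"write tcp 10.42.8.201:80->10.42.2.0:48146: write: broken pipe"}',
        '{"level":"warn","ts":1759973517.6707053,"logger":"frankenphp","msg":"write error","error":"write tcp 10.42.8.201:80->10.42.2.0:25396: write: broken pipe"}',
        '{"level":"warn","ts":1759973644.7477026,"logger":"frankenphp","msg":"write error","error":"write tcp 10.42.8.201:80->10.42.1.0:50412: write: broken pipe"}',
    ]
    for ip, count in broken_pipe_peers(sample).most_common():
        in_cidr = ipaddress.ip_address(ip) in POD_CIDR
        print(ip, count, "in Pod CIDR" if in_cidr else "outside Pod CIDR")
```

In practice the lines could be piped in from kubectl logs instead of the hard-coded sample.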

The error suggests a problem somewhere between frankenphp and Nginx Ingress, or between frankenphp and the nodes.

In https://github.com/caddyserver/caddy/issues/6000, mohammed90 explains: "This is harmless, as you're experiencing. This usually happens when the client disconnects without completing the handshake and the necessary components of a proper HTTP connection." In other words, the warning is harmless and usually means the client disconnected before the connection was properly established. He also notes: "They should be concerning to you because they mean the client is not closing the connection properly/cleanly. It's harmless in in smaller numbers as it can mean the client closed the webpage/app in the middle of the loading request. However, in large numbers this is a form of an attack to exhaust your server. It's good to know about it." So it is worth having these errors logged and noticed, especially when they appear in large numbers; when there are only a few, they can usually be ignored.
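
Whether the warnings count as "smaller numbers" or "large numbers" is easier to judge from a rate than from scrolling logs. The sketch below buckets broken-pipe entries per minute using the ts field. It is only an illustration: the function names and the threshold parameter are hypothetical, and a sensible threshold depends on the service's traffic.

```python
import json
from collections import Counter


def broken_pipe_per_minute(lines):
    """Return a Counter mapping minute bucket (epoch seconds // 60) to error count."""
    buckets = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON lines
        if not isinstance(entry, dict) or "ts" not in entry:
            continue
        if "broken pipe" in str(entry.get("error", "")):
            buckets[int(float(entry["ts"]) // 60)] += 1
    return buckets


def busy_minutes(buckets, threshold):
    """Minute buckets whose error count reaches the (hypothetical) threshold."""
    return sorted(minute for minute, count in buckets.items() if count >= threshold)
```

Feeding it the Pod log and picking a threshold that fits normal traffic gives a quick view of whether the errors are a trickle or a burst.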

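Coming back to the slowness itself, the way forward is still to narrow it down segment by segment and layer by layer:

When one service is slow, are the other services proxied by Nginx Ingress slow as well?
Is it still slow when bypassing Nginx Ingress and accessing the service directly from inside its Pod?
For network-level problems, a packet capture can also help.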
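
The comparison between the Ingress path and the direct path can be made concrete by timing the same request over both. This is a rough Python sketch under stated assumptions: both URLs in the usage part are placeholders rather than real endpoints, and time_requests and summarize are illustrative helper names.

```python
import statistics
import time
import urllib.request


def time_requests(url, attempts=5, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch url several times and return the duration of each attempt in seconds."""
    durations = []
    for _ in range(attempts):
        start = time.perf_counter()
        with opener(url, timeout=timeout) as response:
            response.read()  # include body transfer in the measurement
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Placeholder URLs: replace with the Ingress hostname and the Pod address.
    targets = {
        "via ingress": "http://service.example.com/",
        "direct to pod": "http://10.42.8.201:80/",
    }
    for label, url in targets.items():
        print(label, summarize(time_requests(url)))
```

If the direct path is fast while the Ingress path is slow, the problem is more likely in front of the Pod; if both are slow, the service itself is the first suspect.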

Other improvements and things to watch:

Whether the Pod has TCP probes and HTTP probes configured
Monitoring response times
Good service log design
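
On the probe point, it helps to remember what each kind of check actually proves. The Python sketch below only approximates the two checks and is not how the kubelet implements them: a TCP check passes as soon as the port accepts a connection, while an HTTP check needs a real response with a status from 200 to 399. The function names are illustrative.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_probe(host, port, timeout=1.0):
    """True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_probe(url, timeout=1.0):
    """True if url answers within timeout with a status in the 200-399 range."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, 4xx/5xx and malformed replies.
        return False
```

A service that accepts connections but responds slowly or not at all can pass the TCP check and still fail the HTTP one, which is why the HTTP check is the more useful signal for a slowdown like this one.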