Troubleshooting Pods in a Kubernetes Cluster That Cannot Resolve Service Domain Names
- 2026-01-12 19:41:22
- 丁国栋
Two symptoms:
1. Pods cannot resolve the domain names of Services inside Kubernetes
2. Pods cannot resolve external domain names
Both symptoms point to a problem with the DNS server configured for the Pods.
Check the DNS configuration in /etc/resolv.conf:
# cat /etc/resolv.conf
search cne-system.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.43.0.10
options ndots:5

The nameserver IP is an address inside the cluster; comparing it against the Pod CIDR and the Service CIDR shows it is a Service IP. kubectl get svc -n kube-system confirms it is the coredns Service IP, of type ClusterIP.
Note: the cluster was deployed with k3s. The k3s service command line also shows --cluster-cidr 10.42.0.0/16 and --service-cidr 10.43.0.0/16; the former is the Pod CIDR, the latter the Service CIDR.
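To double-check these ranges, something along these lines works (a sketch; it assumes a k3s server node and that the DNS Service keeps k3s's default name kube-dns in kube-system):

# Show the CIDR flags the k3s service was started with
systemctl cat k3s | grep -E -- '--(cluster|service)-cidr'
# Confirm the nameserver IP 10.43.0.10 is the ClusterIP of the DNS Service
kubectl get svc -n kube-system kube-dns -o wide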
So coredns was the first suspect. First check whether its Pod is Running and whether it has been restarting. It was in Running state with a restart count of 0.
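For reference, the status check looks like this (the k8s-app=kube-dns label is the upstream default that k3s also uses; treat it as an assumption for other setups):

# READY, STATUS and RESTARTS in one shot; -o wide also shows the node it runs on
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide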
kubectl describe pod showed no events. kubectl logs showed the Pod repeatedly printing:
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server

Comparing with a healthy cluster, the same log lines appear there too, so this is probably not the cause of the problem.
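The warnings come from import directives in the Corefile that point at optional override files. This can be confirmed by inspecting the config (a sketch; k3s stores the Corefile in the coredns ConfigMap and treats the coredns-custom ConfigMap as optional):

# Print the active Corefile; the "import /etc/coredns/custom/*.override" and
# "*.server" lines are what produce the warnings when no overrides exist
kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'
# If this ConfigMap is absent, the warnings are expected and harmless
kubectl -n kube-system get configmap coredns-custom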
On the node itself, public domain names resolved without any problem.
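The node-level check was simply along these lines (example.com stands in for whatever public name was actually tested):

# On the node: resolve a public name with the node's own resolver
nslookup example.com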
Following the official documentation at https://kubernetes.io/zh-cn/docs/tasks/administer-cluster/dns-debugging-resolution/, create a Pod that contains the nslookup tool:
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml

The file contents:
# curl -L https://k8s.io/examples/admin/dns/dnsutils.yaml -o -
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: registry.k8s.io/e2e-test-images/agnhost:2.39
    command:
      - sleep
      - "infinity"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
Run kubectl exec -i -t dnsutils -- nslookup kubernetes.default; no problem there:
root@k3s01:~# kubectl exec -i -t dnsutils -- nslookup kubernetes.default
Server:         10.43.0.10
Address:        10.43.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.43.0.1
root@k3s01:~#

Check the problem Service's domain name:
root@k3s01:~# kubectl exec -i -t dnsutils -- nslookup cne-api.cne-system.svc.cluster.local
Server:         10.43.0.10
Address:        10.43.0.10#53

Name:   cne-api.cne-system.svc.cluster.local
Address: 10.43.181.241
root@k3s01:~# kubectl exec -i -t dnsutils -- nslookup cne-api.cne-system.svc
Server:         10.43.0.10
Address:        10.43.0.10#53

Name:   cne-api.cne-system.svc.cluster.local
Address: 10.43.181.241
root@k3s01:~#

No problem here either.
Back in the namespace where the problem occurs, resolution inside the problem Pod was checked again; the problem was indeed still there.
Create a diagnostic Pod inside the problem namespace:

kubectl run -it dnsutils --namespace cne-system --image=registry.k8s.io/e2e-test-images/agnhost:2.39 --rm=true --command -- bash
Inside it, nslookup cne-api.cne-system.svc works fine. Some other running Pods could also resolve normally, while Pods on the same node as the problem Pod could not. The failure therefore follows the node, not the namespace, so it helps to pin a diagnostic Pod to the suspect node, as sketched below.
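A sketch of pinning the diagnostic Pod to the suspect node (k3s03 is the problem node identified further down; the --overrides approach is one way to force scheduling onto a specific node):

kubectl run dnsutils-pin -it --rm --namespace cne-system \
  --image=registry.k8s.io/e2e-test-images/agnhost:2.39 \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"k3s03"}}' \
  --command -- nslookup cne-api.cne-system.svc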
On a healthy node, run:
root@k3s01:~# nc -zv 10.43.0.10 53
Connection to 10.43.0.10 53 port [tcp/domain] succeeded!
root@k3s01:~#

On the problem node:
root@k3s03:~$ nc -zv 10.43.0.10 53
nc: connect to 10.43.0.10 port 53 (tcp) failed: No route to host
root@k3s03:~$

Checking the problem node with kubectl describe node reveals:
Events:
  Type     Reason                        Age                    From              Message
  ----     ------                        ----                   ----              -------
  Warning  CertificateExpirationWarning  42m (x358 over 59d)    k3s-cert-monitor  Node certificates require attention - restart k3s on this node to trigger automatic rotation: kubelet/client-kubelet.crt: certificate CN=system:node:k3s03,O=system:nodes will expire within 90 days at 2026-02-11T06:14:33Z, kubelet/serving-kubelet.crt: certificate CN=k3s03 will expire within 90 days at 2026-02-11T06:14:33Z, k3s-controller/client-k3s-controller.crt: certificate CN=system:k3s-controller expired at 2025-07-02T07:07:23Z, kube-proxy/client-kube-proxy.crt: certificate CN=system:kube-proxy expired at 2025-07-02T07:07:23Z
  Warning  CertificateExpirationWarning  12m (x2189 over 60d)   k3s-cert-monitor  Node certificates require attention - restart k3s on this node to trigger automatic rotation: kube-proxy/client-kube-proxy.crt: certificate CN=system:kube-proxy expired at 2025-07-02T07:07:23Z, kubelet/client-kubelet.crt: certificate CN=system:node:k3s03,O=system:nodes will expire within 90 days at 2026-02-11T06:14:33Z, kubelet/serving-kubelet.crt: certificate CN=k3s03 will expire within 90 days at 2026-02-11T06:14:33Z, k3s-controller/client-k3s-controller.crt: certificate CN=system:k3s-controller expired at 2025-07-02T07:07:23Z
It looks like k3s needs to be restarted to deal with the certificate problem. The key entry is the expired kube-proxy client certificate: when node certificates expire, kube-proxy is affected, presumably because it can no longer authenticate to the apiserver to sync Service and Endpoint rules, and that breaks Pod-to-Service communication from the node, matching the "No route to host" above.
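One way to verify that theory on the node (a sketch; it assumes kube-proxy's default iptables mode):

# On the problem node: look for the rules kube-proxy programs for the DNS ClusterIP
iptables-save | grep '10\.43\.0\.10'
# A healthy node shows KUBE-SERVICES entries for this IP; no matches means
# kube-proxy has stopped syncing Service rules from the apiserver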
After restarting the problem node, nc -zv 10.43.0.10 53 does indeed succeed:
root@k3s03:~$ nc -zv 10.43.0.10 53
Connection to 10.43.0.10 53 port [tcp/domain] succeeded!
root@k3s03:~$
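Certificate state can also be checked directly on disk, and rotation only needs a service restart rather than a full reboot (a sketch; the path is the k3s default, and the unit is k3s on server nodes, k3s-agent on agent nodes):

# Check when the kube-proxy client certificate expires
openssl x509 -enddate -noout -in /var/lib/rancher/k3s/agent/client-kube-proxy.crt
# Restart k3s to trigger automatic certificate rotation
systemctl restart k3s-agent    # on server nodes: systemctl restart k3s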
After restarting the problem node, Pods can resolve external domain names, but internal Service names still fail. So the next step is to check the node where coredns runs for the same problem. It does indeed show the same symptoms:
Events:
  Type     Reason                        Age                    From              Message
  ----     ------                        ----                   ----              -------
  Warning  CertificateExpirationWarning  85m (x367 over 59d)    k3s-cert-monitor  Node certificates require attention - restart k3s on this node to trigger automatic rotation: kubelet/client-kubelet.crt: certificate CN=system:node:k3s06,O=system:nodes will expire within 90 days at 2026-02-11T06:14:35Z, kubelet/serving-kubelet.crt: certificate CN=k3s06 will expire within 90 days at 2026-02-11T06:14:35Z, k3s-controller/client-k3s-controller.crt: certificate CN=system:k3s-controller expired at 2025-07-02T07:07:23Z, kube-proxy/client-kube-proxy.crt: certificate CN=system:kube-proxy expired at 2025-07-02T07:07:23Z
  Warning  CertificateExpirationWarning  55m (x2146 over 60d)   k3s-cert-monitor  Node certificates require attention - restart k3s on this node to trigger automatic rotation: kube-proxy/client-kube-proxy.crt: certificate CN=system:kube-proxy expired at 2025-07-02T07:07:23Z, kubelet/client-kubelet.crt: certificate CN=system:node:k3s06,O=system:nodes will expire within 90 days at 2026-02-11T06:14:35Z, kubelet/serving-kubelet.crt: certificate CN=k3s06 will expire within 90 days at 2026-02-11T06:14:35Z, k3s-controller/client-k3s-controller.crt: certificate CN=system:k3s-controller expired at 2025-07-02T07:07:23Z
  Normal   RegisteredNode                21m                    node-controller   Node k3s06 event: Registered Node k3s06 in Controller
root@k3s01:~#

So restart that node as well.
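With both nodes restarted, the earlier checks can be repeated from the problem namespace to confirm the fix (a verification sketch; it assumes a dnsutils Pod is still running there):

# Internal Service name and an external name, from the problem namespace
kubectl exec -n cne-system -it dnsutils -- nslookup cne-api.cne-system.svc
kubectl exec -n cne-system -it dnsutils -- nslookup example.com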