Troubleshooting Pods in a Kubernetes Cluster That Cannot Resolve Service Domain Names

2026-01-12 19:41:22
丁国栋
Abstract: This article records the troubleshooting of an issue where Pods inside a Kubernetes cluster could not resolve Service (svc) domain names.

Two symptoms:

1. Pods cannot resolve the domain names of in-cluster Kubernetes Services

2. Pods cannot resolve external domain names

Both symptoms point to the DNS server configured for the Pods.
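
A quick way to reproduce both failures from inside an affected Pod (the Pod name below is a placeholder, and this assumes the image ships nslookup):

# <problem-pod> is hypothetical; substitute the name of an affected Pod
kubectl exec -n cne-system <problem-pod> -- nslookup cne-api.cne-system.svc
kubectl exec -n cne-system <problem-pod> -- nslookup www.example.com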

Check the DNS configuration in the Pod's /etc/resolv.conf:


# cat /etc/resolv.conf 
search cne-system.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.43.0.10
options ndots:5
The nameserver IP is an in-cluster address. Comparing it against the Pod CIDR and the Service CIDR shows it falls within the Service CIDR, so it is a Service IP. kubectl get svc -n kube-system confirms it is the ClusterIP of the coredns Service.
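
For reference, listing the Services in kube-system should show the DNS Service with that ClusterIP (in k3s the CoreDNS Service is conventionally named kube-dns; adjust if yours differs):

# The DNS Service should report CLUSTER-IP 10.43.0.10 and type ClusterIP
kubectl get svc -n kube-system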


Note: the cluster was deployed with k3s. The k3s service command line also shows --cluster-cidr 10.42.0.0/16 and --service-cidr 10.43.0.0/16; the former is the Pod CIDR, the latter the Service CIDR.
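
With a default systemd-based k3s install, those flags can be read from the unit file (they may instead live in /etc/rancher/k3s/config.yaml, depending on how the cluster was set up):

# Run on a server node; greps the ExecStart line for the CIDR flags
systemctl cat k3s | grep -E 'cluster-cidr|service-cidr'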

So CoreDNS became the suspect. First check whether its Pod is Running and whether it has been restarting: it was in the Running state with a restart count of 0.
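
The status check, assuming the k8s-app=kube-dns label that k3s applies to its CoreDNS Pods:

# -o wide also shows which node hosts CoreDNS, which becomes relevant later
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide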

kubectl describe pod showed no events. kubectl logs showed the Pod repeatedly printing:


[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.override
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
A healthy cluster prints the same two warnings (they come from the import globs for k3s's optional custom CoreDNS config), so they are unlikely to be the cause of this problem.
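
For completeness, the checks above were along these lines (same label-selector assumption as before):

kubectl -n kube-system describe pod -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100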


On the node itself, public domain names resolved without problems.
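
The node-level check, using an arbitrary public name (dig or getent works equally well if nslookup is not installed on the node):

# Run directly on the node; this uses the node's own /etc/resolv.conf, not cluster DNS
nslookup www.example.com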

Following the official documentation at https://kubernetes.io/zh-cn/docs/tasks/administer-cluster/dns-debugging-resolution/, create a Pod containing the nslookup tool:

kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
The file contents:
# curl -L https://k8s.io/examples/admin/dns/dnsutils.yaml -o -
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: registry.k8s.io/e2e-test-images/agnhost:2.39
    command:
      - sleep
      - "infinity"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
Run kubectl exec -i -t dnsutils -- nslookup kubernetes.default; no problem there:
root@k3s01:~# kubectl exec -i -t dnsutils -- nslookup kubernetes.default
Server:         10.43.0.10
Address:        10.43.0.10#53
Name:   kubernetes.default.svc.cluster.local
Address: 10.43.0.1
root@k3s01:~#
Check the problematic Service's domain name:
root@k3s01:~# kubectl exec -i -t dnsutils -- nslookup cne-api.cne-system.svc.cluster.local
Server:         10.43.0.10
Address:        10.43.0.10#53
Name:   cne-api.cne-system.svc.cluster.local
Address: 10.43.181.241
root@k3s01:~# kubectl exec -i -t dnsutils -- nslookup cne-api.cne-system.svc
Server:         10.43.0.10
Address:        10.43.0.10#53
Name:   cne-api.cne-system.svc.cluster.local
Address: 10.43.181.241
root@k3s01:~# 
No problem there either.


Back in the affected namespace, resolution inside the problem Pod was checked again; the failure was indeed still there.

Create a diagnostic Pod inside the affected namespace:

kubectl run -it dnsutils --namespace cne-system --image=registry.k8s.io/e2e-test-images/agnhost:2.39 --rm=true --command -- bash

From that Pod, nslookup cne-api.cne-system.svc also worked. Some other running Pods could resolve names normally too, but Pods on the same node as the problem Pod could not. So the failure is scoped to one node rather than to the namespace or to CoreDNS itself.
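
One way to confirm the node-local scope is to pin a throwaway diagnostic Pod to the suspect node; a sketch (k3s03 is the problem node identified below, and the --overrides JSON is a minimal way to force scheduling there):

kubectl run dnsutils-pin -n cne-system --rm -it \
  --image=registry.k8s.io/e2e-test-images/agnhost:2.39 \
  --overrides='{"spec":{"nodeName":"k3s03"}}' \
  --command -- nslookup cne-api.cne-system.svc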

On a healthy node, run:

root@k3s01:~# nc -zv 10.43.0.10 53
Connection to 10.43.0.10 53 port [tcp/domain] succeeded!
root@k3s01:~# 
On the problem node:
root@k3s03:~$ nc -zv 10.43.0.10 53
nc: connect to 10.43.0.10 port 53 (tcp) failed: No route to host
root@k3s03:~$ 
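"No route to host" to a ClusterIP suggests the node's Service NAT rules are broken or stale. With kube-proxy in iptables mode (the k3s default), the rules for the DNS Service can be inspected directly on the node; a sketch:

# Run on the problem node: dump the NAT table and look for rules covering the DNS ClusterIP
# Missing or stale KUBE-SVC/KUBE-SEP entries here indicate kube-proxy is not syncing
iptables-save -t nat | grep 10.43.0.10
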
Inspecting the problem node with kubectl describe node revealed:
Events:
  Type     Reason                        Age                   From              Message
  ----     ------                        ----                  ----              -------
  Warning  CertificateExpirationWarning  42m (x358 over 59d)   k3s-cert-monitor  Node certificates require attention - restart k3s on this node to trigger automatic rotation: kubelet/client-kubelet.crt: certificate CN=system:node:k3s03,O=system:nodes will expire within 90 days at 2026-02-11T06:14:33Z, kubelet/serving-kubelet.crt: certificate CN=k3s03 will expire within 90 days at 2026-02-11T06:14:33Z, k3s-controller/client-k3s-controller.crt: certificate CN=system:k3s-controller expired at 2025-07-02T07:07:23Z, kube-proxy/client-kube-proxy.crt: certificate CN=system:kube-proxy expired at 2025-07-02T07:07:23Z
  Warning  CertificateExpirationWarning  12m (x2189 over 60d)  k3s-cert-monitor  Node certificates require attention - restart k3s on this node to trigger automatic rotation: kube-proxy/client-kube-proxy.crt: certificate CN=system:kube-proxy expired at 2025-07-02T07:07:23Z, kubelet/client-kubelet.crt: certificate CN=system:node:k3s03,O=system:nodes will expire within 90 days at 2026-02-11T06:14:33Z, kubelet/serving-kubelet.crt: certificate CN=k3s03 will expire within 90 days at 2026-02-11T06:14:33Z, k3s-controller/client-k3s-controller.crt: certificate CN=system:k3s-controller expired at 2025-07-02T07:07:23Z

So k3s needs to be restarted on this node to handle the certificate problem. Note the kube-proxy entry in particular: client-kube-proxy.crt expired on 2025-07-02. With an expired client certificate, kube-proxy can no longer watch the API server, so the node's Service NAT rules go stale; once the Endpoints behind a Service change, traffic to its ClusterIP is DNAT'ed to Pod IPs that no longer exist, which would explain the "No route to host" error above.
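
The restart itself, per the event message (the unit name depends on the node's role):

# Restarting k3s triggers automatic certificate rotation
systemctl restart k3s          # on a server node
# systemctl restart k3s-agent  # on an agent node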

After restarting the problem node, nc -zv 10.43.0.10 53 did indeed succeed:

root@k3s03:~$ nc -zv 10.43.0.10 53
Connection to 10.43.0.10 53 port [tcp/domain] succeeded!
root@k3s03:~$ 

After that restart, Pods could resolve external domain names, but in-cluster Service names still failed. One plausible explanation: CoreDNS forwards external queries upstream directly, but it answers Service names from its watch of the API server, which it reaches via the kubernetes Service ClusterIP, a path that in turn depends on a working kube-proxy on the CoreDNS node. So the next step was to check the node hosting CoreDNS for the same problem. It did indeed have it:

Events:
  Type     Reason                        Age                   From              Message
  ----     ------                        ----                  ----              -------
  Warning  CertificateExpirationWarning  85m (x367 over 59d)   k3s-cert-monitor  Node certificates require attention - restart k3s on this node to trigger automatic rotation: kubelet/client-kubelet.crt: certificate CN=system:node:k3s06,O=system:nodes will expire within 90 days at 2026-02-11T06:14:35Z, kubelet/serving-kubelet.crt: certificate CN=k3s06 will expire within 90 days at 2026-02-11T06:14:35Z, k3s-controller/client-k3s-controller.crt: certificate CN=system:k3s-controller expired at 2025-07-02T07:07:23Z, kube-proxy/client-kube-proxy.crt: certificate CN=system:kube-proxy expired at 2025-07-02T07:07:23Z
  Warning  CertificateExpirationWarning  55m (x2146 over 60d)  k3s-cert-monitor  Node certificates require attention - restart k3s on this node to trigger automatic rotation: kube-proxy/client-kube-proxy.crt: certificate CN=system:kube-proxy expired at 2025-07-02T07:07:23Z, kubelet/client-kubelet.crt: certificate CN=system:node:k3s06,O=system:nodes will expire within 90 days at 2026-02-11T06:14:35Z, kubelet/serving-kubelet.crt: certificate CN=k3s06 will expire within 90 days at 2026-02-11T06:14:35Z, k3s-controller/client-k3s-controller.crt: certificate CN=system:k3s-controller expired at 2025-07-02T07:07:23Z
  Normal   RegisteredNode                21m                   node-controller   Node k3s06 event: Registered Node k3s06 in Controller
root@k3s01:~# 
So this node was restarted as well.
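
After both restarts, the earlier checks are worth re-running end to end; something like:

# Service-name lookup via the dnsutils Pod from the docs example, assuming it is still running
kubectl exec -i -t dnsutils -- nslookup cne-api.cne-system.svc
# Raw TCP reachability of the DNS Service, run on each previously affected node
nc -zv 10.43.0.10 53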
