Handling Common Kubernetes Problems

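Problem 1: No Pod is reachable. For example, the CoreDNS Pod logs show that it cannot read from the Kubernetes API, and every Pod that has health-check probes fails its health checks.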

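Solution: Check the state of the server's cni0 interface, the routes, and the firewall (including iptables), and check whether the routes for the Pod network (the network on cni0) are abnormal. Normally every Pod IP should be reachable from within a K8S node. A ClusterIP may not be directly reachable from the host (although in some setups it is).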
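
The checks above could be run on the affected node roughly as follows. This is a minimal sketch, assuming the CNI plugin creates a bridge named cni0 (as in this problem) and that the firewall is iptables-based. The function names has_cni0_route and check_pod_network are hypothetical helpers, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: read `ip route` output on stdin and succeed
# only if at least one route goes through the cni0 bridge.
has_cni0_route() {
    grep -Eq 'dev cni0( |$)'
}

# Hypothetical wrapper around the manual checks; run it as root on the node.
check_pod_network() {
    ip addr show cni0          # is the bridge present, UP, and addressed?
    ip route                   # is there a route for the Pod network?
    iptables -S | head -n 50   # look for unexpected DROP/REJECT rules
    if ip route | has_cni0_route; then
        echo "route via cni0 found"
    else
        echo "no route via cni0"
    fi
}
```

Calling check_pod_network on a node prints the raw state first, so the final "found / not found" line can be compared against what ip route actually shows.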

Problem 2: A Pod restarts frequently. Neither kubectl describe pod nor kubectl logs --previous gives any useful information, and the events show: Back-off restarting failed container.

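Solution: Use kubectl get pod -o wide to find the node the Pod runs on, then use kubectl describe node to view that node's events. Check whether the events contain keywords such as OomGuardKillContainer or oom-guard to determine whether the Pod restarted because its process was killed for lack of memory.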
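
The lookup described above can be sketched as a pair of shell functions. The names find_oom_events and check_pod_oom are hypothetical. The keyword OOMKilled is an extra marker added here as an assumption, since it is the termination reason Kubernetes reports for a container killed for exceeding memory; the other two keywords come from the text above.

```shell
#!/bin/sh
# Hypothetical helper: filter text on stdin for OOM-related keywords
# (case-insensitive). OOMKilled is an assumed extra keyword.
find_oom_events() {
    grep -iE 'OomGuardKillContainer|oom-guard|OOMKilled'
}

# Hypothetical wrapper: $1 = Pod name, $2 = namespace.
check_pod_oom() {
    # Resolve the node the Pod is scheduled on.
    node=$(kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeName}')
    echo "Pod $1 runs on node: $node"
    # Scan the node description (including its events) for OOM markers.
    kubectl describe node "$node" | find_oom_events
}
```

If check_pod_oom prints nothing after the node name, the node's current events contain none of these keywords, and another cause should be considered.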

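Problem 3: A Pod is stuck in the Pending state, and its events show: 0/8 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }, 7 node(s) didn't match Pod's node affinity/selector. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.

Solution: The message shows that the Pod has node affinity or a node selector, and that the one node matching it carries the unreachable taint. That taint means the node is in the NotReady state, which inspecting the node confirmed. Fixing the node resolved the problem.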
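
To confirm which nodes are not Ready and which taints they carry, something like the following could be used. It is a sketch only: list_not_ready_nodes and show_not_ready_taints are hypothetical names, and the filter assumes the default column layout of kubectl get nodes (name first, status second).

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the names of nodes whose STATUS does not start with "Ready".
list_not_ready_nodes() {
    awk '$2 !~ /^Ready/ { print $1 }'
}

# Hypothetical wrapper: print the taints of every node that is not Ready.
show_not_ready_taints() {
    kubectl get nodes --no-headers | list_not_ready_nodes | while read -r node; do
        echo "== $node"
        kubectl get node "$node" -o jsonpath='{.spec.taints}'
        echo
    done
}
```

A node that is cordoned but healthy reports Ready,SchedulingDisabled and is deliberately not listed, because its STATUS still starts with Ready.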

Problem 4: Pods are in the Error and ContainerStatusUnknown states. Running kubectl describe on the Pod reports: The node was low on resource: ephemeral-storage. Container jnlp was using 92Ki, which exceeds its request of 0. Container azcopy was using 72Ki, which exceeds its request of 0. Container package was using 1570724Ki, which exceeds its request of 0. Checking the Pod's node shows NodeHasDiskPressure.

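Solution: The workload's containers did not set a resource request for ephemeral storage (requests.ephemeral-storage), so the system treats their request as 0. Once such containers actually use storage, they easily trigger the eviction policy. In this case the Pod was evicted because the node ran short of ephemeral storage (ephemeral-storage). The fix is to expand the node's storage. It is also best to set resource requests and limits for these Pods: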

# Set reasonable storage requests and limits in the Pod definition
spec:
  containers:
  - name: package
    resources:
      requests:
        ephemeral-storage: "2Gi"  # set according to actual usage
      limits:
        ephemeral-storage: "3Gi"  # cap usage to guard against runaway consumption
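
To choose those values, it helps to pull the per-container usage out of the eviction message. The following is a rough sketch: container_usage and check_disk_pressure are hypothetical helper names, and the pattern assumes the message wording quoted in Problem 4.

```shell
#!/bin/sh
# Hypothetical helper: read an eviction message on stdin and print
# "<container> <usage>" for every container it mentions.
container_usage() {
    grep -oE 'Container [^ ]+ was using [0-9]+[A-Za-z]*' |
        awk '{ print $2, $5 }'
}

# Hypothetical wrapper: $1 = node name. Shows the DiskPressure condition
# lines for the node, then eviction events across all namespaces.
check_disk_pressure() {
    kubectl describe node "$1" | grep -i 'DiskPressure'
    kubectl get events -A --field-selector reason=Evicted
}
```

For the message in Problem 4, the package container was using 1570724Ki, roughly 1.5 GiB, which is why a request of about 2Gi is a plausible starting point for it.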