k8s container start command: k8s container fails to start

Reposted

mob64ca14089531 2024-02-27 12:54:41

Tags: k8s container start command, kubernetes, docker, containers, centos. Category: cloud native, cloud computing

Kubernetes pitfalls I have run into

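Contents

Kubernetes pitfalls I have run into

1. Command auto-completion
2. The kubelet service fails to start with the error `Failed to start Kubernetes API Server`
3. Container registry mirror
4. Container time differs from host time
5. Pod creation fails with No API token found for service account "default"
6. Intermittent 5-second DNS delays

Problem description and cause
How to avoid the problem

7. Expired Kubernetes certificates

①. Overview
②. Fixing expired certificates
③. Errors seen after expiry and how to recover
④. Verifying the cluster

Worker nodes fail to start
Many Pods keep crashing or are unreachable
More to come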

A record of the pitfalls that my colleagues and I have run into.

1. Command auto-completion

yum -y install bash-completion
source <(kubectl completion bash)
echo "source <(kubectl completion bash)" >> ~/.bashrc

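2. The kubelet service fails to start with the error `Failed to start Kubernetes API Server`

This is caused by swap being enabled on the node; by default the kubelet refuses to start while swap is on. Turn swap off with `swapoff -a`, then restart the kubelet.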
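Note that `swapoff -a` only lasts until the next reboot. The sketch below also comments out swap entries in an fstab-style file so that the change persists. The helper name `disable_swap_entries` is made up for illustration, and the sketch assumes swap is declared in /etc/fstab rather than, say, in a systemd swap unit.

```shell
#!/usr/bin/env bash
# Comment out every active swap entry in an fstab-style file.
# disable_swap_entries is a hypothetical helper name, not a standard tool.
disable_swap_entries() {
    local fstab="$1"
    # Match lines that are not already comments and whose third field
    # (the filesystem type) is "swap"; a .bak copy of the file is kept.
    sed -i.bak -E 's/^([[:space:]]*[^#[:space:]][^[:space:]]*[[:space:]]+[^[:space:]]+[[:space:]]+swap[[:space:]].*)$/# \1/' "$fstab"
}

# Typical use on a node (run as root):
#   swapoff -a
#   disable_swap_entries /etc/fstab
#   systemctl restart kubelet
```

Review the resulting file before rebooting; the original is preserved next to it with a .bak suffix.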

3. Container registry mirror

[root@master ~]# cat /etc/docker/daemon.json
{
    "registry-mirrors": ["https://registry.docker-cn.com"],
    "graph": "/data/lib/docker",
    "bip": "172.16.0.1/24"
}

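// registry-mirrors   registry mirror address; here it is set to the official mirror for mainland China
// graph              storage path for images and containers; the default is /var/lib/docker (newer Docker releases use "data-root" in place of the deprecated "graph")
// bip                IP address and subnet of the docker0 bridge, which determines the container IP range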

4. Container time differs from host time

Copy the host's time zone file into the container:

docker cp /etc/localtime a9c27487faf4:/etc/localtime

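Then restart the container.

Alternatively, change the time zone inside the container:

docker exec -it <container-name> /bin/bash
ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime   # run inside the container
exit
docker restart <container-name>                           # run on the host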

Or bake the time zone into the image by adding the following line to the Dockerfile:

RUN /bin/cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo 'Asia/Shanghai' >/etc/timezone

5. Pod creation fails with No API token found for service account "default"

Error from server (ServerTimeout): error when creating "./pod1-deployment.yaml": No API token found for service account "default", retry after the token is automatically created and added to the service account

Cause: the service account has no API token configured.

Fix 1: disable the ServiceAccount admission plugin

Edit /etc/kubernetes/apiserver and remove ServiceAccount from the following line:

KUBE_ADMISSION_CONTROL="--admission-control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ServiceAccount,ResourceQuota"

Change it to:

KUBE_ADMISSION_CONTROL="--admission-control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ResourceQuota"

This approach is crude, and you may later run into cases where ServiceAccount is required.

Fix 2: configure ServiceAccount

①. First generate a key:

openssl genrsa -out /etc/kubernetes/serviceaccount.key 2048

②. Edit /etc/kubernetes/apiserver and add the following:

KUBE_API_ARGS="--service-account-key-file=/etc/kubernetes/serviceaccount.key"

③. Then edit /etc/kubernetes/controller-manager and add the following:

KUBE_CONTROLLER_MANAGER_ARGS="--service-account-private-key-file=/etc/kubernetes/serviceaccount.key"

Whichever fix you choose, restart the Kubernetes services afterwards:

systemctl restart etcd kube-apiserver kube-controller-manager kube-scheduler

The second fix is recommended, because pulling images from a private registry by default (configured later) also requires ServiceAccount.
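Before restarting the services for the second fix, it can be worth confirming that the key file is a well-formed RSA key, since both the API server and the controller manager read the same file. The following is a minimal sketch; `generate_sa_key` is a hypothetical helper name, and it assumes the `openssl` command-line tool is installed.

```shell
#!/usr/bin/env bash
# Generate an RSA key for service account tokens and confirm it is usable.
# generate_sa_key is a hypothetical helper; the target path is passed by the caller.
generate_sa_key() {
    local key="$1"
    openssl genrsa -out "$key" 2048 2>/dev/null || return 1
    # Restrict access: this is the private key used to sign tokens.
    chmod 600 "$key"
    # -check verifies the key's internal consistency; -noout suppresses the key dump.
    openssl rsa -in "$key" -check -noout >/dev/null 2>&1
}

# Usage (as root):
#   generate_sa_key /etc/kubernetes/serviceaccount.key && echo "key looks valid"
```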

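6. Intermittent 5-second DNS delays

Problem description and cause

Kubernetes clusters can hit intermittent 5-second DNS delays. Weaveworks published a blog post, "Racy conntrack and DNS lookup timeouts", that explains the cause in detail.

In short, because UDP is connectionless, the kernel netfilter module can run into three race conditions when it handles concurrent UDP packets on the same socket. Take the following conntrack and DNAT workflow as an example: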

+---------------------------+      Create a conntrack for a given packet if
|                           |      it does not exist; IP_CT_DIR_REPLY is
|    1. nf_conntrack_in     |      an invert of IP_CT_DIR_ORIGINAL tuple, so
|                           |      src of the reply tuple is not changed yet.
+------------+--------------+
             |
             v
+---------------------------+
|                           |
|     2. ipt_do_table       |      Find a matching DNAT rule.
|                           |
+------------+--------------+
             |
             v
+---------------------------+
|                           |      Update the reply tuples src part according
|    3. get_unique_tuple    |      to the DNAT rule in a way that it is not used
|                           |      by any already confirmed conntrack.
+------------+--------------+
             |
             v
+---------------------------+
|                           |      Mangle the packet destination port and address
|     4. nf_nat_packet      |      according to the reply tuple.
|                           |
+------------+--------------+
             |
             v
+----------------------------+
|                            |     Confirm the conntrack if there is no confirmed
|  5. __nf_conntrack_confirm |     conntrack with either the same original or
|                            |     a reply tuple; increment insert_failed counter
+----------------------------+     and drop the packet if it exists.

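The connect system call on a UDP socket does not create a conntrack entry immediately; the entry is only created once a packet is actually sent. This can lead to the following three problems:

• Two UDP packets both fail to find a conntrack entry in step 1, nf_conntrack_in, so two different packets each create the same conntrack entry (note that the five-tuple is identical).
• The conntrack entry for one UDP packet is confirmed by another UDP packet before the first has called get_unique_tuple.
• Two UDP packets select DNAT rules with two different endpoints in ipt_do_table.

All three scenarios cause the final step, __nf_conntrack_confirm, to fail, so one UDP packet is dropped. Both the GNU C library and musl libc send the A and AAAA DNS queries in parallel, so this kernel race can cause one of the two packets to be dropped. The client then retries after a timeout, which is usually 5 seconds.

At the time of writing, the third problem had not been fixed. The first two have been fixed, in kernels 5.0 and 4.19 respectively:

netfilter: nf_nat: skip nat clash resolution for same-origin entries (included in kernel v5.0)
netfilter: nf_conntrack: resolve clash for matching conntracks (included in kernel v4.19)

On public clouds these patches may also have been backported to older kernels. On Azure, for example, the two fixes are included in v4.15.0-1030.31 and v4.18.0-1006.6.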
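To see where a node stands relative to the kernel versions above, a rough check like the following can help. It is only a sketch: `version_ge` is a hypothetical helper built on `sort -V`, it compares upstream version numbers only, and it cannot detect vendor backports such as the Azure kernels mentioned above.

```shell
#!/usr/bin/env bash
# Return 0 if version $1 is greater than or equal to version $2.
# version_ge is a hypothetical helper that relies on version sorting (sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

kernel="$(uname -r | cut -d- -f1)"   # strip the distribution suffix, e.g. 4.18.0-1006 -> 4.18.0
if version_ge "$kernel" 5.0; then
    echo "kernel $kernel: both upstream fixes should be present"
elif version_ge "$kernel" 4.19; then
    echo "kernel $kernel: only the v4.19 fix is present upstream"
else
    echo "kernel $kernel: neither fix is present upstream; check for vendor backports"
fi

# A growing insert_failed counter suggests packets are being dropped by the race
# (requires the conntrack tool from conntrack-tools):
#   conntrack -S | grep -o 'insert_failed=[0-9]*'
```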

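How to avoid the problem

Avoiding the DNS delay means sidestepping the three race conditions above. There are several ways to do that:

①. Avoid concurrent DNS queries on a shared socket. For example, enable the single-request-reopen option in the Pod configuration so that the A and AAAA queries are sent from different sockets (this is a glibc resolver option; musl-based images such as Alpine ignore it):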

dnsConfig:
  options:
    - name: single-request-reopen

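②. Disable IPv6 so that no AAAA queries are sent. For example, add ipv6.disable=1 to the kernel command line in the Grub configuration (the node must be rebooted for this to take effect).

③. Use TCP. For example, enable the use-vc option in the Pod configuration to force DNS queries over TCP: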

dnsConfig:
  options:
    - name: single-request-reopen
    - name: ndots
      value: "5"
    - name: use-vc

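④. Use NodeLocal DNSCache. All Pod DNS queries then go through a DNS cache on the local node, which avoids DNAT and therefore sidesteps the kernel race. You can deploy it with the command below (note that it modifies the kubelet configuration and restarts the kubelet):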

kubectl apply -f https://github.com/feiskyer/kubernetes-handbook/raw/master/examples/nodelocaldns/nodelocaldns-kubenet.yaml
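For options ① and ③, one way to confirm that the dnsConfig entries actually reached a Pod is to inspect the /etc/resolv.conf that the kubelet generated for it. The sketch below parses a resolv.conf-style file; `has_resolver_option` is a hypothetical helper name, and the /tmp path in the usage comment is only an example.

```shell
#!/usr/bin/env bash
# Return 0 if an "options" line in a resolv.conf-style file contains the given option.
# has_resolver_option is a hypothetical helper name.
has_resolver_option() {
    local file="$1" opt="$2"
    awk -v opt="$opt" '
        $1 == "options" {
            for (i = 2; i <= NF; i++) if ($i == opt) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# Usage, copying the file out of a running Pod first:
#   kubectl exec <pod-name> -- cat /etc/resolv.conf > /tmp/resolv.conf
#   has_resolver_option /tmp/resolv.conf use-vc && echo "DNS over TCP is enabled"
```

Options that carry a value appear in the file as name:value, so the ndots entry above would be checked as ndots:5.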

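7. Expired Kubernetes certificates

①. Overview

In a typical Kubernetes installation, the cluster certificates are valid for one year. This includes the following certificates: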

• apiserver
• apiserver-kubelet-client
• apiserver-etcd-client
• front-proxy-client
• etcd/server
• etcd/peer
• etcd/healthcheck-client

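②. Fixing expired certificates

For manually generated certificates

When installing manually, simply set the certificate lifetime to N days when you generate it.

For kubeadm

Option 1: upgrade the Kubernetes cluster, which renews the certificates automatically

Option 2: modify kubeadm and recompile it

Option 3: regenerate the certificates

③. Errors seen after expiry and how to recover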

[root@k8s-master03 ~]# kubectl get pod
Unable to connect to the server: x509: certificate has expired or is not yet valid

Log message

The currently active client certificate has expired, but the server is not responsive. A restart may be necessary to retrieve new initial credentials.

Back up the certificates

[root@k8s-master03 ~]# cp -rp /etc/kubernetes /etc/kubernetes.bak

Remove the apiserver certificates

[root@k8s-master03 ~]# rm -f /etc/kubernetes/pki/apiserver*

Remove the front-proxy-client certificate

[root@k8s-master03 ~]# rm -f /etc/kubernetes/pki/front-proxy-client.*

Remove the etcd certificates

[root@k8s-master03 ~]# rm -rf /etc/kubernetes/pki/etcd/healthcheck-client.*
[root@k8s-master03 ~]# rm -rf /etc/kubernetes/pki/etcd/server.*
[root@k8s-master03 ~]# rm -rf /etc/kubernetes/pki/etcd/peer.*

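Regenerate the certificates (the `kubeadm alpha phase` subcommands used here match the v1.12 cluster in this example; from v1.13 onwards the phases live under `kubeadm init phase`)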

[root@k8s-master02 ~]# kubeadm alpha phase certs all --config kubeadm-config.yaml
[certificates] Generated apiserver-kubelet-client certificate and key.
[certificates] Generated apiserver certificate and key.
[certificates] apiserver serving cert is signed for DNS names [k8s-master02 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local k8s-master01 k8s-master02 k8s-master03 k8s-master-lb] and IPs [10.96.0.1 192.168.20.21 192.168.20.10 192.168.20.20 192.168.20.21 192.168.20.22 192.168.20.10]
[certificates] Generated front-proxy-client certificate and key.
[certificates] Generated etcd/healthcheck-client certificate and key.
[certificates] Generated apiserver-etcd-client certificate and key.
[certificates] Generated etcd/server certificate and key.
[certificates] etcd/server serving cert is signed for DNS names [k8s-master02 localhost k8s-master02] and IPs [127.0.0.1 ::1 192.168.20.21]
[certificates] Generated etcd/peer certificate and key.
[certificates] etcd/peer serving cert is signed for DNS names [k8s-master02 localhost k8s-master02] and IPs [192.168.20.21 127.0.0.1 ::1 192.168.20.21]
[certificates] valid certificates and keys now exist in "/etc/kubernetes/pki"
[certificates] Using the existing sa key.

Regenerate the kubeconfig files

[root@k8s-master02 ~]# mv /etc/kubernetes/*.conf /tmp/
[root@k8s-master02 ~]# kubeadm alpha phase kubeconfig all --config kubeadm-config.yaml
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/scheduler.conf"

Restart the kubelet

[root@k8s-master01 ~]# systemctl restart kubelet

④. Verifying the cluster

Check the certificate expiry dates

# Note: cfssl must be installed separately
[root@k8s-master01 ~]# cfssl-certinfo -cert /etc/kubernetes/pki/etcd/server.crt |grep not
  "not_before": "2020-06-09T05:15:32Z",
  "not_after": "2021-06-09T05:15:32Z",
# Installing cfssl
[root@k8s-master01 ~]# wget https://pkg.cfssl.org/R1.2/cfssl_linux-amd64
[root@k8s-master01 ~]# chmod +x cfssl_linux-amd64
[root@k8s-master01 ~]# mv cfssl_linux-amd64 /usr/local/bin/cfssl
[root@k8s-master01 ~]# wget https://pkg.cfssl.org/R1.2/cfssljson_linux-amd64
[root@k8s-master01 ~]# chmod +x cfssljson_linux-amd64
[root@k8s-master01 ~]# mv cfssljson_linux-amd64 /usr/local/bin/cfssljson
[root@k8s-master01 ~]# wget https://pkg.cfssl.org/R1.2/cfssl-certinfo_linux-amd64
[root@k8s-master01 ~]# chmod +x cfssl-certinfo_linux-amd64
[root@k8s-master01 ~]# mv cfssl-certinfo_linux-amd64 /usr/local/bin/cfssl-certinfo
[root@k8s-master01 ~]# export PATH=/usr/local/bin:$PATH
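If installing cfssl is not convenient, the same check can be done with `openssl`, which is usually already present on the node. The sketch below prints a certificate's expiry date and warns when it is close. `check_cert_expiry` is a hypothetical helper name and the 30-day default threshold is arbitrary. Newer kubeadm releases (v1.20 and later) also provide `kubeadm certs check-expiration`, which is not available on the v1.12 cluster shown here.

```shell
#!/usr/bin/env bash
# Print the expiry date of a PEM certificate and warn if it expires soon.
# check_cert_expiry is a hypothetical helper; the 30-day default is arbitrary.
check_cert_expiry() {
    local cert="$1" days="${2:-30}"
    openssl x509 -in "$cert" -noout -enddate || return 2
    # -checkend N exits non-zero if the certificate expires within N seconds.
    if openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null; then
        echo "OK: valid for more than $days days"
    else
        echo "WARNING: expires within $days days"
        return 1
    fi
}

# Usage on a kubeadm control-plane node:
#   for c in /etc/kubernetes/pki/*.crt /etc/kubernetes/pki/etcd/*.crt; do
#       echo "== $c"; check_cert_expiry "$c"
#   done
```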

Check the cluster status

[root@k8s-master01 ~]# kubectl get no
NAME           STATUS   ROLES    AGE     VERSION
k8s-master01   Ready    master   6d22h   v1.12.3
k8s-master02   Ready    master   6d22h   v1.12.3
k8s-master03   Ready    master   6d22h   v1.12.3
k8s-node01     Ready    <none>   6d21h   v1.12.3
k8s-node02     Ready    <none>   6d21h   v1.12.3

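Worker nodes fail to start

If the master node's IP address changes, the worker nodes can no longer start. Reinstall the cluster and make sure every node has a fixed private IP address.

Many Pods keep crashing or are unreachable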

kubectl get pods --all-namespaces

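After a restart you may find that many Pods are not in the Running state. In that case, delete the abnormal Pods with the command below. If a Pod was created by a controller such as a Deployment or StatefulSet, Kubernetes creates a new Pod to replace it, and the recreated Pod usually works normally.

kubectl delete pod <pod-name> -n <pod-namespace>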
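When many Pods are affected, deleting them one at a time gets tedious. The sketch below filters the output of `kubectl get pods --all-namespaces` down to Pods whose STATUS is neither Running nor Completed. `list_unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout (NAMESPACE, NAME, READY, STATUS, ...).

```shell
#!/usr/bin/env bash
# Read `kubectl get pods --all-namespaces` output on stdin and print
# "namespace pod" for every Pod whose STATUS is not Running or Completed.
# list_unhealthy_pods is a hypothetical helper name.
list_unhealthy_pods() {
    # NR > 1 skips the header row; column 4 is STATUS in the default layout.
    awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1, $2 }'
}

# Review the list first, then delete:
#   kubectl get pods --all-namespaces | list_unhealthy_pods
#   kubectl get pods --all-namespaces | list_unhealthy_pods |
#       while read -r ns pod; do kubectl delete pod "$pod" -n "$ns"; done
```

Pods that are not managed by a controller are not recreated after deletion, so it is worth reading the list before piping it into the delete loop.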

