Setup: a Windows 10 host runs VirtualBox, VirtualBox runs a CentOS 7 guest, and docker is installed on CentOS 7.
docker is given a fixed IP, 192.168.1.x; the host IP is 192.168.1.x.

1. The Windows 10 host rebooted on its own for no apparent reason;
2. Afterwards I moved to a new network, where the host IP became 192.168.0.x.
docker would not start.

I went back to the original 192.168.1.x network to recover.
The recovery steps follow:

I. Diagnosis: docker fails to start after CentOS boots

  • 1. Run docker ps -a
Cannot connect to the Docker daemon. Is the docker daemon running on this host?
  • 2. Run systemctl start docker
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
  • 3. Run systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2021-02-15 03:17:58 UTC; 6min ago
     Docs: https://docs.docker.com
  Process: 2701 ExecStart=/usr/bin/dockerd (code=exited, status=1/FAILURE)
 Main PID: 2701 (code=exited, status=1/FAILURE)

Feb 15 03:17:57 localhost dockerd[2701]: time="2021-02-15T03:17:57.862400156Z" level=warning msg="mountpoint for pids not found"
Feb 15 03:17:57 localhost dockerd[2701]: time="2021-02-15T03:17:57.862532666Z" level=info msg="Loading containers: start."
Feb 15 03:17:57 localhost dockerd[2701]: ..time="2021-02-15T03:17:57.866933410Z" level=info msg="Firewalld running: false"
Feb 15 03:17:57 localhost dockerd[2701]: time="2021-02-15T03:17:57.957136073Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172....P address"
Feb 15 03:17:57 localhost dockerd[2701]: time="2021-02-15T03:17:57.993371676Z" level=info msg="Loading containers: done."
Feb 15 03:17:58 localhost dockerd[2701]: time="2021-02-15T03:17:58.007806591Z" level=fatal msg="Error creating cluster component: error while loading TLS C...yet valid"
Feb 15 03:17:58 localhost systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Feb 15 03:17:58 localhost systemd[1]: Failed to start Docker Application Container Engine.
Feb 15 03:17:58 localhost systemd[1]: Unit docker.service entered failed state.
Feb 15 03:17:58 localhost systemd[1]: docker.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

In the output above, the line "Error creating cluster component: error while loading TLS C...yet valid" has been ellipsized (truncated), so the full message is not visible.

  • 4. Run systemctl status docker.service -l
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2021-02-15 03:17:58 UTC; 11min ago
     Docs: https://docs.docker.com
  Process: 2701 ExecStart=/usr/bin/dockerd (code=exited, status=1/FAILURE)
 Main PID: 2701 (code=exited, status=1/FAILURE)

Feb 15 03:17:57 localhost dockerd[2701]: time="2021-02-15T03:17:57.862400156Z" level=warning msg="mountpoint for pids not found"
Feb 15 03:17:57 localhost dockerd[2701]: time="2021-02-15T03:17:57.862532666Z" level=info msg="Loading containers: start."
Feb 15 03:17:57 localhost dockerd[2701]: ..time="2021-02-15T03:17:57.866933410Z" level=info msg="Firewalld running: false"
Feb 15 03:17:57 localhost dockerd[2701]: time="2021-02-15T03:17:57.957136073Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Feb 15 03:17:57 localhost dockerd[2701]: time="2021-02-15T03:17:57.993371676Z" level=info msg="Loading containers: done."
Feb 15 03:17:58 localhost dockerd[2701]: time="2021-02-15T03:17:58.007806591Z" level=fatal msg="Error creating cluster component: error while loading TLS Certificate in /var/lib/docker/swarm/certificates/swarm-node.crt: x509: certificate has expired or is not yet valid"
Feb 15 03:17:58 localhost systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Feb 15 03:17:58 localhost systemd[1]: Failed to start Docker Application Container Engine.
Feb 15 03:17:58 localhost systemd[1]: Unit docker.service entered failed state.
Feb 15 03:17:58 localhost systemd[1]: docker.service failed.

Now the full message is visible: Error creating cluster component: error while loading TLS Certificate in /var/lib/docker/swarm/certificates/swarm-node.crt: x509: certificate has expired or is not yet valid

  • 5. Run sudo dockerd --debug. It prints a lot of output, including debug (DEBU) and INFO lines; the last line is a FATA message.
...
DEBU[0001] /sbin/iptables, [--wait -I DOCKER-ISOLATION -i docker0 -o docker_gwbridge -j DROP]
DEBU[0001] /sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION -i docker_gwbridge -o docker0 -j DROP]
DEBU[0001] /sbin/iptables, [--wait -I DOCKER-ISOLATION -i docker_gwbridge -o docker0 -j DROP]
DEBU[0001] successfully loaded the Root CA: /var/lib/docker/swarm/certificates/swarm-root-ca.crt
FATA[0001] Error creating cluster component: error while loading TLS Certificate in /var/lib/docker/swarm/certificates/swarm-node.crt: x509: certificate has expired or is not yet valid
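
Only the final fatal line matters in the output above, so one way to skip the noise is to filter the daemon log for it. This is a minimal sketch, assuming the daemon logs to the systemd journal under the docker.service unit as shown in steps 3 and 4; the helper name fatal_lines is made up for illustration.

```shell
#!/bin/sh
# fatal_lines: hypothetical helper that keeps only dockerd's fatal messages.
# It reads log text on stdin and matches both formats seen above:
#   journal format:  ... level=fatal msg="..."
#   console format:  FATA[0001] ...
fatal_lines() {
    grep -E 'level=fatal|^FATA\['
}

# Typical use on the guest (needs systemd and a docker.service unit):
#   journalctl -u docker.service --no-pager | fatal_lines
```

Piping the journal through a filter like this avoids both the ellipsized status output and the long --debug dump.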

II. Fix

  • 1. Run cd /var/lib/docker/swarm/certificates/
  • 2. Run ls -lsa
drwxr-xr-x. 2 root root 4096 Feb 10 10:07 .
drwx------. 5 root root   90 Feb 11 02:40 ..
-rw-r--r--. 1 root root 1385 Nov 11 05:18 swarm-node.crt
-rw-------. 1 root root  227 Feb 10 10:07 .swarm-node.key
-rw-------. 1 root root  227 Nov 11 05:18 swarm-node.key
-rw-r--r--. 1 root root  595 Nov 11 05:18 swarm-root-ca.crt
-rw-------. 1 root root  227 Nov 11 05:18 swarm-root-ca.key
  • 3. Run mv .swarm-node.key bak.key
  • 4. Run mv swarm-node.crt swarm-node.bak
  • 5. Run systemctl start docker. It returns straight to the prompt, with no error!
  • 6. Run systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2021-02-15 03:41:29 UTC; 10min ago
     Docs: https://docs.docker.com
 Main PID: 3240 (dockerd)
   Memory: 19.8M
  • 7. Change into the directory: cd /var/lib/docker/swarm/certificates/
bash: cd: /var/lib/docker/swarm/certificates/: No such file or directory
  • 8. Change into the parent directory: cd /var/lib/docker/swarm/ && ls -las
0 drwx------   2 root root    6 Feb 15 03:41 .
4 drwx--x--x. 10 root root 4096 Feb 15 03:41 ..

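In short: when this happens, removing /var/lib/docker/swarm/certificates/ is enough to get docker running again. In practice, moving swarm-node.crt and .swarm-node.key aside (steps 3 and 4) did the job; docker then removed the directory on its own at startup (steps 7 and 8), so the swarm directory ends up empty and any swarm setup on this node presumably has to be redone.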
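
Steps 7 and 8 show that files renamed inside the certificates directory do not survive the next daemon start, so it may be worth copying the directory elsewhere before touching it. The following is a minimal sketch of steps 3 and 4 with that extra copy; the function name sideline_swarm_certs and the default backup location /root/swarm-cert-backup are assumptions for illustration, not anything docker provides.

```shell
#!/bin/sh
# sideline_swarm_certs: hypothetical helper mirroring the manual mv steps.
#   $1  certificates directory (default: the path from the error message)
#   $2  backup root OUTSIDE /var/lib/docker/swarm (assumed location)
sideline_swarm_certs() {
    cert_dir=${1:-/var/lib/docker/swarm/certificates}
    backup_root=${2:-/root/swarm-cert-backup}
    stamp=$(date +%Y%m%d%H%M%S)

    # Keep a full copy first, including the hidden .swarm-node.key.
    mkdir -p "$backup_root" || return 1
    cp -Rp "$cert_dir" "$backup_root/certificates-$stamp" || return 1

    # Same renames as steps 3 and 4; skip files that are not there.
    if [ -e "$cert_dir/.swarm-node.key" ]; then
        mv "$cert_dir/.swarm-node.key" "$cert_dir/bak.key" || return 1
    fi
    if [ -e "$cert_dir/swarm-node.crt" ]; then
        mv "$cert_dir/swarm-node.crt" "$cert_dir/swarm-node.bak" || return 1
    fi
}

# Typical use on the guest, as root, while the daemon is stopped:
#   sideline_swarm_certs && systemctl start docker
```

The root CA files are left in place, as in the manual steps; only the node certificate and the hidden key are moved.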

Possible causes:
1. The host machine lost power and rebooted;
2. The host machine moved to a different IP subnet.
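
The error text itself ("certificate has expired or is not yet valid") is about the certificate's validity window relative to the system clock, so it may be worth checking both before settling on a cause. The listing in section II shows swarm-node.crt dated Nov 11, roughly three months before the Feb 15 failure, and swarm node certificates are issued with a 90-day lifetime by default, so plain expiry is also a candidate. A minimal sketch, assuming openssl is installed on the CentOS guest; cert_status is a hypothetical helper name, and it has to run before the certificate is moved or deleted.

```shell
#!/bin/sh
# cert_status: hypothetical helper that prints a certificate's validity
# window next to the current time, then reports whether it has expired.
# Returns 0 if not expired, 1 if expired, 2 if the file cannot be read.
cert_status() {
    cert=$1
    [ -r "$cert" ] || { echo "cannot read $cert" >&2; return 2; }

    # Prints the notBefore= and notAfter= dates stored in the certificate.
    openssl x509 -noout -dates -in "$cert" || return 2

    # Current system time in UTC, to compare by eye with the dates above
    # (this is how a "not yet valid" case, i.e. a wrong clock, would show up).
    date -u

    # -checkend 0 exits non-zero when notAfter is already in the past.
    if openssl x509 -noout -checkend 0 -in "$cert" >/dev/null; then
        echo "not expired"
    else
        echo "expired"
        return 1
    fi
}

# Typical use on the guest, as root:
#   cert_status /var/lib/docker/swarm/certificates/swarm-node.crt
```

If the dates look fine but the clock printed by date -u is far off, the clock rather than the certificate would be the thing to fix.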