Container ping: container cannot ping the gateway

Reposted

blueice 2024-02-28 19:38:39


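This post is only a study note for non-commercial use and will be taken down on request in case of infringement. Reposting requires the author's consent.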

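This post looks at how to debug a container whose network connection is broken.

I. Reproducing the problem

Pinging a public address from inside the container fails;
the same ping works on the host.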

# docker run -d --name if-test centos:8.1.1911 sleep 36000
244d44f94dc2931626194c6fd3f99cec7b7c4bf61aafc6c702551e2c5ca2a371
# docker exec -it if-test bash
 
[root@244d44f94dc2 /]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
808: eth0@if809: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
 
[root@244d44f94dc2 /]# ping 39.106.233.176       ### ping fails from inside the container
PING 39.106.233.176 (39.106.233.176) 56(84) bytes of data.
^C
--- 39.106.233.176 ping statistics ---
9 packets transmitted, 0 received, 100% packet loss, time 185ms
 
[root@244d44f94dc2 /]# exit             ### leave the container
exit
 
# ping 39.106.233.176                        ### ping works on the host
PING 39.106.233.176 (39.106.233.176) 56(84) bytes of data.
64 bytes from 39.106.233.176: icmp_seq=1 ttl=78 time=296 ms
64 bytes from 39.106.233.176: icmp_seq=2 ttl=78 time=306 ms
64 bytes from 39.106.233.176: icmp_seq=3 ttl=78 time=303 ms
^C
--- 39.106.233.176 ping statistics ---
4 packets transmitted, 3 received, 25% packet loss, time 7ms
rtt min/avg/max/mdev = 296.059/301.449/305.580/4.037 ms

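II. Basic concepts

[Figure]

As seen above, both the container and the host have an eth0.

The container's eth0 is a network interface inside the container's Network Namespace;

the host's eth0 corresponds to the real physical NIC, which can talk to the outside world.

Getting a packet from the container's Network Namespace out through the physical NIC roughly takes two steps:

1. Send the packet from the container's Network Namespace into the Host Network Namespace;
2. Send the packet out through the host's eth0.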

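Docker networking documentation; Kubernetes networking documentation

The official documentation describes several ways to configure container networking.

To connect a container's own Network Namespace to the Host Network Namespace, there are generally only two kinds of interfaces:

veth
macvlan/ipvlan

Containers started by Docker also use veth for their default network interface, so this approach is covered first.

The simulation below mainly uses "ip netns" to operate on Network Namespaces. If the if-test container from section I is still running, remove it first with "docker rm -f if-test", because container names must be unique.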

# docker run -d --name if-test --network none centos:8.1.1911 sleep 36000
cf3d3105b11512658a025f5b401a09c888ed3495205f31e0a0d78a2036729472
# docker exec -it if-test ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever

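As shown above, when the container is started with the "--network none" flag, its Network Namespace contains only the loopback device and no eth0.

Next, from the host, create a veth pair for this container with the following commands: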

# PID of the container's process 1, as seen from the host
pid=$(docker inspect -f '{{.State.Pid}}' if-test)
echo $pid
# /var/run/netns may not exist yet on a fresh host; create it before linking
mkdir -p /var/run/netns
ln -s /proc/$pid/ns/net /var/run/netns/$pid
 
# Create a pair of veth interfaces
ip link add name veth_host type veth peer name veth_container
# Move one end into the container's net ns
ip link set veth_container netns $pid
 
# In the container: rename veth_container to eth0, assign an IP, bring it up, add a default route
ip netns exec $pid ip link set veth_container name eth0
ip netns exec $pid ip addr add 172.17.1.2/16 dev eth0
ip netns exec $pid ip link set eth0 up
ip netns exec $pid ip route add default via 172.17.0.1
 
# On the host, bring veth_host up
ip link set veth_host up

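To explain how the veth is set up:
1. Find the pid of the container's process 1 as seen from the host. The file /proc/$pid/ns/net refers to that process's Network Namespace, which is also the container's Network Namespace.

2. Create a symbolic link under /var/run/netns/ pointing to the container's Network Namespace. After this step, the later "ip netns" commands can use the pid value directly as the name of the container's Network Namespace.

3. Use "ip link" to create a pair of virtual veth interfaces, veth_container and veth_host:
veth_container will be moved into the container's Network Namespace;
veth_host stays in the host's Host Network Namespace.

4. "ip link set veth_container netns $pid" moves the veth_container interface into the container's Network Namespace.

5. "ip netns exec $pid ip link set veth_container name eth0" renames veth_container to eth0 inside the container. Because it is already in the container's Network Namespace, it does not clash with the host's eth0.

6. Give the container's eth0 a basic IP address and a default route:
ip netns exec $pid ip addr add 172.17.1.2/16 dev eth0
ip netns exec $pid ip link set eth0 up
ip netns exec $pid ip route add default via 172.17.0.1

7. "ip link set veth_host up": veth_host is already in the host's Host Network Namespace, so nothing else needs to be done; it only has to be brought up.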
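Steps 1 and 2 can be wrapped in a small helper that is safe to rerun. This is a sketch, not part of the original procedure: the function name link_netns is hypothetical, and the target directory is a parameter (defaulting to /var/run/netns, where "ip netns" looks for named namespaces) so it can be tried outside a real host. It runs "mkdir -p" first because that directory typically only exists after "ip netns add" has been used at least once.

```bash
# link_netns PID [RUNDIR]   (hypothetical helper)
# Exposes the network namespace of process PID to "ip netns" under the name PID
# by symlinking /proc/PID/ns/net into RUNDIR (default: /var/run/netns).
link_netns() {
  local pid="$1" rundir="${2:-/var/run/netns}"
  # The directory may not exist yet on a fresh host
  mkdir -p "$rundir" || return 1
  # -f replaces a stale link left by an earlier run; -n avoids following it
  ln -sfn "/proc/$pid/ns/net" "$rundir/$pid"
}
```

On the host, as root, running link_netns "$pid" and then ip netns exec "$pid" ip addr should list the container's interfaces.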

After these steps, a pair of veth virtual interfaces is in place, as illustrated below:

[Figure: the veth pair created above]

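veth is a virtual network device. veth devices are normally created in pairs, and the two devices of a pair are connected to each other. When the two devices sit in different Network Namespaces, the namespaces can communicate through that veth pair.

For example, after adding the IP 172.17.1.1/16 to veth_host, that IP can be pinged from inside the container,
which shows that the container and the host can indeed communicate over this veth pair: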

# ip addr add 172.17.1.1/16 dev veth_host
# docker exec -it if-test ping 172.17.1.1
PING 172.17.1.1 (172.17.1.1) 56(84) bytes of data.
64 bytes from 172.17.1.1: icmp_seq=1 ttl=64 time=0.073 ms
64 bytes from 172.17.1.1: icmp_seq=2 ttl=64 time=0.092 ms
^C
--- 172.17.1.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 30ms
rtt min/avg/max/mdev = 0.073/0.082/0.092/0.013 ms

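This completes the first step: a veth pair carries packets from the container's Network Namespace into the host's Host Network Namespace.

Second step: once packets reach the Host Network Namespace, how do they get sent out through the host's eth0?
This is the ordinary packet-forwarding problem on a Linux node. Possible solutions include forwarding with NAT, sending over an overlay network, or configuring proxy ARP together with routes.

Docker uses bridge + NAT forwarding by default, which is explained below.
For the other configurations, consult the Docker or Kubernetes documentation.

After Docker is installed on a node, it automatically creates a bridge interface named docker0.
We only need to attach the veth_host device created in step one to the docker0 bridge.

If you assigned an IP to veth_host earlier, first run "ip addr delete 172.17.1.1/16 dev veth_host" to remove it from veth_host.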
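A practical consequence of veth ends being paired: iproute2 shows a veth whose peer lives in another namespace as name@ifN, where N is the peer's interface index, as in "808: eth0@if809" inside the container in section I. The helper below (hypothetical name veth_peer_index) pulls that index out of one line of "ip link" or "ip addr" output. It is a sketch that only handles the @ifN form.

```bash
# veth_peer_index LINE   (hypothetical helper)
# LINE is one line of "ip link" / "ip addr" output, e.g.
#   808: eth0@if809: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
# Prints the peer's interface index (809 here); returns 1 if LINE has no @ifN.
veth_peer_index() {
  local name
  name=${1#*: }      # drop the leading "808: "
  name=${name%%:*}   # keep only "eth0@if809"
  case $name in
    *@if*) printf '%s\n' "${name##*@if}" ;;
    *)     return 1 ;;
  esac
}
```

With the index in hand, running ip -o link | grep '^809:' on the host shows which host-side veth belongs to that container.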

# ip addr delete 172.17.1.1/16 dev veth_host
# ip link set veth_host master docker0

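Once veth_host is attached to the docker0 bridge, the topology looks like this:

[Figure: veth_host attached to the docker0 bridge]

The container and docker0 now form one subnet, and the IP on docker0 is that subnet's gateway IP (by default 172.17.0.1, the address used for the container's default route in step 6).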
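Whether docker0's IP can act as the gateway comes down to subnet membership: step 6 gave the container 172.17.1.2/16 with a default route via 172.17.0.1, and the MASQUERADE rule in the nat table (listed in section III) matches source 172.17.0.0/16. The sketch below checks membership with plain shell arithmetic. It assumes docker0 uses Docker's default 172.17.0.1/16, which can be confirmed with "ip addr show docker0"; the function names are hypothetical.

```bash
# ip_to_int A.B.C.D -> prints the address as a 32-bit integer (hypothetical helper)
ip_to_int() {
  local old_ifs="$IFS"
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_subnet IP NET/LEN -> prints "yes" if IP lies inside NET/LEN, else "no"
in_subnet() {
  local ip="$1" net="${2%/*}" len="${2#*/}" mask
  if [ "$len" -eq 0 ]; then
    mask=0
  else
    # Prefix length -> netmask, e.g. 16 -> 0xFFFF0000
    mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  fi
  if [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]; then
    echo yes
  else
    echo no
  fi
}
```

For example, in_subnet 172.17.1.2 172.17.0.0/16 prints yes, so the container's address is on-link with the 172.17.0.1 gateway and matches the MASQUERADE rule's source; with a /24 prefix the same check prints no.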

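To let this subnet reach external networks through the host's eth0, iptables has to allow it. Docker has already installed a MASQUERADE rule for source 172.17.0.0/16 in the nat table (see the listing in section III), so what remains is to let the FORWARD chain accept forwarded packets: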

iptables -P FORWARD ACCEPT

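A manual test, however, shows that the container still cannot ping external addresses.

III. Solving the problem

With the overall path understood, the next task is to find the stage at which the packets get lost.

Troubleshooting approach:
Keep pinging the external IP from the container, and run tcpdump at every point along the path: the container's eth0 (veth_container), veth_host outside the container, docker0, and the host's eth0.

The tcpdump results are as follows:

Container eth0: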

# ip netns exec $pid tcpdump -i eth0 host 39.106.233.176 -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
00:47:29.934294 IP 172.17.1.2 > 39.106.233.176: ICMP echo request, id 71, seq 1, length 64
00:47:30.934766 IP 172.17.1.2 > 39.106.233.176: ICMP echo request, id 71, seq 2, length 64
00:47:31.958875 IP 172.17.1.2 > 39.106.233.176: ICMP echo request, id 71, seq 3, length 64

veth_host:

# tcpdump -i veth_host host 39.106.233.176 -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on veth_host, link-type EN10MB (Ethernet), capture size 262144 bytes
00:48:01.654720 IP 172.17.1.2 > 39.106.233.176: ICMP echo request, id 71, seq 32, length 64
00:48:02.678752 IP 172.17.1.2 > 39.106.233.176: ICMP echo request, id 71, seq 33, length 64
00:48:03.702827 IP 172.17.1.2 > 39.106.233.176: ICMP echo request, id 71, seq 34, length 64

docker0:

# tcpdump -i docker0 host 39.106.233.176 -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on docker0, link-type EN10MB (Ethernet), capture size 262144 bytes
00:48:20.086841 IP 172.17.1.2 > 39.106.233.176: ICMP echo request, id 71, seq 50, length 64
00:48:21.110765 IP 172.17.1.2 > 39.106.233.176: ICMP echo request, id 71, seq 51, length 64
00:48:22.134839 IP 172.17.1.2 > 39.106.233.176: ICMP echo request, id 71, seq 52, length 64

Host eth0:

# tcpdump -i eth0 host 39.106.233.176 -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

These results show that
the ICMP packets reach docker0 but never make it to the host's eth0.
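The reasoning used here (capture at every point on the path, then look for the first point where the echo requests vanish) can be written as a tiny helper. This is a hypothetical sketch: the function name and the iface:count argument format are made up for illustration, the counts are read off the tcpdump runs by hand, and container-eth0 / host-eth0 are just labels that tell the two eth0 interfaces apart.

```bash
# first_silent_hop IFACE:COUNT ...   (hypothetical helper)
# Arguments list the capture points along the path, in order, each with the
# number of ICMP echo requests tcpdump showed there. Prints the segment where
# the packets disappear, or reports that every hop saw them.
first_silent_hop() {
  local prev="" hop iface count
  for hop in "$@"; do
    iface=${hop%%:*}
    count=${hop##*:}
    if [ "$count" -eq 0 ]; then
      echo "packets stop between ${prev:-the sender} and $iface"
      return 0
    fi
    prev=$iface
  done
  echo "packets seen on every hop"
}
```

Feeding it the captures above, first_silent_hop container-eth0:3 veth_host:3 docker0:3 host-eth0:0 prints "packets stop between docker0 and host-eth0", which is where to look next.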

The iptables rules look normal:

# iptables -L  -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL
 
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
 
Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  172.17.0.0/16        anywhere
 
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  anywhere            !127.0.0.0/8          ADDRTYPE match dst-type LOCAL
 
Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere

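Forwarding is needed here between two network interfaces, from docker0 to eth0.
That points to ip_forward, a commonly used parameter of the Linux network stack. It turns out to be 0 (disabled); after setting it to 1, the ping succeeds: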

# cat /proc/sys/net/ipv4/ip_forward
0
# echo 1 > /proc/sys/net/ipv4/ip_forward
 
# docker exec -it if-test ping 39.106.233.176
PING 39.106.233.176 (39.106.233.176) 56(84) bytes of data.
64 bytes from 39.106.233.176: icmp_seq=1 ttl=77 time=359 ms
64 bytes from 39.106.233.176: icmp_seq=2 ttl=77 time=346 ms
^C
--- 39.106.233.176 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1ms
rtt min/avg/max/mdev = 345.889/352.482/359.075/6.593 ms
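Since the root cause turned out to be ip_forward = 0, it is worth checking this switch early whenever the same symptom appears. A minimal sketch: the function name is hypothetical, and the file is a parameter only so the function can be tried against a test file. Note that both the echo into /proc used above and "sysctl -w" last only until the next reboot; to make the setting persistent it is usually put in /etc/sysctl.conf or a file under /etc/sysctl.d/.

```bash
# check_ip_forward [FILE]   (hypothetical helper)
# Reads the IPv4 forwarding switch, by default /proc/sys/net/ipv4/ip_forward.
# Prints "enabled" and returns 0 when it is 1; otherwise prints a hint and
# returns 1 (or 2 if the file cannot be read).
check_ip_forward() {
  local file="${1:-/proc/sys/net/ipv4/ip_forward}" value
  value=$(cat "$file") || return 2
  if [ "$value" = 1 ]; then
    echo enabled
  else
    echo "disabled: enable with 'sysctl -w net.ipv4.ip_forward=1' (as root)"
    return 1
  fi
}
```

On the host from this article, check_ip_forward would have reported "disabled" before the fix and "enabled" after it.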

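IV. Key takeaways

Getting a container to communicate with the outside world involves two things:

1. Forwarding packets from the container's Network Namespace to the host's Host Network Namespace;
2. Once packets are in the Host Network Namespace, sending them out through the host's eth0.

A veth pair can carry data from the container Network Namespace to the Host Network Namespace.
Bridge + NAT lets packets leave through the host's eth0.

To troubleshoot broken connectivity, run tcpdump at each point along the path.

Then check the kernel network parameters, the routing table, and the firewall rules; together these usually pinpoint the root cause.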

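V. Comments

1.
Question:
1. docker0 is connected to the veths. Can docker0 be thought of as a switch, so that all NICs attached to docker0 can communicate at Layer 2?

2. Why does connectivity work once the veth is attached to docker0 and forwarding is enabled? Could you explain the mechanism? Is it that packets reaching docker0 pass through the POSTROUTING chain, and after local routing have to leave through the FORWARD chain, so the FORWARD policy must be ACCEPT and forwarding must be enabled?

Answer:
1. Yes, you can think of docker0 as an L2 switch.
2. ip_forward turns on Linux's router-like behavior: it allows a packet that enters on one interface to leave through another interface according to the routing table. This is unrelated to the iptables POSTROUTING/FORWARD chains you mention.

2.
Answer:
127.0.0.1 is the localhost IP, and every namespace has its own. Accessing 127.0.0.1 from the host network namespace reaches only the host network namespace's own 127.0.0.1, never the 127.0.0.1 in a container's network namespace.

3.
Question:
[root@tyy-node06 ~]# ln -s /proc/$pid/ns/net /var/run/netns/$pid
ln: failed to create symbolic link '/var/run/netns/24847': No such file or directory
The /var/run/netns/ directory does not exist. Does it have to be created manually?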