The original article digs deep into the implementation of Docker Swarm networking and covers details that the official documentation does not, which makes it very helpful for learning and understanding Docker networking.

Deployment

First, deploy a docker swarm cluster with two nodes, named node 1 and node 2. Creating the swarm itself is not covered in detail here, but a minimal sketch looks like the following (the manager address and join token are placeholders, not values captured from this setup):
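
$ docker swarm init --advertise-addr <manager-ip>             # on node 1, which becomes the manager
$ docker swarm join --token <worker-token> <manager-ip>:2377  # on node 2, joining as a worker

Next, create one overlay network and three services, each with a single replica: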

docker network create --opt encrypted --subnet 101.0.0.0/24 -d overlay net1

docker service create --name redis --network net1 redis
docker service create --name node --network net1 nvbeta/node
docker service create --name nginx --network net1 -p 1080:80 nvbeta/swarm_nginx

The commands above create a typical three-tier application. nginx is the front-end load balancer that distributes incoming user traffic to the node service; node is a web service that queries redis and returns the result to the user through nginx. For simplicity, only one node replica is created.

Here is the logical view of the application, redrawn as a simple sketch from the description above:
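
  user ---> nginx (:1080, load balancer) ---> node (web service) ---> redis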

Networks

Let's look at the networks that have already been created in the docker swarm:

$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
cac91f9c60ff        bridge              bridge              local
b55339bbfab9        docker_gwbridge     bridge              local
fe6ef5d2e8e8        host                host                local
f1nvcluv1xnf        ingress             overlay             swarm
8vty8k3pejm5        net1                overlay             swarm
893a1bbe3118        none                null                local

net1

The overlay network we just created; it carries east-west traffic between containers.

docker_gwbridge

A bridge network created by Docker; it lets containers communicate with their host.

ingress

An overlay network created by Docker; in a Docker swarm, services are exposed to the outside world through this network, and it also implements the routing mesh.
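
As a quick sanity check of these entries, a Go template with docker network inspect can pull out a network's driver and subnet; for net1 this should echo the values used at creation (a sketch, output trimmed):

$ docker network inspect -f '{{.Driver}} {{(index .IPAM.Config 0).Subnet}}' net1
overlay 101.0.0.0/24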

The net1 network

Each service was created with the "--network net1" option, so every container instance must have one interface attached to net1. Let's look at node 1, where two containers are deployed:

$ docker ps
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS              PORTS               NAMES
eb03383913fb        nvbeta/node:latest   "nodemon /src/index.j"   2 hours ago         Up 2 hours          8888/tcp            node.1.10yscmxtoymkvs3bdd4z678w4
434ce2679482        redis:latest         "docker-entrypoint.sh"   2 hours ago         Up 2 hours          6379/tcp            redis.1.1a2l4qmvg887xjpfklp4d6n7y

By creating a symlink to Docker's network namespace directory, we can list all the network namespaces on the node 1 host:

$ cd /var/run
$ sudo ln -s /var/run/docker/netns netns
$ sudo ip netns
be663feced43
6e9de12ede80
2-8vty8k3pej
1-f1nvcluv1x
72df0265d4af
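
The symlink is only needed so that the ip netns command can see Docker's namespaces; listing the directory directly works just as well:

$ sudo ls /var/run/docker/netns
1-f1nvcluv1x  2-8vty8k3pej  6e9de12ede80  72df0265d4af  be663feced43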

Comparing the namespace names with the network IDs in the swarm, we can guess that net1 lives in the 2-8vty8k3pej namespace, since net1's network ID is 8vty8k3pej. This can be confirmed by comparing the interfaces inside that namespace with the interfaces inside the containers.

Interfaces in the containers:

$ docker exec node.1.10yscmxtoymkvs3bdd4z678w4 ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11040: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:65:00:00:03 brd ff:ff:ff:ff:ff:ff
11042: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:12:00:04 brd ff:ff:ff:ff:ff:ff
$ docker exec redis.1.1a2l4qmvg887xjpfklp4d6n7y ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11036: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:65:00:00:08 brd ff:ff:ff:ff:ff:ff
11038: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff

Interfaces in the 2-8vty8k3pej namespace:

$ sudo ip netns exec 2-8vty8k3pej ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
    link/ether 22:37:32:66:b0:48 brd ff:ff:ff:ff:ff:ff
11035: vxlan1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default
    link/ether 2a:30:95:63:af:75 brd ff:ff:ff:ff:ff:ff
11037: veth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default
    link/ether da:84:44:5c:91:ce brd ff:ff:ff:ff:ff:ff
11041: veth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default
    link/ether 8a:f9:bf:c1:ec:09 brd ff:ff:ff:ff:ff:ff

Note br0: it is a Linux bridge device, and all the other interfaces, including vxlan1, veth2, and veth3, are attached to it. vxlan1 is a VXLAN Tunnel End Point (VTEP), a Linux virtual network device enslaved to br0 that implements the VXLAN encapsulation.
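
To confirm that vxlan1 really is a VTEP, ask ip for the type-specific attributes with -d; the VNI and flags below are illustrative, since the exact values depend on the overlay (a sketch):

$ sudo ip netns exec 2-8vty8k3pej ip -d link show vxlan1
11035: vxlan1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default
    link/ether 2a:30:95:63:af:75 brd ff:ff:ff:ff:ff:ff
    vxlan id 257 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300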

veth2 and veth3 are veth devices, which always come in pairs: one end lives in this namespace and the other inside a container, and the interface index of the container end is always exactly one less than the index of the namespace end. So veth2 (11037) in the namespace pairs with eth0 (11036) in the redis container, and veth3 (11041) pairs with eth0 (11040) in the node container.
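
A more direct way to verify a pairing, instead of relying on the index heuristic, is to read the peer's ifindex from sysfs (a sketch):

$ sudo ip netns exec 2-8vty8k3pej cat /sys/class/net/veth2/iflink
11036

11036 is exactly eth0 in the redis container, confirming the pair.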

At this point we have confirmed that the net1 network belongs to the 2-8vty8k3pej namespace. Based on the information gathered so far, the topology looks like this:

node 1

  +-----------+      +-----------+
  |  nodejs   |      |   redis   |
  |           |      |           |
  +--------+--+      +--------+--+
           |                  |
           |                  |
           |                  |
           |                  |
      +----+------------------+-------+ net1
       101.0.0.3          101.0.0.8
       101.0.0.4(vip)     101.0.0.2(vip)

The docker_gwbridge network

Now compare the interfaces in the redis and node containers with the interfaces on the node 1 host. The host interfaces are:

$ ip link
...
4: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:24:f1:af:e8 brd ff:ff:ff:ff:ff:ff
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 02:42:e4:56:7e:9a brd ff:ff:ff:ff:ff:ff
11039: veth97d586b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
    link/ether 02:6b:d4:fc:8a:8a brd ff:ff:ff:ff:ff:ff
11043: vethefdaa0d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
    link/ether 0a:d5:ac:22:e7:5c brd ff:ff:ff:ff:ff:ff
10876: vethceaaebe: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
    link/ether 3a:77:3d:cc:1b:45 brd ff:ff:ff:ff:ff:ff
...

As you can see, three veth devices are attached to docker_gwbridge, with indexes 11039, 11043, and 10876. This can be confirmed with the following command:

$ brctl show
bridge name            bridge id                      STP enabled            interfaces
docker0                8000.0242e4567e9a              no
docker_gwbridge        8000.024224f1afe8              no                     veth97d586b
                                                                             vethceaaebe
                                                                             vethefdaa0d
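
On hosts without the bridge-utils package, the same port-to-bridge mapping is available from iproute2:

$ bridge link show

Each line of its output lists a bridge port together with its master device and should match the brctl output above.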

Applying the veth pairing rule we derived for net1, veth 11039 pairs with eth1 (11038) in the redis container, and 11043 pairs with eth1 (11042) in the node container. The topology now looks like this:

node 1

  172.18.0.4         172.18.0.3
 +----+------------------+----------------+ docker_gwbridge
      |                  |
      |                  |
      |                  |
      |                  |
   +--+--------+      +--+--------+
   |  nodejs   |      |   redis   |
   |           |      |           |
   +--------+--+      +--------+--+
            |                  |
            |                  |
            |                  |
            |                  |
       +----+------------------+----------+ net1
        101.0.0.3          101.0.0.8
        101.0.0.4(vip)     101.0.0.2(vip)

docker_gwbridge plays a role similar to the default docker0 bridge network (which may be named bridge, depending on the Docker version) in standalone Docker. There is a difference, though: docker0 also provides external connectivity, while docker_gwbridge does not; it only handles communication between containers on the same host. When a container is exposed to the outside world with the -p option, that traffic is handled by the other network, called ingress.
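
Which endpoints sit on docker_gwbridge can also be listed with a Go template (a sketch; the gateway_ingress-sbox endpoint that shows up here is explained in the next section):

$ docker network inspect -f '{{range .Containers}}{{println .Name .IPv4Address}}{{end}}' docker_gwbridge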

The ingress network

Let's list the network namespaces on the node 1 host and the networks in the swarm once more:

$ sudo ip netns
be663feced43
6e9de12ede80
2-8vty8k3pej
1-f1nvcluv1x
72df0265d4af

$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
cac91f9c60ff        bridge              bridge              local
b55339bbfab9        docker_gwbridge     bridge              local
fe6ef5d2e8e8        host                host                local
f1nvcluv1xnf        ingress             overlay             swarm
8vty8k3pejm5        net1                overlay             swarm
893a1bbe3118        none                null                local

Clearly, the ingress network belongs to the 1-f1nvcluv1x namespace. But what is the 72df0265d4af namespace for? Let's start with its interfaces:

$ sudo ip netns exec 72df0265d4af ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
10873: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:ff:00:03 brd ff:ff:ff:ff:ff:ff
    inet 10.255.0.3/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:aff:feff:3/64 scope link
       valid_lft forever preferred_lft forever
10875: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.2/16 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe12:2/64 scope link
       valid_lft forever preferred_lft forever

eth1 (10875) pairs with vethceaaebe (10876) on the host, and eth0 (10873) attaches to the ingress network. Both points can be corroborated by inspecting ingress and docker_gwbridge:

$ docker network inspect ingress
[
    {
        "Name": "ingress",
        "Id": "f1nvcluv1xnfa0t2lca52w69w",
        "Scope": "swarm",
        "Driver": "overlay",
        ....
        "Containers": {
            "ingress-sbox": {
                "Name": "ingress-endpoint",
                "EndpointID": "3d48dc8b3e960a595e52b256e565a3e71ea035bb5e77ae4d4d1c56cab50ee112",
                "MacAddress": "02:42:0a:ff:00:03",
                "IPv4Address": "10.255.0.3/16",
                "IPv6Address": ""
            }
        },
        ....
    }
]

$ docker network inspect docker_gwbridge
[
    {
        "Name": "docker_gwbridge",
        "Id": "b55339bbfab9bdad4ae51f116b028ad7188534cb05936bab973dceae8b78047d",
        "Scope": "local",
        "Driver": "bridge",
        ....
        "Containers": {
            ....
            "ingress-sbox": {
                "Name": "gateway_ingress-sbox",
                "EndpointID": "0b961253ec65349977daa3f84f079ec5e386fa0ae2e6dd80176513e7d4a8b2c3",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            }
        },
        ....
    }
]

The MAC/IP of the endpoints in the output above match the MAC/IP in the 72df0265d4af namespace. So the 72df0265d4af namespace belongs to the hidden sandbox "ingress-sbox", which never appears in docker ps; it has two interfaces, one attached to the host's docker_gwbridge and the other attached to the ingress network.
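
On reasonably recent Docker versions (an assumption about your version), docker network inspect also accepts a --verbose flag that additionally prints the swarm load-balancing state of the network, which is another way to surface these hidden endpoints:

$ docker network inspect --verbose ingress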

One of Docker Swarm's features is the "routing mesh": for a container that publishes a port, no matter which node it actually runs on, you can reach it through any node in the cluster. How is that done? Let's keep digging.

In our application, only the nginx service publishes a port, mapping its port 80 to port 1080 on the host, and nginx is not running on node 1.
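
Which node the nginx task actually landed on can be checked from a manager node (a sketch; the ID is a placeholder and the NODE column will differ per cluster):

$ docker service ps nginx
ID            NAME     IMAGE                      NODE     DESIRED STATE  CURRENT STATE
<task-id>     nginx.1  nvbeta/swarm_nginx:latest  node2    Running        Running 2 hours ago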

Continuing on node 1:

$ sudo iptables -t nat -nvL
...
...
Chain DOCKER-INGRESS (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:1080 to:172.18.0.2:1080
 176K   11M RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0

$ sudo ip netns exec 72df0265d4af iptables -nvL -t nat
...
...
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    9   576 REDIRECT   tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:1080 redir ports 80
...
...
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DOCKER_POSTROUTING  all  --  *      *       0.0.0.0/0            127.0.0.11
   14   896 SNAT       all  --  *      *       0.0.0.0/0            10.255.0.0/16        ipvs to:10.255.0.3

As you can see, an iptables rule on the host DNATs any traffic arriving at port 1080 straight into the hidden sandbox 'ingress-sbox', and the POSTROUTING chain inside the sandbox then sources those packets from the IP address 10.255.0.3, whose interface is attached to the ingress network.

Note the 'ipvs' in the SNAT rule. IPVS is a transport-layer load balancer built into the Linux kernel:

$ sudo ip netns exec 72df0265d4af iptables -nvL -t mangle
Chain PREROUTING (policy ACCEPT 144 packets, 12119 bytes)
 pkts bytes target     prot opt in     out     source               destination
   87  5874 MARK       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:1080 MARK set 0x12c
...
...
Chain OUTPUT (policy ACCEPT 15 packets, 936 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 MARK       all  --  *      *       0.0.0.0/0            10.255.0.2           MARK set 0x12c
...
...

The iptables rules mark the flow with 0x12c (= 300 decimal), and ipvs is then configured to match on that mark:

$ sudo ip netns exec 72df0265d4af ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
FWM  300 rr
  -> 10.255.0.5:0                 Masq    1      0          0

On the other node, the nginx container is configured with the IP address 10.255.0.5, and it is the only backend this load balancer manages. Putting it all together, the complete picture of the network connections is sketched below:
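
A rough sketch of the full request path, assembled from the pieces above (the original article presents this as a figure):

 node 1                                        node 2

 host :1080
    |
    | DNAT (iptables, DOCKER-INGRESS chain)
    v
 +---------------------------+
 | ingress-sbox              |
 |  eth1 172.18.0.2          +--- docker_gwbridge
 |  eth0 10.255.0.3          |
 +-------------+-------------+
               | ipvs (FWM 300, round robin)
               v
        ingress overlay (vxlan) -------------> nginx 10.255.0.5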

Summary

A lot of cool things happen behind the scenes of Docker Swarm networking, and they make it easy to build applications that span multiple hosts, or even multiple clouds. Digging into the low-level details pays off when you need to locate and debug problems during development.