For the RoCE recommended configuration and verification steps, refer to the RoCE recommended configuration article.
This post provides guidelines for debugging a RoCE network and tuning RoCE performance. The following flowchart describes the RoCE troubleshooting process.
Information on how to run the tests listed in the flowchart below can be found in the subsequent sections.
Figure 1: RoCE Debug Flow Chart
Test #1: Check RDMA Connectivity using ibv_rc_pingpong
This test verifies that RoCE traffic can be sent between the client and the server sides. This test does not require rdma-cm to be enabled.
To check the RDMA connectivity, follow the steps below.
On the server side
- Find the server’s ibdev(s) using:
- the rdma command, if you are working with the upstream driver.
The output is a list of the server’s InfiniBand devices and their matching netdevs.
# rdma link
1/1: mlx5_0/1: state ACTIVE physical_state LINK_UP netdev enp17s0f0
2/1: mlx5_1/1: state ACTIVE physical_state LINK_UP netdev enp17s0f1
3/1: mlx5_2/1: state ACTIVE physical_state LINK_UP netdev enp134s0f0
4/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev enp134s0f1
OR:
- the ibdev2netdev command, if you are working with OFED.
The output is a list of the server’s InfiniBand devices and their matching netdevs.
# ibdev2netdev
mlx5_0 port 1 ==> enp17s0f0 (Up)
mlx5_1 port 1 ==> enp17s0f1 (Up)
mlx5_2 port 1 ==> enp134s0f0 (Up)
mlx5_3 port 1 ==> enp134s0f1 (Up)
- Find the netdev’s IP address. Select an InfiniBand device from the previous step to be tested, and find the matching netdev’s IP address.
Note: In the examples that follow, the device used is mlx5_1 (netdev enp17s0f1), obtained from the previous step.
# ip address show dev enp17s0f1
12: enp17s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ec:0d:9a:ae:11:9d brd ff:ff:ff:ff:ff:ff
inet 12.7.156.240/8 brd 12.255.255.255 scope global enp17s0f1
valid_lft forever preferred_lft forever
inet6 fe80::ee0d:9aff:feae:119d/64 scope link
valid_lft forever preferred_lft forever
- Find the netdev’s GID.
On the same device, find the matching netdev’s GID using the show_gids command.
# show_gids mlx5_1
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_1 1 0 fe80:0000:0000:0000:ee0d:9aff:feae:119d v1 enp17s0f1
mlx5_1 1 1 fe80:0000:0000:0000:ee0d:9aff:feae:119d v2 enp17s0f1
mlx5_1 1 2 0000:0000:0000:0000:0000:ffff:0c07:9cf0 12.7.156.240 v1 enp17s0f1
mlx5_1 1 3 0000:0000:0000:0000:0000:ffff:0c07:9cf0 12.7.156.240 v2 enp17s0f1
n_gids_found=4
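In the table above, the GIDs at indexes 2 and 3 are simply the interface’s IPv4 address in IPv4-mapped IPv6 form. A minimal sketch of that mapping, using Python’s standard ipaddress module (the helper name is ours, for illustration only):

```python
import ipaddress

def ipv4_to_roce_gid(ipv4: str) -> str:
    """Map an IPv4 address to its IPv4-mapped IPv6 form, as shown by show_gids."""
    mapped = ipaddress.IPv6Address("::ffff:" + ipv4)
    return mapped.exploded  # full eight-group notation

print(ipv4_to_roce_gid("12.7.156.240"))
# 0000:0000:0000:0000:0000:ffff:0c07:9cf0
```

This reproduces the GID shown at indexes 2 and 3 for 12.7.156.240 (0x0c07:9cf0 is 12.7.156.240 in hex).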
- Run ibv_rc_pingpong as server to ensure connectivity is achieved.
Run the rc ping pong server using the RoCE v2 GID obtained in the previous step (index 3 in the table above). This is done with the ibv_rc_pingpong command.
# ibv_rc_pingpong -d mlx5_1 -g 3
local address: LID 0x0000, QPN 0x003968, PSN 0x3869d8, GID ::ffff:12.7.156.240
remote address: LID 0x0000, QPN 0x001960, PSN 0x39c9d6, GID ::ffff:12.7.156.239
8192000 bytes in 0.01 seconds = 12475.92 Mbit/sec
1000 iters in 0.01 seconds = 5.25 usec/iter
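The byte count in the report follows from the tool’s defaults: ibv_rc_pingpong sends 4096-byte messages for 1000 iterations, and traffic is counted in both directions. A quick sanity check of the arithmetic (a sketch; the throughput figure depends on the measured wall time, which is rounded in the output above):

```python
msg_size = 4096    # ibv_rc_pingpong default message size, in bytes
iters = 1000       # default iteration count
total_bytes = msg_size * iters * 2  # each iteration is a send plus a receive
print(total_bytes)  # 8192000, matching the report above

def mbit_per_sec(nbytes: int, seconds: float) -> float:
    """Throughput in Mbit/s for a given byte count and elapsed time."""
    return nbytes * 8 / seconds / 1e6
```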
On the client side
- Find the client’s ibdev(s) using:
- the rdma command, if you are working with the upstream driver.
The output is a list of the client’s InfiniBand devices and their matching netdevs.
# rdma link
1/1: mlx5_0/1: state ACTIVE physical_state LINK_DOWN netdev enp17s0f0
2/1: mlx5_1/1: state ACTIVE physical_state LINK_UP netdev enp17s0f1
3/1: mlx5_2/1: state ACTIVE physical_state LINK_DOWN netdev enp134s0f0
4/1: mlx5_3/1: state ACTIVE physical_state LINK_DOWN netdev enp134s0f1
OR
- the ibdev2netdev command, if you are working with OFED.
The output is a list of the client’s InfiniBand devices and their matching netdevs.
# ibdev2netdev
mlx5_0 port 1 ==> enp17s0f0 (Down)
mlx5_1 port 1 ==> enp17s0f1 (Up)
mlx5_2 port 1 ==> enp134s0f0 (Down)
mlx5_3 port 1 ==> enp134s0f1 (Down)
Note: In the examples that follow, the device used is mlx5_1 (netdev enp17s0f1).
- Find the client’s GID using the show_gids command.
# show_gids mlx5_1
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_1 1 0 fe80:0000:0000:0000:ee0d:9aff:feae:11e5 v1 enp17s0f1
mlx5_1 1 1 fe80:0000:0000:0000:ee0d:9aff:feae:11e5 v2 enp17s0f1
mlx5_1 1 2 0000:0000:0000:0000:0000:ffff:0c07:9cef 12.7.156.239 v1 enp17s0f1
mlx5_1 1 3 0000:0000:0000:0000:0000:ffff:0c07:9cef 12.7.156.239 v2 enp17s0f1
n_gids_found=4
- Run the rc ping pong client.
Run the rc ping pong client using the RoCE v2 GID obtained in the previous step (index 3 in the table above) and the server’s IP address. This is done with the ibv_rc_pingpong command.
# ibv_rc_pingpong -d mlx5_1 -g 3 12.7.156.240
local address: LID 0x0000, QPN 0x001960, PSN 0x39c9d6, GID ::ffff:12.7.156.239
remote address: LID 0x0000, QPN 0x003968, PSN 0x3869d8, GID ::ffff:12.7.156.240
8192000 bytes in 0.00 seconds = 14864.14 Mbit/sec
1000 iters in 0.00 seconds = 4.41 usec/iter
Results
Success criteria: the client reports a nonzero average bandwidth and the run completes without errors.
In case the test completed successfully but you still have no RDMA service, contact Mellanox support with the output of the sysinfo-snapshot tool, which can be downloaded from:
https://github.com/Mellanox/linux-sysinfo-snapshot
In case of failure, proceed to Test #2: Basic RDMA Check below.
Extra info
- More details on the ibv_rc_pingpong command can be found at:
https://linux.die.net/man/1/ibv_rc_pingpong
- More details on the show_gids command can be found at:
https://community.mellanox.com/s/article/understanding-show-gids-script
- More details on the ibdev2netdev command can be found at:
https://community.mellanox.com/s/article/ibdev2netdev
Test #2: Basic RDMA Check
This test verifies some basic preconditions for establishing RDMA traffic.
- Check that RoCE is enabled on both the server and the client sides.
# lspci -D | grep Mellanox
0000:11:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
0000:11:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
0000:86:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
0000:86:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
# cat /sys/bus/pci/devices/0000\:11\:00.1/roce_enable
1
If RoCE is disabled (roce_enable is set to 0), enable it:
- using the devlink command, if you are working with the upstream driver:
# devlink dev param set pci/0000:00:00.1 name enable_roce value 1 cmode runtime
OR
- by writing to sysfs, if you are working with OFED:
# echo 1 > /sys/bus/pci/devices/0000\:11\:00.1/roce_enable
- Perform an MTU check.
RoCE requires an MTU of at least 1024 bytes of net payload. The sub-steps below check for a larger MTU that also accommodates the additional headers, such as the IP header, VLAN tags, and tunneling headers.
The MTU must be guaranteed end-to-end, without segmentation and reassembly.
2.a. Check the MTU value on the server and the client sides, and verify that it is larger than 1250 bytes:
# ip address show dev enp17s0f1
12: enp17s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ec:0d:9a:ae:11:9d brd ff:ff:ff:ff:ff:ff
inet 12.7.156.240/8 brd 12.255.255.255 scope global enp17s0f1
valid_lft forever preferred_lft forever
inet6 fe80::ee0d:9aff:feae:119d/64 scope link
valid_lft forever preferred_lft forever
2.b. Perform an end-to-end MTU check. Ping the server:
# ping -f -c 100 -s 1250 -M do 12.7.156.240
PING 12.7.156.240 (12.7.156.240) 1250(1278) bytes of data.
--- 12.7.156.240 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.003/0.003/0.012/0.001 ms, ipg/ewma 0.008/0.003 ms
Success criteria: both tests above passed. If not, correct the MTU size.
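The 1278 bytes in the ping output is simply the payload plus headers, and the same kind of accounting lies behind the RoCE MTU requirement. A sketch of the arithmetic (the header sizes used are the standard ones; the exact RoCE overhead also depends on VLANs and tunneling, which add further bytes):

```python
# ping -s 1250: payload + ICMP header (8 bytes) + IPv4 header (20 bytes)
ping_packet = 1250 + 8 + 20
print(ping_packet)  # 1278, as reported by ping above

# RoCE v2 overhead counted within the IP MTU:
IP_HDR, UDP_HDR, BTH, ICRC = 20, 8, 12, 4
min_payload = 1024  # minimum RoCE net payload per the text above
min_mtu = min_payload + IP_HDR + UDP_HDR + BTH + ICRC
print(min_mtu)  # 1068 -> the path MTU must be at least this large
```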
3. Check the device info by running ibv_devinfo on the server:
# ibv_devinfo -d mlx5_1 -vvv
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 16.24.1000
…
GID[ 1]: fe80:0000:0000:0000:ee0d:9aff:feae:119d
GID[ 2]: 0000:0000:0000:0000:0000:ffff:0c07:9cf0
GID[ 3]: 0000:0000:0000:0000:0000:ffff:0c07:9cf0
Success criteria: command succeeded.
Results
If the configuration has been updated as a result of the test (such as an MTU change), the test completed successfully. In that case, check IP connectivity (Test #3).
If the issue still exists, re-do the steps in Test #1.
In case of failure (the command returned an error, hung, etc.), contact Mellanox support with the output of the sysinfo-snapshot tool, which can be downloaded from:
https://github.com/Mellanox/linux-sysinfo-snapshot
Extra info
More details on the ping command can be found at:
https://linux.die.net/man/8/ping
More details on the show_gids command can be found at:
https://community.mellanox.com/s/article/understanding-show-gids-script
More details on the ibv_devinfo command can be found at:
https://linux.die.net/man/1/ibv_devinfo
Test #3: Check IP Connectivity using Ping
This test verifies that IP traffic can be sent between the client and the server sides.
On the server side:
Find the server’s IP address by following the second step in Test #1 above.
On the client side:
Ping the server:
# ping -f -c 100 12.7.156.240
PING 12.7.156.240 (12.7.156.240) 56(84) bytes of data.
--- 12.7.156.240 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.002/0.003/0.015/0.002 ms, ipg/ewma 0.007/0.002 ms
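When scripting this check, the loss percentage can be pulled out of the summary line. A minimal sketch (the regex is written against the output format shown above):

```python
import re

def ping_loss_percent(summary_line: str) -> float:
    """Extract the packet-loss percentage from a ping statistics line."""
    m = re.search(r"([\d.]+)% packet loss", summary_line)
    if m is None:
        raise ValueError("no packet-loss figure found")
    return float(m.group(1))

line = "100 packets transmitted, 100 received, 0% packet loss, time 0ms"
print(ping_loss_percent(line))  # 0.0
```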
Results
Success criteria: low packet loss; 0% is preferred.
If the ping succeeds but you still have no RDMA service, contact Mellanox support with the output of the sysinfo-snapshot tool, which can be downloaded from:
https://github.com/Mellanox/linux-sysinfo-snapshot
Upon failure, verify IP and Ethernet connectivity (step 7 in the flowchart), as described in Test #4 below.
Extra info
More details on the ping command can be found at:
https://linux.die.net/man/8/ping
This test enables you to track down the reason for missing IP connectivity. To check for IP and Ethernet connectivity issues, run the following tests.
Test #4.A: IP connectivity problems might be the result of the interface being down. Check the port state by verifying that the physical port is up:
# ip address show dev enp17s0f1
12: enp17s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ec:0d:9a:ae:11:9d brd ff:ff:ff:ff:ff:ff
inet 12.7.156.240/8 brd 12.255.255.255 scope global enp17s0f1
valid_lft forever preferred_lft forever
inet6 fe80::ee0d:9aff:feae:119d/64 scope link
valid_lft forever preferred_lft forever
Test #4.B: Make sure that the number of dropped packets does not increase from one run of the ip -s link show command to another.
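Drop counters appear in the statistics printed by ip -s link show dev <netdev>. Comparing two snapshots can be done with a small helper like the one below (a sketch; it simply diffs counter values sampled before and after a test run):

```python
def drops_increased(before: dict, after: dict) -> bool:
    """Compare RX/TX drop counters taken from two `ip -s link` readings."""
    return any(after[k] > before.get(k, 0) for k in after)

before = {"rx_dropped": 0, "tx_dropped": 0}  # first `ip -s link` reading
after = {"rx_dropped": 0, "tx_dropped": 0}   # second reading, a few seconds later
print(drops_increased(before, after))  # False -> no new drops
```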
Results
Success criteria: the tests above pass and IP connectivity is restored.
If the issue still exists, contact Mellanox support and provide the output of the sysinfo-snapshot tool, which can be downloaded from:
https://github.com/Mellanox/linux-sysinfo-snapshot