For the recommended RoCE configuration and how to verify it, refer to the companion RoCE configuration post.


This post provides guidelines for debugging a RoCE network and tuning RoCE performance. The following flowchart describes the RoCE troubleshooting process.

Information on how to run the tests listed in the flowchart below can be found in the subsequent sections.


Figure 1: RoCE Debug Flow Chart


 

Test #1: Check RDMA Connectivity using ibv_rc_pingpong



This test verifies that RoCE traffic can be sent between the client and the server sides. This test does not require rdma-cm to be enabled.



To check the RDMA connectivity, follow the steps below.




On the server side


  1. Find the server’s ibdev(s).
     If you are working with the upstream drivers, use the rdma command. The output is a list of the server’s InfiniBand devices and their matching netdevs.

# rdma link



1/1: mlx5_0/1: state ACTIVE physical_state LINK_UP netdev enp17s0f0



2/1: mlx5_1/1: state ACTIVE physical_state LINK_UP netdev enp17s0f1



3/1: mlx5_2/1: state ACTIVE physical_state LINK_UP netdev enp134s0f0



4/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev enp134s0f1





OR:



 


  If you are working with OFED, use the ibdev2netdev command instead. The output is a list of the server’s InfiniBand devices and their matching netdevs.


# ibdev2netdev






mlx5_0 port 1 ==> enp17s0f0 (Up)



mlx5_1 port 1 ==> enp17s0f1 (Up)



mlx5_2 port 1 ==> enp134s0f0 (Up)



mlx5_3 port 1 ==> enp134s0f1 (Up)
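If neither tool is installed, the same mapping can be read straight from sysfs. The sketch below is an illustrative fallback, not part of the official flow: the helper name `ibdev_to_netdev` and its optional root parameter are ours, while the `/sys/class/infiniband/<ibdev>/device/net/<netdev>` layout is the one exposed by mlx5 devices.

```shell
# Hypothetical fallback when neither `rdma` nor `ibdev2netdev` is available:
# walk sysfs to map each RDMA device to its netdev. The sysfs root is a
# parameter so the logic can also be exercised against a test directory tree.
ibdev_to_netdev() {
    root="${1:-/sys/class/infiniband}"
    for ibdev in "$root"/*; do
        [ -d "$ibdev" ] || continue
        for netdev in "$ibdev"/device/net/*; do
            [ -d "$netdev" ] || continue
            echo "$(basename "$ibdev") ==> $(basename "$netdev")"
        done
    done
}

ibdev_to_netdev
```

On a host with the adapters above this would print one `mlx5_N ==> enpXsYfZ` line per port.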



 


  2. Find the netdev’s IP address. Select an InfiniBand device from the previous step to be tested, and find the matching netdev’s IP address.


Note: In the examples that follow, the device used is mlx5_1 (netdev enp17s0f1), obtained from the previous step.



# ip address  show dev enp17s0f1



12: enp17s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000



    link/ether ec:0d:9a:ae:11:9d brd ff:ff:ff:ff:ff:ff



    inet 12.7.156.240/8 brd 12.255.255.255 scope global enp17s0f1



       valid_lft forever preferred_lft forever



    inet6 fe80::ee0d:9aff:feae:119d/64 scope link



       valid_lft forever preferred_lft forever





 


  3. Find the netdev’s GID.
        On the same device, find the matching netdev’s GIDs using the show_gids command.




# show_gids mlx5_1



DEV      PORT     INDEX    GID                                             IPv4              VER      DEV



---      ----     -----    ---                                             ------------      ---      ---



mlx5_1   1        0        fe80:0000:0000:0000:ee0d:9aff:feae:119d                        v1       enp17s0f1



mlx5_1   1        1        fe80:0000:0000:0000:ee0d:9aff:feae:119d                        v2       enp17s0f1



mlx5_1   1        2        0000:0000:0000:0000:0000:ffff:0c07:9cf0    12.7.156.240      v1       enp17s0f1



mlx5_1   1        3        0000:0000:0000:0000:0000:ffff:0c07:9cf0    12.7.156.240      v2       enp17s0f1



n_gids_found=4
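Picking the right GID index by eye is error-prone on hosts with many entries. As a sketch (the helper name `pick_v2_gid_index` is ours), the index of the first RoCE v2 GID backed by an IPv4 address can be extracted from show_gids-style output; the sample rows below are the table above inlined as test data:

```shell
# Illustrative helper: print the INDEX of the first GID whose IPv4 column is
# populated and whose VER column is v2 (the GID type ibv_rc_pingpong's -g
# option should point at for RoCE v2 over IPv4).
pick_v2_gid_index() {
    awk '$5 ~ /^[0-9.]+$/ && $6 == "v2" { print $3; exit }'
}

pick_v2_gid_index <<'EOF'
mlx5_1   1        0        fe80:0000:0000:0000:ee0d:9aff:feae:119d                        v1       enp17s0f1
mlx5_1   1        1        fe80:0000:0000:0000:ee0d:9aff:feae:119d                        v2       enp17s0f1
mlx5_1   1        2        0000:0000:0000:0000:0000:ffff:0c07:9cf0    12.7.156.240      v1       enp17s0f1
mlx5_1   1        3        0000:0000:0000:0000:0000:ffff:0c07:9cf0    12.7.156.240      v2       enp17s0f1
EOF
# prints 3 for the table above
```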



 


  4. Run ibv_rc_pingpong as the server to verify that connectivity can be established.

Run the server side of the ping-pong test using the RoCE v2 GID obtained in the previous step (index 3 in the table above):




# ibv_rc_pingpong -d mlx5_1 -g 3



  local address:  LID 0x0000, QPN 0x003968, PSN 0x3869d8, GID ::ffff:12.7.156.240



  remote address: LID 0x0000, QPN 0x001960, PSN 0x39c9d6, GID ::ffff:12.7.156.239



8192000 bytes in 0.01 seconds = 12475.92 Mbit/sec



1000 iters in 0.01 seconds = 5.25 usec/iter




On the client side


  1. Find the client’s ibdev(s).
     If you are working with the upstream drivers, use the rdma command. The output is a list of the client’s InfiniBand devices and their matching netdevs.



 



# rdma link



1/1: mlx5_0/1: state ACTIVE physical_state LINK_DOWN netdev enp17s0f0



2/1: mlx5_1/1: state ACTIVE physical_state LINK_UP netdev enp17s0f1



3/1: mlx5_2/1: state ACTIVE physical_state LINK_DOWN netdev enp134s0f0



4/1: mlx5_3/1: state ACTIVE physical_state LINK_DOWN netdev enp134s0f1





OR



 


  If you are working with OFED, use the ibdev2netdev command instead. The output is a list of the client’s InfiniBand devices and their matching netdevs.




# ibdev2netdev



mlx5_0 port 1 ==> enp17s0f0 (Down)



mlx5_1 port 1 ==> enp17s0f1 (Up)



mlx5_2 port 1 ==> enp134s0f0 (Down)



mlx5_3 port 1 ==> enp134s0f1 (Down)




Note: In the examples that follow, the device used is mlx5_1 (netdev enp17s0f1).


  2. Find the client’s GID using the show_gids command.


# show_gids mlx5_1



DEV      PORT     INDEX    GID                                             IPv4              VER      DEV



---      ----     -----    ---                                             ------------      ---      ---



mlx5_1   1        0        fe80:0000:0000:0000:ee0d:9aff:feae:11e5                        v1       enp17s0f1



mlx5_1   1        1        fe80:0000:0000:0000:ee0d:9aff:feae:11e5                        v2       enp17s0f1



mlx5_1   1        2        0000:0000:0000:0000:0000:ffff:0c07:9cef    12.7.156.239      v1       enp17s0f1



mlx5_1   1        3        0000:0000:0000:0000:0000:ffff:0c07:9cef    12.7.156.239      v2       enp17s0f1



n_gids_found=4



 


  3. Run ibv_rc_pingpong as the client.

Run the client side of the ping-pong test using the RoCE v2 GID obtained in the previous step (index 3 in the table above), passing the server’s IP address as the last argument:




# ibv_rc_pingpong -d mlx5_1 -g 3 12.7.156.240



  local address:  LID 0x0000, QPN 0x001960, PSN 0x39c9d6, GID ::ffff:12.7.156.239



  remote address: LID 0x0000, QPN 0x003968, PSN 0x3869d8, GID ::ffff:12.7.156.240



8192000 bytes in 0.00 seconds = 14864.14 Mbit/sec



1000 iters in 0.00 seconds = 4.41 usec/iter






 


Results

Success criteria: average bandwidth on the client side is larger than 0.
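The success criterion can also be checked mechanically. A sketch (the variable names are ours; the bandwidth line is the client output above inlined as sample data):

```shell
# Extract the reported bandwidth from ibv_rc_pingpong client output and
# verify that it is greater than 0, per the success criterion above.
bw=$(awk '/Mbit\/sec/ { print $7 }' <<'EOF'
8192000 bytes in 0.00 seconds = 14864.14 Mbit/sec
EOF
)
# exit 0 (success) only when the parsed bandwidth is positive
awk -v b="$bw" 'BEGIN { exit !(b + 0 > 0) }' && echo "connectivity OK at $bw Mbit/sec"
```

On a real run, pipe the client output into the same awk filter instead of the inlined sample.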



In case the test completed successfully but you still have no RDMA service, contact Mellanox support with the output of the sysinfo-snapshot tool, which can be downloaded from:


https://github.com/Mellanox/linux-sysinfo-snapshot



In case of failure, perform the basic RDMA checks (see Test #2: Basic RDMA Check below).




Test #2: Basic RDMA Check

This test verifies basic preconditions for establishing RDMA traffic.



 


  1. Check that RoCE is enabled on both the server and the client sides.


# lspci -D | grep Mellanox



0000:11:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]



0000:11:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]



0000:86:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]



0000:86:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]




# cat /sys/bus/pci/devices/0000\:11\:00.1/roce_enable



1



If RoCE is disabled (roce_enable is set to 0), enable it:



 


  Using the devlink command (if you are working with the upstream drivers):

# devlink dev param set pci/0000:00:00.1 name enable_roce value 1 cmode runtime




OR



 


  Writing to sysfs directly (if you are working with OFED):


# echo 1 > /sys/bus/pci/devices/0000\:11\:00.1/roce_enable
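On multi-adapter hosts it helps to check every function at once. The sketch below is ours (the helper name and the parameterized root are assumptions, used so the loop can be exercised without ConnectX hardware); on a real host the files live under /sys/bus/pci/devices:

```shell
# Hypothetical helper: report roce_enable for every function found under the
# given sysfs devices root (default: /sys/bus/pci/devices).
list_roce_state() {
    root="${1:-/sys/bus/pci/devices}"
    for f in "$root"/*/roce_enable; do
        [ -r "$f" ] || continue
        echo "$(basename "$(dirname "$f")") roce_enable=$(cat "$f")"
    done
}

list_roce_state
```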



 


  2. Perform an MTU check.

RoCE requires an MTU of at least 1024 bytes of net payload. In the sub-steps below, check for larger MTUs that accommodate additional headers, such as the IP header, VLAN tags, tunnel encapsulation, etc.



The MTU must be guaranteed end-to-end, without the need to perform segmentation and reassembly.
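The `-s 1250` payload used in the ping below produces 1278-byte IP packets because IPv4 adds a 20-byte header and ICMP echo an 8-byte header. The same arithmetic gives the largest payload that still fits in a given MTU without fragmentation (a sketch assuming plain IPv4 with no VLAN or tunnel overhead):

```shell
# Largest ICMP echo payload that fits in the MTU without fragmentation.
mtu=1500
ip_hdr=20      # IPv4 header
icmp_hdr=8     # ICMP echo header
payload=$((mtu - ip_hdr - icmp_hdr))
echo "$payload"    # 1472 for a 1500-byte MTU
# the corresponding check would be: ping -f -c 100 -s "$payload" -M do <server-ip>
```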



 


2.a. Set the MTU on the server and the client sides, and verify that it is larger than 1250 bytes:



# ip address  show dev enp17s0f1



12: enp17s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000



    link/ether ec:0d:9a:ae:11:9d brd ff:ff:ff:ff:ff:ff



    inet 12.7.156.240/8 brd 12.255.255.255 scope global enp17s0f1



       valid_lft forever preferred_lft forever



    inet6 fe80::ee0d:9aff:feae:119d/64 scope link



       valid_lft forever preferred_lft forever



 


2.b. Perform an end-to-end MTU check. Ping the server:



# ping -f -c 100  -s 1250 -M do  12.7.156.240



PING 12.7.156.240 (12.7.156.240) 1250(1278) bytes of data.



 



--- 12.7.156.240 ping statistics ---



100 packets transmitted, 100 received, 0% packet loss, time 0ms



rtt min/avg/max/mdev = 0.003/0.003/0.012/0.001 ms, ipg/ewma 0.008/0.003 ms




Success criteria: both checks above pass. If not, correct the MTU size.




 


  3. Check the device info by running ibv_devinfo on the server.

# ibv_devinfo  -d mlx5_1 -vvv



hca_id:  mlx5_1



          transport:                           InfiniBand (0)



          fw_ver:                               16.24.1000







                             GID[  1]:                   fe80:0000:0000:0000:ee0d:9aff:feae:119d



                             GID[  2]:                   0000:0000:0000:0000:0000:ffff:0c07:9cf0



                             GID[  3]:                   0000:0000:0000:0000:0000:ffff:0c07:9cf0




Success criteria: command succeeded.



 


Results

If the configuration was updated as a result of this test (for example, an MTU change), the test completed successfully. In that case, check IP connectivity (Test #3).



If the issue still exists, re-do the steps in Test #1.



In case of failure (the command returned an error, hung, etc.), contact Mellanox support with the output of the sysinfo-snapshot tool, which can be downloaded from:


https://github.com/Mellanox/linux-sysinfo-snapshot


Extra info

More details on ping command can be found at:



https://linux.die.net/man/8/ping



More details on show_gids command can be found at:



https://community.mellanox.com/s/article/understanding-show-gids-script



More details on ibv_devinfo command can be found at:



https://linux.die.net/man/1/ibv_devinfo




Test #3: Check IP Connectivity using Ping

This test verifies that IP traffic can be sent between the client and the server sides.



On the server side:

Find the server’s IP address by following the second step in Test #1 above.



On the client side:

Ping the server:

# ping -f -c 100 12.7.156.240



PING 12.7.156.240 (12.7.156.240) 56(84) bytes of data.



 



--- 12.7.156.240 ping statistics ---



100 packets transmitted, 100 received, 0% packet loss, time 0ms



rtt min/avg/max/mdev = 0.002/0.003/0.015/0.002 ms, ipg/ewma 0.007/0.002 ms


Results

Success criteria: low packet loss; 0% is preferred.
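The loss percentage can be pulled out of the ping summary line for scripted checks. A sketch (the summary line above is inlined as sample data; on a real run, pipe the live ping output instead):

```shell
# Extract the packet-loss percentage from the ping statistics summary line
# and require it to be 0, per the success criterion above.
loss=$(awk '/packet loss/ { sub(/%.*/, "", $6); print $6 }' <<'EOF'
100 packets transmitted, 100 received, 0% packet loss, time 0ms
EOF
)
[ "$loss" -eq 0 ] && echo "IP connectivity OK"
```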



On success, contact Mellanox support with the output of the sysinfo-snapshot tool, which can be downloaded from:


https://github.com/Mellanox/linux-sysinfo-snapshot



Upon failure, verify IP and Ethernet connectivity (Test #4).


Extra info

More details on the ping command can be found at:



https://linux.die.net/man/8/ping


Test #4: Verify IP and Ethernet Connectivity

This test helps you track down the reason for missing IP connectivity. To check for IP and Ethernet connectivity issues, run the following tests.




Test #4.A: IP connectivity problems might result from the interface being down. Check the port state by verifying that the physical port is up:



# ip address  show dev enp17s0f1



12: enp17s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000



    link/ether ec:0d:9a:ae:11:9d brd ff:ff:ff:ff:ff:ff



    inet 12.7.156.240/8 brd 12.255.255.255 scope global enp17s0f1



       valid_lft forever preferred_lft forever



    inet6 fe80::ee0d:9aff:feae:119d/64 scope link



       valid_lft forever preferred_lft forever




Test #4.B: Make sure that the number of dropped packets does not increase from one run of the ip command to the next.
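This check can be sketched as follows. The snapshot below is inlined sample data in the shape of `ip -s link show dev <ifname>` statistics output (exact column names vary between iproute2 versions); on a real host, take two live readings a few seconds apart:

```shell
# Pull the RX "dropped" counter (4th field of the data line under the RX
# header) out of ip -s link statistics output.
rx_dropped() {
    awk '/RX:.*dropped/ { getline; print $4; exit }'
}

before=$(rx_dropped <<'EOF'
    RX: bytes  packets  errors  dropped overrun mcast
    904214530  1459822  0       12      0       0
EOF
)
after=$before    # second reading; identical here, so no new drops are reported
if [ "$after" -gt "$before" ]; then
    echo "RX drops increasing: check cabling, MTU, and switch counters"
else
    echo "no new RX drops"
fi
```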


Results

Success criteria: the test completes and IP connectivity is restored.



If the issue persists, contact Mellanox support and provide the output of the sysinfo-snapshot tool, which can be downloaded from:


https://github.com/Mellanox/linux-sysinfo-snapshot