Error: Transport retry count exceeded on mlx5_0:1/RoCE #6000
This error means there are too many packet drops on the network layer.
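One way to confirm drops at the NIC level is to inspect the Ethernet and RoCE hardware counters. A minimal diagnostic sketch; the interface name eth2 and device mlx5_0 are placeholders for your setup:

```sh
# Look for discard/drop/pause counters on the Ethernet interface
ethtool -S eth2 | grep -Ei 'disc|drop|pause'

# RoCE hardware counters exposed by the mlx5 driver; a growing
# out_of_buffer count points at receive-side packet drops
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/out_of_buffer
```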
Hi @yosefe,
@afernandezody there are basically 2 options to configure a RoCE fabric:

1. Lossy RoCE: only the NIC needs to be configured (supported starting with ConnectX-5); see the sketch after this list.
2. Lossless RoCE: both the NICs and the Ethernet switches must be configured.

Unfortunately, it's not possible to fix the issue without making some kind of change in the system configuration.
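For option 1, the lossy RoCE accelerations are toggled through the ROCE_ACCL access register using MOFED's mst/mlxreg tools. A minimal sketch, assuming a ConnectX-5 whose MST device node is /dev/mst/mt4119_pciconf0 (the device path is a placeholder for your system):

```sh
# Start the Mellanox software tools to expose the MST device nodes
mst start

# Inspect the current lossy-RoCE acceleration settings
mlxreg -d /dev/mst/mt4119_pciconf0 --reg_name ROCE_ACCL --get

# Enable adaptive retransmission, the TX window, and slow restart
mlxreg -d /dev/mst/mt4119_pciconf0 --reg_name ROCE_ACCL --set "roce_adp_retrans_en=0x1"
mlxreg -d /dev/mst/mt4119_pciconf0 --reg_name ROCE_ACCL --set "roce_tx_window_en=0x1"
mlxreg -d /dev/mst/mt4119_pciconf0 --reg_name ROCE_ACCL --set "roce_slow_restart_en=0x1"
```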
@yosefe, @afernandezody, I'm having exactly the same problem. Any update on this issue? I tried to configure the driver to enable lossy RoCE, but this doesn't seem to help. Two of the three indicated options (roce_tx_window_en and roce_slow_restart_en) are not even supported by my NIC (ConnectX-4). So should I go for lossless mode?

What I also noticed is that performance seems somewhat better when running the mpirun command with "--map-by dist -mca rmaps_dist_device mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1" as additional arguments; a full command is sketched below. If I understand correctly, this maps processes to cores which are directly connected to the NIC. On our system, this seems to work up to ~8 cores per node. Beyond that, processes are mapped to cores which are not directly connected to the NIC, because 8 is also the number of cores per NUMA node. We have nodes with two AMD EPYC 7551 CPUs, so 64 cores per node in total.
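For reference, the full invocation looks roughly like this (a sketch; the rank count and the ./my_app binary are placeholders):

```sh
# Map ranks by distance to mlx5_0 port 1 and pin HCOLL to the same HCA
mpirun -np 8 \
       --map-by dist -mca rmaps_dist_device mlx5_0:1 \
       -x HCOLL_MAIN_IB=mlx5_0:1 \
       ./my_app
```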
@edofrederix for ConnectX-4 NICs, the recommended way is to configure the NIC and the network Eth switches to Lossless mode, as described in https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment (option 2). Lossy RoCE (option 1) is supported starting with ConnectX-5 and requires only NIC configuration, without any need to touch the switches.
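The NIC side of the lossless setup boils down to trusting DSCP markings and enabling PFC on the RoCE priority. A minimal sketch using MOFED's mlnx_qos; the interface eth2 and priority 3 are assumptions taken from the typical example, and the switch ports must be configured to match:

```sh
# Trust L3 DSCP markings instead of L2 PCP for priority classification
mlnx_qos -i eth2 --trust dscp

# Enable priority flow control on priority 3 only (the usual RoCE priority)
mlnx_qos -i eth2 --pfc 0,0,0,1,0,0,0,0
```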
Hi @yosefe @afernandezody, we will investigate our options regarding a lossless fabric and see what's possible. Feel free to close the issue -- it doesn't really seem to be UCX-related anyway. Thanks for the feedback.
Hello,
The system is using UCX 1.9.0, MOFED (5.1-2.5.8.0-ol8.2-x86_64), OL8.2, and OpenMPI v4.0.5. Some IB info:
Intra-node jobs run without any issue, but jobs across 2 or more nodes run for maybe 90 seconds or a couple of minutes and then keep crashing with very lengthy error messages:
Not really sure if it's a bug or if something in my network is causing the issue, so any help would be welcome. Thanks.
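Since the failure is an RC transport retry timeout, one way to test whether network-layer drops (rather than a UCX bug) are the trigger is to raise the retry budget and see if the crashes become less frequent. A hedged sketch using UCX's runtime configuration variables; the rank count and ./my_app are placeholders, and the timeout value is arbitrary:

```sh
# Show the current RC transport settings (defaults for this build)
ucx_info -c | grep -E 'RC_(TIMEOUT|RETRY_COUNT)'

# Re-run with a longer per-retry timeout; the retry count itself maps to
# a 3-bit hardware field, so it usually already sits at its maximum of 7
mpirun -np 16 -x UCX_RC_TIMEOUT=5s ./my_app
```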