Error: Transport retry count exceeded on mlx5_0:1/RoCE #6000
This error means there are too many packet drops on the network layer.
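One way to confirm drops at the NIC level is to inspect the Ethernet and RoCE hardware counters. A minimal diagnostic sketch; the interface name eth2 and device mlx5_0 are placeholders for your setup:

```sh
# Look for discard/drop/pause counters on the Ethernet interface
ethtool -S eth2 | grep -Ei 'disc|drop|pause'

# RoCE hardware counters exposed by the mlx5 driver; a growing
# out_of_buffer count points at receive-side packet drops
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/out_of_buffer
```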
Hi @yosefe,
@afernandezody there are basically 2 options to configure a RoCE fabric:

1. Lossy RoCE: only the NIC needs to be configured (supported starting with ConnectX-5); see the sketch after this list.
2. Lossless RoCE: both the NICs and the Ethernet switches must be configured.

Unfortunately, it's not possible to fix the issue without making some kind of change in the system configuration.
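For option 1, the lossy RoCE accelerations are toggled through the ROCE_ACCL access register using MOFED's mst/mlxreg tools. A minimal sketch, assuming a ConnectX-5 whose MST device node is /dev/mst/mt4119_pciconf0 (the device path is a placeholder for your system):

```sh
# Start the Mellanox software tools to expose the MST device nodes
mst start

# Inspect the current lossy-RoCE acceleration settings
mlxreg -d /dev/mst/mt4119_pciconf0 --reg_name ROCE_ACCL --get

# Enable adaptive retransmission, the TX window, and slow restart
mlxreg -d /dev/mst/mt4119_pciconf0 --reg_name ROCE_ACCL --set "roce_adp_retrans_en=0x1"
mlxreg -d /dev/mst/mt4119_pciconf0 --reg_name ROCE_ACCL --set "roce_tx_window_en=0x1"
mlxreg -d /dev/mst/mt4119_pciconf0 --reg_name ROCE_ACCL --set "roce_slow_restart_en=0x1"
```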
@yosefe, @afernandezody, I'm having exactly the same problem. Any update on this issue? I tried to configure the driver to enable lossy RoCE, but this doesn't seem to help. Two of the three indicated options (roce_tx_window_en and roce_slow_restart_en) are not even supported by my NIC (ConnectX-4). So should I go for lossless mode?

What I also noticed is that performance seems somewhat better when running the mpirun command with "--map-by dist -mca rmaps_dist_device mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1" as additional arguments; a full command is sketched below. If I understand correctly, this maps processes to cores which are directly connected to the NIC. On our system, this seems to work up to ~8 cores per node. Beyond that, processes are mapped to cores which are not directly connected to the NIC, because 8 is also the number of cores per NUMA node. We have nodes with two AMD EPYC 7551 CPUs, so 64 cores per node in total.
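For reference, the full invocation looks roughly like this (a sketch; the rank count and the ./my_app binary are placeholders):

```sh
# Map ranks by distance to mlx5_0 port 1 and pin HCOLL to the same HCA
mpirun -np 8 \
       --map-by dist -mca rmaps_dist_device mlx5_0:1 \
       -x HCOLL_MAIN_IB=mlx5_0:1 \
       ./my_app
```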
@edofrederix for ConnectX-4 NICs, the recommended way is to configure the NIC and the network Eth switches to Lossless mode, as described in https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment (option 2). Lossy RoCE (option 1) is supported starting with ConnectX-5 and requires only NIC configuration, without any need to touch the switches.
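The NIC side of the lossless setup boils down to trusting DSCP markings and enabling PFC on the RoCE priority. A minimal sketch using MOFED's mlnx_qos; the interface eth2 and priority 3 are assumptions taken from the typical example, and the switch ports must be configured to match:

```sh
# Trust L3 DSCP markings instead of L2 PCP for priority classification
mlnx_qos -i eth2 --trust dscp

# Enable priority flow control on priority 3 only (the usual RoCE priority)
mlnx_qos -i eth2 --pfc 0,0,0,1,0,0,0,0
```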
Hi @yosefe @afernandezody, we will investigate our options regarding a lossless fabric and see what's possible. Feel free to close the issue -- it doesn't really seem to be UCX-related anyway. Thanks for the feedback.
Hello,
The system is using UCX 1.9.0, MOFED (5.1-2.5.8.0-ol8.2-x86_64), OL8.2, and OpenMPI v4.0.5. Some IB info:
Intra-node jobs run without any issue, but jobs across 2 or more nodes run for maybe 90 seconds or a couple of minutes and then keep crashing with very lengthy error messages:
Not really sure if it's a bug or if something in my network is causing the issue, so any help would be welcome. Thanks.
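Since the failure is an RC transport retry timeout, one way to test whether network-layer drops (rather than a UCX bug) are the trigger is to raise the retry budget and see if the crashes become less frequent. A hedged sketch using UCX's runtime configuration variables; the rank count and ./my_app are placeholders, and the timeout value is arbitrary:

```sh
# Show the current RC transport settings (defaults for this build)
ucx_info -c | grep -E 'RC_(TIMEOUT|RETRY_COUNT)'

# Re-run with a longer per-retry timeout; the retry count itself maps to
# a 3-bit hardware field, so it usually already sits at its maximum of 7
mpirun -np 16 -x UCX_RC_TIMEOUT=5s ./my_app
```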