Hi! I have two servers, each with 4 P40 GPUs. How can I run nccl-tests with MSCCL across both nodes?
server1(10.0.0.13) <---> (10.0.0.15)server2
The test runs successfully on server1 or server2 alone with its 4 GPUs, but it fails with errors when I run it across both servers (2 nodes).
hostfile:
10.0.0.13 slots=4
10.0.0.15 slots=4
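As a sanity check, the total slot count in the hostfile should cover the `-np 8` passed to mpirun. A minimal sketch, assuming the simple `host slots=N` format used here; the helper name `total_slots` is hypothetical and not part of Open MPI, and hosts without an explicit `slots=N` are simply ignored by this sketch:

```python
def total_slots(hostfile_text: str) -> int:
    """Sum the explicit slots=N values in an Open MPI style hostfile."""
    total = 0
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        # First field is the host; remaining fields are key=value options.
        for field in line.split()[1:]:
            if field.startswith("slots="):
                total += int(field.split("=", 1)[1])
    return total


hostfile = """\
10.0.0.13 slots=4
10.0.0.15 slots=4
"""
print(total_slots(hostfile))  # 8, which matches -np 8
```

With the two entries above, the count matches the number of ranks requested, so the hostfile itself does not look like the source of the failure.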
my xml file (generated with):
python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 1 > test-reduce-8-1.xml
command:
mpirun --allow-run-as-root -np 8 \
  -hostfile hostfile \
  --prefix /home/nccl-tool/dependency/openmpi \
  -x LD_LIBRARY_PATH=executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH \
  -x NCCL_DEBUG=INFO \
  -x MSCCL_XML_FILES=test-reduce-8-1.xml \
  -x NCCL_ALGO=MSCCL,RING,TREE \
  -x NCCL_MSCCL_ENABLE=1 \
  tests/msccl-tests-nccl/build/all_reduce_perf -b 1K -e 1K -f 2 -g 1
error:
ubuntu2004-113:210748:210793 [0] misc/ibvwrap.cc:187 NCCL WARN Call to ibv_modify_qp failed with error Network is unreachable
ubuntu2004-113:210748:210793 [0] NCCL INFO transport/net_ib.cc:579 -> 2
ubuntu2004-113:210748:210793 [0] NCCL INFO transport/net_ib.cc:786 -> 2
ubuntu2004-113:210748:210793 [0] NCCL INFO transport/net.cc:730 -> 2
ubuntu2004-113:210748:210793 [0] NCCL INFO proxy.cc:1310 -> 2
ubuntu2004-113:210748:210793 [0] NCCL INFO proxy.cc:1381 -> 2
ubuntu2004-113:210748:210793 [0] proxy.cc:1523 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 2
ubuntu2004-113:210748:210781 [0] misc/socket.cc:29 NCCL WARN socketProgressOpt: Call to recv from 192.168.16.113<47337> failed : Broken pipe
ubuntu2004-113:210748:210781 [0] NCCL INFO misc/socket.cc:46 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO misc/socket.cc:57 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO misc/socket.cc:772 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO proxy.cc:1111 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO transport/net.cc:358 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO transport.cc:174 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO init.cc:1089 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO init.cc:1378 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO group.cc:68 -> 6 [Async thread]
ubuntu2004-113:210748:210748 [0] NCCL INFO group.cc:429 -> 6
ubuntu2004-113:210748:210748 [0] NCCL INFO group.cc:115 -> 6
ubuntu2004-113: Test NCCL failure common.cu:973 'remote process exited or there was a network error / '
.. ubuntu2004-113 pid 210748: Test failure common.cu:857
error log file:
msccl-2nodes-faillog.txt
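To make the log easier to scan, the WARN lines and any IPv4 addresses they mention can be pulled out. A minimal sketch, assuming the `NCCL WARN` marker seen in the output above; the function name `summarize_nccl_log` is hypothetical. In the excerpt, the only address in a WARN line is 192.168.16.113, which is not in the 10.0.0.x range used in the hostfile; that may or may not be related to the failure.

```python
import re

# Loose IPv4 matcher; good enough for spotting addresses in log lines.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def summarize_nccl_log(log_text):
    """Return (warn_lines, addresses) found in NCCL debug output."""
    warn_lines = [ln.strip() for ln in log_text.splitlines() if "NCCL WARN" in ln]
    addresses = sorted({ip for ln in warn_lines for ip in IPV4.findall(ln)})
    return warn_lines, addresses


# Three lines copied from the error output above.
sample = """\
ubuntu2004-113:210748:210793 [0] misc/ibvwrap.cc:187 NCCL WARN Call to ibv_modify_qp failed with error Network is unreachable
ubuntu2004-113:210748:210793 [0] NCCL INFO transport/net_ib.cc:579 -> 2
ubuntu2004-113:210748:210781 [0] misc/socket.cc:29 NCCL WARN socketProgressOpt: Call to recv from 192.168.16.113<47337> failed : Broken pipe
"""

warns, addrs = summarize_nccl_log(sample)
for line in warns:
    print(line)
print("addresses in WARN lines:", addrs)
```

The same function can be pointed at the full attached log by reading the file into a string first.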