This is a Sonnet 3.6 translation of a Chinese article. Please be mindful of potential translation errors.
In the previous section, we completed the channel search within a single machine. Taking ringGraph as an example, this means we found a series of rings within a single machine. Now we need to connect the rings between machines.
For easier understanding, let’s consider a scenario with two machines and sixteen GPUs in total, eight per machine. The ring for the first machine is:
Then it begins to set ncclTopoRanks, recording the prev and next of the current rank in the ring; the prev of the first rank and the next of the last rank are -1. For example, rank 6’s prev is 7 and its next is 3. It also records the ringRecv and ringSend of the current ring, which are the first and last ranks of the ring. Finally, it duplicates each found ring once; according to the official issue, this is done to create more parallelism and fully utilize bandwidth.
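As a rough illustration of this step, the sketch below fills a simplified, single-channel version of this per-rank ring information from an intra-node ring. The struct and helper names are assumptions for illustration only, not the actual ncclTopoRanks code; only the field names mirror the ones gathered later in ncclTopoPostset.

```cpp
// Hypothetical sketch: fill per-channel ring info for the local node.
// The struct and helper are illustrative, not NCCL's real implementation.
struct TopoRanksSketch {
  int ringRecv;  // first rank of the intra-node ring (node-level recv end)
  int ringSend;  // last rank of the intra-node ring (node-level send end)
  int ringPrev;  // prev of the current rank, -1 for the first rank
  int ringNext;  // next of the current rank, -1 for the last rank
};

static void fillTopoRanks(const int* intraRing, int n, int myRank, TopoRanksSketch* out) {
  out->ringRecv = intraRing[0];
  out->ringSend = intraRing[n - 1];
  for (int i = 0; i < n; i++) {
    if (intraRing[i] != myRank) continue;
    out->ringPrev = (i == 0) ? -1 : intraRing[i - 1];
    out->ringNext = (i == n - 1) ? -1 : intraRing[i + 1];
  }
}
```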
Then through bootstrapAllGather, it obtains global allGather3Data information, calculates the node where the current rank is located and stores it in comm->node, and stores the first rank of each node in nodesFirstRank. For this example:
```cpp
nodesFirstRank[0]: 0
nodesFirstRank[1]: 10
```
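A minimal sketch of how these values can be derived from the gathered data is shown below. It assumes each rank contributes the head of its intra-node ring (its ringRecv for channel 0); the function name and the vector-based input are illustrative stand-ins for the real allGather3Data handling, not NCCL's exact code.

```cpp
// Illustrative sketch (not the exact NCCL logic): given, for every rank, the
// first rank of the intra-node ring it belongs to, derive nodesFirstRank and
// the node index of the current rank. "firstRankPerRank" is a hypothetical
// input standing in for the allGather3Data exchange.
#include <vector>

static int computeNodes(const std::vector<int>& firstRankPerRank, int myRank,
                        std::vector<int>* nodesFirstRank) {
  int myNode = -1;
  for (int r = 0; r < (int)firstRankPerRank.size(); r++) {
    int first = firstRankPerRank[r];
    // Register each distinct first rank once, in rank order.
    int node = -1;
    for (int n = 0; n < (int)nodesFirstRank->size(); n++) {
      if ((*nodesFirstRank)[n] == first) { node = n; break; }
    }
    if (node == -1) {
      node = (int)nodesFirstRank->size();
      nodesFirstRank->push_back(first);
    }
    if (r == myRank) myNode = node;  // stored in comm->node in the real code
  }
  return myNode;
}
```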
Then it begins to connect the ends of rings from each machine to form a large ring.
```cpp
ncclResult_t ncclTopoPostset(struct ncclComm* comm, int* firstRanks, struct ncclTopoRanks** allTopoRanks, int* rings) {
  // Gather data from all ranks
  int *ringRecv, *ringSend, *ringPrev, *ringNext, *treeUpRecv, *treeUpSend, *treeDnRecv, *treeDnSend;
  int nranks = comm->nRanks;
  int nChannels = comm->nChannels;
  NCCLCHECK(ncclCalloc(&ringRecv, nranks*MAXCHANNELS));
  NCCLCHECK(ncclCalloc(&ringSend, nranks*MAXCHANNELS));
  NCCLCHECK(ncclCalloc(&ringPrev, nranks*MAXCHANNELS));
  NCCLCHECK(ncclCalloc(&ringNext, nranks*MAXCHANNELS));
  NCCLCHECK(ncclCalloc(&treeUpRecv, nranks*MAXCHANNELS));
  NCCLCHECK(ncclCalloc(&treeUpSend, nranks*MAXCHANNELS));
  NCCLCHECK(ncclCalloc(&treeDnRecv, nranks*MAXCHANNELS));
  NCCLCHECK(ncclCalloc(&treeDnSend, nranks*MAXCHANNELS));
  for (int i=0; i<nranks; i++) {
    for (int c=0; c<nChannels; c++) {
      ringRecv[c*nranks+i] = allTopoRanks[i]->ringRecv[c];
      ringSend[c*nranks+i] = allTopoRanks[i]->ringSend[c];
      ringPrev[c*nranks+i] = allTopoRanks[i]->ringPrev[c];
      ringNext[c*nranks+i] = allTopoRanks[i]->ringNext[c];
      treeUpRecv[c*nranks+i] = allTopoRanks[i]->treeUpRecv[c];
      treeUpSend[c*nranks+i] = allTopoRanks[i]->treeUpSend[c];
      treeDnRecv[c*nranks+i] = allTopoRanks[i]->treeDnRecv[c];
      treeDnSend[c*nranks+i] = allTopoRanks[i]->treeDnSend[c];
    }
  }

  // Connect rings and trees. This should also duplicate the channels.
  NCCLCHECK(connectRings(comm, ringRecv, ringSend, ringPrev, ringNext, firstRanks));
  NCCLCHECK(connectTrees(comm, treeUpRecv, treeUpSend, treeDnRecv, treeDnSend, firstRanks));

  // Duplicate ringPrev/ringNext for ncclBuildRing
  memcpy(ringPrev+nChannels*nranks, ringPrev, nChannels*nranks*sizeof(int));
  memcpy(ringNext+nChannels*nranks, ringNext, nChannels*nranks*sizeof(int));

  // Duplication should be complete now
  nChannels = comm->nChannels = std::min(MAXCHANNELS, nChannels*2);

  // Honor NCCL_MIN_NRINGS/NCCL_MAX_NRINGS.
  // We permit combining max, then min, to only use the first channels, then duplicate them.
  nChannels = comm->nChannels = std::min((int)ncclMaxNchannels(), nChannels);
  int c;
  for (c=nChannels; c<ncclMinNchannels(); c++) {
    memcpy(ringPrev+c*nranks, ringPrev+(c-nChannels)*nranks, nranks*sizeof(int));
    memcpy(ringNext+c*nranks, ringNext+(c-nChannels)*nranks, nranks*sizeof(int));
    memcpy(comm->channels+c, comm->channels+c-nChannels, sizeof(struct ncclChannel));
  }
  nChannels = comm->nChannels = c;

  // Create rings array and check all is fine
  NCCLCHECK(ncclBuildRings(nChannels, rings, comm->rank, comm->nRanks, ringPrev, ringNext));

  free(ringRecv);
  free(ringSend);
  free(ringPrev);
  free(ringNext);
  free(treeUpRecv);
  free(treeUpSend);
  free(treeDnRecv);
  free(treeDnSend);
  return ncclSuccess;
}
```
Here, the prev, next, send, and recv information of all channels is flattened into arrays; for example, ringRecv[0] holds, for the first channel, the recv end (the first rank of the intra-node ring) reported by rank 0. Then connectRings computes, for each machine, the prev of its first rank and the next of its last rank, roughly as in the sketch below.
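The following is a simplified sketch of that per-node stitching, under the assumption of a single channel and with illustrative parameter names; it is not the exact connectRings code.

```cpp
// Simplified sketch of the inter-node stitching done in connectRings
// (illustrative only): for each node, the recv end of its intra-node ring
// gets the previous node's send end as prev, and the send end gets the
// next node's recv end as next.
static void stitchRing(int nNodes, const int* firstRanks,
                       const int* recv, const int* send, int* prev, int* next) {
  for (int n = 0; n < nNodes; n++) {
    int recvRank = recv[firstRanks[n]];  // head of node n's intra-node ring
    int sendRank = send[firstRanks[n]];  // tail of node n's intra-node ring
    prev[recvRank] = send[firstRanks[(n - 1 + nNodes) % nNodes]];  // tail of previous node
    next[sendRank] = recv[firstRanks[(n + 1) % nNodes]];           // head of next node
  }
}
```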
As the sketch illustrates, the prev of the current machine’s recv rank is the send rank of the previous machine, and the next of the current machine’s send rank is the recv rank of the next machine. Then ncclBuildRings executes to record the ranks into rings following the order of the large ring.
```cpp
ncclResult_t ncclBuildRings(int nrings, int* rings, int rank, int nranks, int* prev, int* next) {
  for (int r=0; r<nrings; r++) {
    char prefix[30];
    int current = rank;
    for (int i=0; i<nranks; i++) {
      rings[r*nranks+i] = current;
      current = next[r*nranks+current];
    }
    ...
    // Check that all ranks are there
    for (int i=0; i<nranks; i++) {
      int found = 0;
      for (int j=0; j<nranks; j++) {
        if (rings[r*nranks+j] == i) {
          found = 1;
          break;
        }
      }
      if (found == 0) {
        WARN("Error : ring %d does not contain rank %d", r, i);
        return ncclInternalError;
      }
    }
  }
  return ncclSuccess;
}
```
Using the above example, the first large ring recorded by rank6 is:
At this point, the large ring between machines is established, and each rank knows its previous and next ranks, so actual communication links can be established.
Next, each rank needs to allocate memory for communication. To improve performance, CPU affinity is set before allocating buffers to ensure the allocated memory is local to the current NUMA node.
```cpp
cpu_set_t affinitySave;
sched_getaffinity(0, sizeof(cpu_set_t), &affinitySave);
NCCLCHECK(ncclTopoSetAffinity(comm->topo, comm->rank));

ncclResult_t ncclTopoSetAffinity(struct ncclTopoSystem* system, int rank) {
  struct ncclTopoNode* cpu = NULL, *gpu = NULL;
  for (int g=0; g<system->nodes[GPU].count; g++) {
    if (system->nodes[GPU].nodes[g].gpu.rank == rank) {
      gpu = system->nodes[GPU].nodes+g;
      // Find closer CPU
      int cpuIndex = -1, minHops = 0;
      for (int c=0; c<system->nodes[CPU].count; c++) {
        int nHops = system->nodes[GPU].nodes[g].paths[CPU][c].count;
        if (cpuIndex == -1 || nHops < minHops) {
          cpuIndex = c;
          minHops = nHops;
        }
      }
      cpu = system->nodes[CPU].nodes+cpuIndex;
    }
  }
  if (cpu == NULL) {
    WARN("Set CPU affinity : unable to find GPU/CPU for rank %d", rank);
    return ncclInternalError;
  }

  // Query the CPU affinity set we were provided
  cpu_set_t mask;
  SYSCHECK(sched_getaffinity(0, sizeof(cpu_set_t), &mask), "sched_getaffinity");

  // Get the affinity of the CPU close to our GPU.
  cpu_set_t cpuMask = cpu->cpu.affinity;

  cpu_set_t finalMask;
  if (ncclParamIgnoreCpuAffinity())
    // Ignore the CPU affinity set and use the GPU one instead
    finalMask = cpuMask;
  else
    // Use a subset of the GPU affinity set
    CPU_AND(&finalMask, &mask, &cpuMask);

  // If there is a non empty set, use it to set affinity
  if (CPU_COUNT(&finalMask)) {
    char affinityStr[sizeof(cpu_set_t)*2];
    NCCLCHECK(ncclCpusetToStr(&finalMask, affinityStr));
    INFO(NCCL_INIT, "Setting affinity for GPU %d to %s", gpu->gpu.dev, affinityStr);
    SYSCHECK(sched_setaffinity(0, sizeof(cpu_set_t), &finalMask), "sched_setaffinity");
  }
  return ncclSuccess;
}
```
First, it gets the current thread’s CPU affinity and saves it to affinitySave. After buffer allocation, it will restore the affinity using affinitySave.
Then ncclTopoSetAffinity sets the CPU affinity. After finding the CPU node closest to the current rank’s GPU, it gets the cores belonging to that CPU (cpuMask) and the affinity of the current thread (mask). By default it takes the intersection of cpuMask and mask as finalMask, and if that intersection is not empty, it applies finalMask to the current thread.
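The overall save → pin → allocate → restore sequence described above looks roughly like the self-contained sketch below. Here pinToCpu is a hypothetical stand-in for ncclTopoSetAffinity, and CPU 0 is an arbitrary example; this is not NCCL’s code, just the pattern.

```cpp
// Condensed sketch of the affinity save/set/restore pattern. pinToCpu is a
// hypothetical helper standing in for ncclTopoSetAffinity.
#include <sched.h>
#include <cstdlib>

static void pinToCpu(int cpu) {                            // hypothetical stand-in
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  sched_setaffinity(0, sizeof(cpu_set_t), &set);
}

int main() {
  cpu_set_t affinitySave;
  sched_getaffinity(0, sizeof(cpu_set_t), &affinitySave);  // remember current affinity
  pinToCpu(0);                                             // pin near the GPU's NUMA node (illustrative)
  void* buf = std::malloc(1 << 20);                        // pages faulted in later land on the local NUMA node (first touch)
  std::free(buf);
  sched_setaffinity(0, sizeof(cpu_set_t), &affinitySave);  // restore the original affinity
  return 0;
}
```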
Now a brief look at the ncclChannel data structure: collectives stores the communication operations users submit to nccl, such as ncclSend and ncclRecv, with ncclColl holding the arguments of each operation. collectives is a circular queue, so collStart points to the start position and collCount is the number of operations currently in the queue. collFifoHead and collFifoTail coordinate the kernel producing data and the network side sending it, essentially forming a producer-consumer pair (a toy sketch of such a queue follows the struct definition below). ncclPeer stores communication-related information, which we’ll cover in detail later.
```cpp
struct ncclRing {
  // Shortcuts for userRanks[1] and userRanks[n-1]
  int prev;  // the previous rank of the current rank in the ring
  int next;  // the next rank of the current rank in the ring

  // Maps an internal nccl index to user-specified rank order. This is necessary
  // since we need to know how the user expects data to be ordered across
  // devices. Ordered from current device.
  int* userRanks;     // the whole ring, starting from the current rank
  int* devUserRanks;  // device-side copy of userRanks
};

struct ncclChannel {
  union {
    struct {
      struct ncclRing ring;
      struct ncclTree treeUp;
      struct ncclTree treeDn;
      struct ncclTree collTreeUp;
      struct ncclTree collTreeDn;

      int id;

      // Communication structures
      struct ncclPeer* peers;
      struct ncclPeer* devPeers;

      // Operation list for aggregation
      struct ncclColl* collectives;
      int collStart;
      int collCount;
      int collFifoHead;  // Only used by GPU
      int collFifoTail;  // Only used by CPU
    };
    int data[0x80];
  };
};
```
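To make the circular-queue bookkeeping concrete, here is a toy producer/consumer sketch using collStart/collCount-style indices. It is purely illustrative and not NCCL’s actual enqueue/dequeue code; kMaxOps stands in for NCCL_MAX_OPS.

```cpp
// Toy sketch of a fixed-size circular operation queue, illustrating how
// collStart/collCount-style indices track pending operations.
constexpr int kMaxOps = 8;  // stand-in for NCCL_MAX_OPS

struct OpQueue {
  int ops[kMaxOps];
  int start = 0;  // index of the oldest pending op (like collStart)
  int count = 0;  // number of pending ops (like collCount)
};

static bool push(OpQueue* q, int op) {
  if (q->count == kMaxOps) return false;           // queue full
  q->ops[(q->start + q->count) % kMaxOps] = op;    // append at the tail
  q->count++;
  return true;
}

static bool pop(OpQueue* q, int* op) {
  if (q->count == 0) return false;                 // queue empty
  *op = q->ops[q->start];                          // consume the oldest op
  q->start = (q->start + 1) % kMaxOps;
  q->count--;
  return true;
}
```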
Then it begins to initialize the channel. initChannel mainly handles buffer allocation: allocating userRanks and devUserRanks, setting up ncclPeer, and allocating collectives. Since both the host and the device access the collectives data structure, page-locked memory is allocated on the host side via cudaHostAlloc with the cudaHostAllocMapped flag, which maps it into the CUDA address space. However, on UVA systems there is no significant difference between cudaMallocHost, cudaHostAlloc + cudaHostAllocDefault, and cudaHostAlloc + cudaHostAllocMapped: in all cases both host and device can access the memory.
```cpp
ncclResult_t initChannel(struct ncclComm* comm, int channelid) {
  struct ncclChannel* channel = comm->channels+channelid;
  if (channel->id != -1) return ncclSuccess;
  channel->id = channelid;

  // Ring index to user rank table.
  NCCLCHECK(ncclCudaCalloc(&channel->ring.devUserRanks, comm->nRanks));
  NCCLCHECK(ncclCalloc(&channel->ring.userRanks, comm->nRanks));

  // Communication structures with peers.
  NCCLCHECK(ncclCudaCalloc(&channel->devPeers, comm->nRanks+1));  // The extra one rank is for collnet root (i.e. network)
  NCCLCHECK(ncclCalloc(&channel->peers, comm->nRanks+1));
  for (size_t i=0; i<comm->nRanks+1; ++i) {
    channel->peers[i].send.comm = comm;
    channel->peers[i].recv.comm = comm;
  }

  // Per-channel operation list.
  NCCLCHECK(ncclCudaHostCalloc(&channel->collectives, NCCL_MAX_OPS));
  return ncclSuccess;
}

template <typename T>
static ncclResult_t ncclCudaHostCalloc(T** ptr, size_t nelem) {
  CUDACHECK(cudaHostAlloc(ptr, nelem*sizeof(T), cudaHostAllocMapped));
  memset(*ptr, 0, nelem*sizeof(T));
  return ncclSuccess;
}
```
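As a standalone illustration of mapped page-locked allocation (assuming a CUDA toolkit is available to compile against), a minimal sketch could look like this; on UVA systems the host pointer itself is already usable on the device, and cudaHostGetDevicePointer is shown only for completeness.

```cpp
// Minimal standalone sketch of mapped page-locked allocation, independent of NCCL.
#include <cuda_runtime.h>
#include <cstring>
#include <cstdio>

int main() {
  void* hostPtr = nullptr;
  if (cudaHostAlloc(&hostPtr, 1024 * sizeof(int), cudaHostAllocMapped) != cudaSuccess) {
    fprintf(stderr, "cudaHostAlloc failed\n");
    return 1;
  }
  memset(hostPtr, 0, 1024 * sizeof(int));          // host writes the pinned memory directly

  void* devPtr = nullptr;
  cudaHostGetDevicePointer(&devPtr, hostPtr, 0);   // device-visible alias of the same memory

  printf("host %p device %p\n", hostPtr, devPtr);
  cudaFreeHost(hostPtr);
  return 0;
}
```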
Then starting from the current rank, it writes the ring to userRanks.
```cpp
static ncclResult_t setupChannel(struct ncclComm* comm, int channelId, int rank, int nranks, int* ringRanks) {
  TRACE(NCCL_INIT, "rank %d nranks %d", rank, nranks);
  NCCLCHECK(initChannel(comm, channelId));

  struct ncclRing* ring = &comm->channels[channelId].ring;
  // Reorganize ranks to start with rank.
  int shift;
  for (shift = 0; shift<nranks; shift++) {
    if (ringRanks[shift] == rank) {
      break;
    }
  }
  for (int i=0; i<nranks; i++) {
    ring->userRanks[i] = ringRanks[(i+shift)%nranks];
  }
  return ncclSuccess;
}
```
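For instance, with a hypothetical combined ring {0, 7, 6, 3} and current rank 6, the reordering loop yields userRanks = {6, 3, 0, 7}. A standalone version of the same rotation, with purely illustrative example values:

```cpp
// Standalone illustration of the reordering above: rotate the ring so it
// starts at the current rank. The input values are hypothetical examples.
#include <cstdio>

int main() {
  const int nranks = 4;
  int ringRanks[nranks] = {0, 7, 6, 3};  // hypothetical combined ring
  int rank = 6;                          // current rank
  int userRanks[nranks];

  int shift = 0;
  while (shift < nranks && ringRanks[shift] != rank) shift++;
  for (int i = 0; i < nranks; i++) userRanks[i] = ringRanks[(i + shift) % nranks];

  for (int i = 0; i < nranks; i++) printf("%d ", userRanks[i]);  // prints: 6 3 0 7
  printf("\n");
  return 0;
}
```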
Then it executes ncclTransportP2pSetup to establish communication links between the current rank and its prev and next.
At this point, the connection of channels between machines is complete. In the next section, we’ll look at how the actual communication links are established.