This is a Sonnet 3.6 translation of a Chinese article. Please be mindful of potential translation errors.
In the last section, we discussed how we calculated the optimal paths from GPU and NIC nodes to any other nodes. This section examines the channel search process in NCCL.
In NCCL, a channel represents a communication path. Multiple channels are used to better utilize bandwidth and network cards, and allow concurrent communication of the same data through multiple channels. Additionally, as we’ll see later, one channel corresponds to one GPU SM. For these reasons, NCCL uses multiple channels, and the search process aims to find a group of channels.
As mentioned in the previous section, in single-machine scenarios, NICs are removed in the ncclTopoTrimSystem function. Let’s first look at the simplified case of a single machine with eight GPUs, and then examine the multi-machine scenario with NICs.
ncclTopoSearchInit initializes system->maxWidth. For single-GPU cases, maxWidth is set to LOC_WIDTH; otherwise, it iterates through each GPU node to check the maximum bandwidth to all other GPU nodes or NICs.
NCCL supports three algorithms for collective communication: ring, tree, and collnet. Here we’ll use ring as an example, with dedicated coverage of ring and tree later.
ncclTopoGraph records the search results, with specific meanings in the comments.
Let’s examine ncclTopoCompute, which is the actual channel search process. The goal is to find as many channels as possible with maximum bandwidth. It’s essentially a brute-force search, starting with strict conditions and gradually relaxing them if no solution is found.
Without NET nodes, crossNic is 0. The graph is initialized with the highest conditions, limiting internal node paths to PATH_NVL and inter-node paths to PATH_PIX. speedIntra and speedInter are set through system->maxWidth, then ncclTopoSearchRec searches for a solution stored in tmpGraph.
If the optimal result is found (channel count equals maxChannel and speedInter equals maxWidth), it exits. Otherwise, it gradually relaxes conditions by:
Setting sameChannel to 0, allowing different channels
ncclResult_tncclTopoCompute(ncclTopoSystem*system,structncclTopoGraph*graph){intngpus=system->nodes[GPU].count;intcrossNic=(system->nodes[NET].count>1)&&graph->crossNic?1:0;graph->speedIntra=graph->speedInter=0;if(graph->crossNic==2)graph->crossNic=0;graph->typeIntra=ngpus==1?PATH_LOC:PATH_NVL;graph->typeInter=PATH_PIX;graph->nChannels=0;graph->sameChannels=1;if(ngpus==1)if(graph->pattern!=NCCL_TOPO_PATTERN_RING)graph->pattern=NCCL_TOPO_PATTERN_TREE;structncclTopoGraphtmpGraph;memcpy(&tmpGraph,graph,sizeof(structncclTopoGraph));// First try crossnic, then decrease speed and finally increase speedIntra.
tmpGraph.pattern=graph->pattern;intpass=1;intspeedIndex=0;while(speedArray[speedIndex]>system->maxWidth&&speedIndex<NSPEEDS-1)speedIndex++;tmpGraph.speedIntra=tmpGraph.speedInter=speedArray[speedIndex];int64_tglobalTimeout=NCCL_SEARCH_GLOBAL_TIMEOUT;search:inttime=tmpGraph.sameChannels?NCCL_SEARCH_TIMEOUT_SAMECHANNELS:tmpGraph.pattern==NCCL_TOPO_PATTERN_TREE?NCCL_SEARCH_TIMEOUT_TREE:NCCL_SEARCH_TIMEOUT;tmpGraph.nChannels=0;globalTimeout-=time;NCCLCHECK(ncclTopoSearchRec(system,&tmpGraph,graph,&time));// Optimal solution, stop here
if(graph->nChannels==graph->maxChannels&&graph->speedInter==system->maxWidth)gotodone;if(pass==1){// First pass, we don't have a solution yet ; try other options
// Try having different channels
if(tmpGraph.sameChannels==1){tmpGraph.sameChannels=0;gotosearch;}tmpGraph.sameChannels=1;if(time!=-1)globalTimeout+=time;elseglobalTimeout=NCCL_SEARCH_GLOBAL_TIMEOUT;if(globalTimeout<0)gotodone;intmaxTypeIntra=system->nodes[NET].count>0?tmpGraph.typeInter:PATH_SYS;if(tmpGraph.typeIntra<maxTypeIntra&&(graph->nChannels==0||tmpGraph.typeIntra<graph->typeIntra)){tmpGraph.typeIntra+=1;gotosearch;}tmpGraph.typeIntra=ngpus==1?PATH_LOC:PATH_NVL;if(system->nodes[NET].count>0&&tmpGraph.typeInter<PATH_SYS&&(graph->nChannels==0||tmpGraph.typeInter<graph->typeInter||tmpGraph.typeInter<PATH_PXB)){tmpGraph.typeInter+=1;gotosearch;}tmpGraph.typeInter=PATH_PIX;// Try a simpler tree
if(tmpGraph.pattern==NCCL_TOPO_PATTERN_SPLIT_TREE_LOOP){tmpGraph.pattern=NCCL_TOPO_PATTERN_SPLIT_TREE;gotosearch;}if(tmpGraph.pattern==NCCL_TOPO_PATTERN_SPLIT_TREE){tmpGraph.pattern=NCCL_TOPO_PATTERN_TREE;gotosearch;}tmpGraph.pattern=graph->pattern;if(crossNic&&tmpGraph.crossNic==0){// Try again with crossNic if permitted
tmpGraph.crossNic=crossNic;gotosearch;}tmpGraph.crossNic=0;// Decrease speed until we find a solution
if((speedIndex<NSPEEDS-1)&&(graph->nChannels==0||(speedArray[speedIndex+1]/graph->speedInter>.49))){tmpGraph.speedInter=tmpGraph.speedIntra=speedArray[++speedIndex];gotosearch;}speedIndex=0;while(speedArray[speedIndex]>system->maxWidth&&speedIndex<NSPEEDS-1)speedIndex++;tmpGraph.speedIntra=tmpGraph.speedInter=speedArray[speedIndex];}done:// We have a solution. Start from that solution and move to pass 2.
if(pass==1){time=-1;memcpy(&tmpGraph,graph,sizeof(tmpGraph));speedIndex=0;while(speedArray[speedIndex]>graph->speedInter&&speedIndex<NSPEEDS-1)speedIndex++;tmpGraph.speedIntra=tmpGraph.speedInter=speedArray[speedIndex];tmpGraph.minChannels=graph->nChannels;pass=2;}// 3. See if we can increase speedIntra for trees (2 nodes or collnet)
if(pass==2){if(time!=0&&graph->pattern!=NCCL_TOPO_PATTERN_RING&&tmpGraph.speedIntra==graph->speedIntra&&tmpGraph.speedIntra<tmpGraph.speedInter*2&&speedIndex>0){tmpGraph.speedIntra=speedArray[--speedIndex];gotosearch;}time=-1;memcpy(&tmpGraph,graph,sizeof(tmpGraph));}if(graph->nChannels==0&&graph->collNet==0){WARN("Could not find a path for pattern %d, falling back to simple order\n",graph->pattern);for(inti=0;i<ngpus;i++)graph->intra[i]=system->nodes[GPU].nodes[i].gpu.rank;graph->inter[0]=graph->inter[1]=0;graph->speedIntra=graph->speedInter=0.1;graph->typeIntra=graph->typeInter=PATH_SYS;graph->nChannels=1;}if(graph->speedIntra>=25.0){intdupChannels=std::min(graph->nChannels*2,graph->maxChannels);memcpy(graph->intra+graph->nChannels*ngpus,graph->intra,(dupChannels-graph->nChannels)*ngpus*sizeof(int));memcpy(graph->inter+graph->nChannels*2,graph->inter,(dupChannels-graph->nChannels)*2*sizeof(int));graph->speedIntra/=DIVUP(dupChannels,graph->nChannels);graph->speedInter/=DIVUP(dupChannels,graph->nChannels);graph->nChannels=dupChannels;}returnncclSuccess;}
The channel search begins, which for ringGraph means finding a series of rings where each rank corresponds to a node in the ring, recording prev and next. This is a backtracking process where each ncclTopoSearchRec execution finds one ring. ncclTopoSearchTryGpu checks if the next selected point is reachable, and ncclTopoSearchRecGpu finds the next GPU.
ncclResult_tncclTopoSearchRec(structncclTopoSystem*system,structncclTopoGraph*graph,structncclTopoGraph*saveGraph,int*time){intbackToNet,backToFirstRank;NCCLCHECK(ncclTopoSearchParams(system,graph->pattern,&backToNet,&backToFirstRank));if(system->nodes[NET].count){// Start from NET
ncclTopoSearchRecNet(system,graph,saveGraph,backToNet,backToFirstRank,time);}else{// Intra-node only.
if(graph->nChannels==0){// Try PCI order first
NCCLCHECK(ncclTopoSearchTryGpu(system,graph,saveGraph,0,backToNet,backToFirstRank,FORCED_ORDER_PCI,time,-1,-1,0));}else{// Also try to replay previous channel
intg;NCCLCHECK(ncclTopoReplayGetGpu(system,graph,-1,&g));NCCLCHECK(ncclTopoSearchTryGpu(system,graph,saveGraph,0,backToNet,backToFirstRank,FORCED_ORDER_REPLAY,time,-1,-1,g));}if(graph->sameChannels==0||graph->nChannels==0){// Finally, try all other possibilities unless we are forced to use the same channels
for(intg=0;g<system->nodes[GPU].count;g++){NCCLCHECK(ncclTopoSearchTryGpu(system,graph,saveGraph,0,backToNet,backToFirstRank,0,time,-1,-1,g));}}}returnncclSuccess;}
ncclTopoSearchParams sets backToNet and backToFirstRank parameters. In single-machine eight-GPU ringGraph scenarios, these are set to -1 and 7 respectively. With nchannel at 0, ncclTopoSearchTryGpu executes in PCI order (devid order), starting from dev0.
ncclTopoSearchTryGpu then executes, checking if the next point is reachable. With type -1, ncclTopoFollowPath sets gpu to card 0, directly executing ncclTopoSearchRecGpu starting from card 0, with step 0.
ncclResult_tncclTopoSearchRecGpu(structncclTopoSystem*system,structncclTopoGraph*graph,structncclTopoGraph*saveGraph,structncclTopoNode*gpu,intstep,intbackToNet,intbackToFirstRank,intforcedOrder,int*time){if((*time)<=0)returnncclSuccess;(*time)--;intngpus=system->nodes[GPU].count;if(step==ngpus){...}graph->intra[graph->nChannels*ngpus+step]=gpu->gpu.rank;intg=gpu-system->nodes[GPU].nodes;if(step==backToNet){...}elseif(step<system->nodes[GPU].count-1){// Go to next GPU
intnext[NCCL_TOPO_MAX_NODES];intcount;if(forcedOrder==FORCED_ORDER_PCI){// Try the PCI order
next[0]=step+1;count=1;}elseif(forcedOrder==FORCED_ORDER_REPLAY){// Try last channel order
NCCLCHECK(ncclTopoReplayGetGpu(system,graph,step,next));count=1;}else{// Normal search
NCCLCHECK(ncclTopoSearchNextGpuSort(system,graph,gpu,next,&count,backToNet==-1?0:backToNet==step+1?1:-1));}for(inti=0;i<count;i++){NCCLCHECK(ncclTopoSearchTryGpu(system,graph,saveGraph,step+1,backToNet,backToFirstRank,forcedOrder,time,GPU,g,next[i]));}}elseif(step==backToFirstRank){...}else{// Next path
NCCLCHECK(ncclTopoSearchRecGpu(system,graph,saveGraph,gpu,ngpus,-1,-1,forcedOrder,time));}returnncclSuccess;}
ncclTopoSearchRecGpu selects the next node. It first writes card 0 node to the corresponding position in graph->intra. With step 0, it selects the next GPU, with next array showing candidate GPU nodes. With forcedOrder == FORCED_ORDER_PCI, there’s only one candidate: card 1. ncclTopoSearchTryGpu executes for all candidates to check feasibility and continue node selection.
Returning to ncclTopoSearchRec, it checks if card 1 is reachable. ncclTopoFollowPath determines if you can reach from type1’s index1 node to type2’s index2 node. When selecting the starting point, type1 being -1 means node is directly set to type2’s index2. When checking gpu0 to gpu1 accessibility, it gets the path from index1 to index2. For GPU types, speed is set to graph->speedIntra. The mult parameter indicates whether to add or subtract speed from the path - subtracting when searching down the ring, adding when backtracking.
staticncclResult_tncclTopoFollowPath(structncclTopoSystem*system,structncclTopoGraph*graph,inttype1,intindex1,inttype2,intindex2,intmult,structncclTopoNode**node){// First handle easy cases
*node=system->nodes[type2].nodes+index2;if(type1==-1)returnncclSuccess;structncclTopoNode*node1=system->nodes[type1].nodes+index1;structncclTopoLinkList*path=node1->paths[type2]+index2;if(path->count==0)returnncclSuccess;// Now check link type
*node=NULL;intintra=type1==GPU&&type2==GPU;floatspeed=intra?graph->speedIntra:graph->speedInter;inttype=intra?graph->typeIntra:graph->typeInter;if(mult==1&&(path->type>type))returnncclSuccess;speed*=mult;// Check there is enough bandwidth on paths.
intstep=0;NCCLCHECK(followPath(path,node1,path->count,speed,&step));if(step<path->count)gotorewind;// Enough bandwidth : return destination node.
graph->nHops+=mult*path->count;*node=system->nodes[type2].nodes+index2;returnncclSuccess;rewind:// Not enough bandwidth : rewind and exit.
NCCLCHECK(followPath(path,node1,step,-speed,&step));returnncclSuccess;}
It recursively executes ncclTopoSearchRecGpu until gpu7, resulting in the first ring [0,1,2,3,4,5,6,7] in graph->intra. With step at backToFirstRank, it gets the first gpu (gpu0) and checks if 7 to 0 is reachable. If reachable, it continues recursive execution. At step == ngpus, it updates the optimal saveGraph based mainly on total bandwidth (ring count times speedIntra). If the ring count reaches maxChannel, the search ends; otherwise, it continues searching for the next ring.
For the next ring search, returning to ncclTopoSearchRec, it attempts to copy the previous ring. ncclTopoReplayGetGpu gets the step + 1 gpu from the previous ring (gpu0), continuing with ncclTopoSearchTryGpu set to FORCED_ORDER_REPLAY.
ncclResult_tncclTopoSearchRec(structncclTopoSystem*system,structncclTopoGraph*graph,structncclTopoGraph*saveGraph,int*time){{// Also try to replay previous channel
intg;NCCLCHECK(ncclTopoReplayGetGpu(system,graph,-1,&g));NCCLCHECK(ncclTopoSearchTryGpu(system,graph,saveGraph,0,backToNet,backToFirstRank,FORCED_ORDER_REPLAY,time,-1,-1,g));}}ncclResult_tncclTopoReplayGetGpu(structncclTopoSystem*system,structncclTopoGraph*graph,intstep,int*g){*g=-1;if(graph->nChannels==0)returnncclInternalError;intngpus=system->nodes[GPU].count;intnextRank=graph->intra[(graph->nChannels-1)*ngpus+step+1];for(inti=0;i<ngpus;i++)if(system->nodes[GPU].nodes[i].gpu.rank==nextRank){*g=i;returnncclSuccess;}if(*g==-1)returnncclInternalError;returnncclSuccess;}
FORCED_ORDER_REPLAY uses ncclTopoReplayGetGpu to get the corresponding step’s gpu from the previous ring, essentially copying the previous ring.
ncclResult_tncclTopoSearchRecGpu(structncclTopoSystem*system,structncclTopoGraph*graph,structncclTopoGraph*saveGraph,structncclTopoNode*gpu,intstep,intbackToNet,intbackToFirstRank,intforcedOrder,int*time){...elseif(step<system->nodes[GPU].count-1){// Go to next GPU
intnext[NCCL_TOPO_MAX_NODES];intcount;if(forcedOrder==FORCED_ORDER_PCI){// Try the PCI order
next[0]=step+1;count=1;}elseif(forcedOrder==FORCED_ORDER_REPLAY){// Try last channel order
NCCLCHECK(ncclTopoReplayGetGpu(system,graph,step,next));count=1;}else{// Normal search
NCCLCHECK(ncclTopoSearchNextGpuSort(system,graph,gpu,next,&count,backToNet==-1?0:backToNet==step+1?1:-1));}for(inti=0;i<count;i++){NCCLCHECK(ncclTopoSearchTryGpu(system,graph,saveGraph,step+1,backToNet,backToFirstRank,forcedOrder,time,GPU,g,next[i]));}}...}
This completes the first search. As mentioned, if results don’t meet conditions, conditions are gradually relaxed for continued searching.
In multi-machine scenarios (e.g., two machines, sixteen cards), with NICs present, ncclTopoSearchParams sets backToFirstRank = -1 and backToNet = 7, with ncclTopoSearchRec directly executing ncclTopoSearchRecNet.
cpp6 lines hidden
1
2
3
4
5
6
7
8
9
ncclResult_tncclTopoSearchRec(structncclTopoSystem*system,structncclTopoGraph*graph,structncclTopoGraph*saveGraph,int*time){intbackToNet,backToFirstRank;NCCLCHECK(ncclTopoSearchParams(system,graph->pattern,&backToNet,&backToFirstRank));if(system->nodes[NET].count){// Start from NET
ncclTopoSearchRecNet(system,graph,saveGraph,backToNet,backToFirstRank,time);}...}
ncclTopoSearchRecNet searches for a solution by trying each NIC as a starting point. Starting with NIC 0, it writes 0 to the first channel in inter, reduces NIC 0’s bandwidth by speedInter, decrements maxChannel, then searches for a ring through ncclTopoSearchTryGpu.
ncclResult_tncclTopoSearchRecNet(structncclTopoSystem*system,structncclTopoGraph*graph,structncclTopoGraph*saveGraph,intbackToNet,intbackToFirstRank,int*time){constintspeed=graph->speedInter;for(intn=0;n<system->nodes[NET].count;n++){structncclTopoNode*net=system->nodes[NET].nodes+n;structncclTopoNode*gpu;if(graph->collNet&&net->net.collSupport==0)continue;if(net->net.width<speed)continue;if(net->net.maxChannels==0)continue;graph->inter[graph->nChannels*2]=net->id;for(inti=0;i<system->nodes[NET].count;i++){if((system->nodes[NET].nodes[i].net.asic==net->net.asic)&&(system->nodes[NET].nodes[i].net.port==net->net.port)){system->nodes[NET].nodes[i].net.width-=speed;}}net->net.maxChannels--;// First try to replay the last channel
if(graph->nChannels>0){intg;NCCLCHECK(ncclTopoReplayGetGpu(system,graph,-1,&g));NCCLCHECK(ncclTopoSearchTryGpu(system,graph,saveGraph,0,backToNet,backToFirstRank,FORCED_ORDER_REPLAY,time,NET,n,g));}if(graph->nChannels==0||graph->sameChannels==0){if(graph->nChannels==0){// Always try the PCI order first to set a reference, but don't count in the timeout nor let it run for long
intt=1<<10;NCCLCHECK(ncclTopoSearchTryGpu(system,graph,saveGraph,0,backToNet,backToFirstRank,FORCED_ORDER_PCI,&t,NET,n,0));if(t==-1)*time=-1;}...}returnncclSuccess;}
ncclTopoSearchTryGpu calls ncclTopoSearchRecGpu, recursively executing to fill graph->intra until all GPU nodes are traversed. At step 7 (backToNet), it retrieves the starting NIC (NIC 0). With crossNic support, any legal NIC works; otherwise, it checks if NIC 0 is legal. If legal, NIC 0 fills graph->inter, completing one ring. One unclear point is that after selecting the exit NIC, its bandwidth isn’t reduced by speed.
Back in ncclTopoSearchRecNet, it attempts to copy the searched ring. After finding a solution, it returns to the first ncclTopoSearchRecNet, trying to search from the GPU closest to NIC 0 (GPUn) instead of GPU0. It first checks if bidirectional bandwidth between GPUn and PCIe switch is available. This differs from the comments suggesting a GPU shouldn’t both send and receive (the bandwidth impact here is unclear).
ncclResult_tncclTopoSearchRecNet(structncclTopoSystem*system,structncclTopoGraph*graph,structncclTopoGraph*saveGraph,intbackToNet,intbackToFirstRank,int*time){constintspeed=graph->speedInter;for(intn=0;n<system->nodes[NET].count;n++){...// Then try the most local GPUs
floatmaxWidth=0;intminHops=0xfffffff;structncclTopoLinkList*paths=net->paths[GPU];for(intg=0;g<system->nodes[GPU].count;g++){if(paths[g].width>maxWidth){maxWidth=paths[g].width;minHops=paths[g].count;}elseif(paths[g].width==maxWidth&&paths[g].count<minHops){minHops=paths[g].count;}}if(maxWidth>=speed){// In the first loop, avoid using GPUs in both directions between channels (one channel
// sending from that GPU and one channel receiving to that GPU), since that usually leads
// to lower BW.
for(inttryGpuBidir=0;tryGpuBidir<2;tryGpuBidir++){for(intg=0;g<system->nodes[GPU].count;g++){if(paths[g].width==maxWidth&&paths[g].count==minHops){gpu=system->nodes[GPU].nodes+g;intgpuUsed=gpuPciWidth(gpu)>0?0:1;if(tryGpuBidir==gpuUsed){NCCLCHECK(ncclTopoSearchTryGpu(system,graph,saveGraph,0,backToNet,backToFirstRank,0,time,NET,n,g));}}}}}}net->net.maxChannels++;for(inti=0;i<system->nodes[NET].count;i++){if((system->nodes[NET].nodes[i].net.asic==net->net.asic)&&(system->nodes[NET].nodes[i].net.port==net->net.port)){system->nodes[NET].nodes[i].net.width+=speed;}}}returnncclSuccess;}
This completes the channel search. In summary, this section describes searching for channels based on machine topology for data communication, recording results in ncclTopoGraph.
NCCL Source Code Study - This article is part of a series.