This is a Sonnet 3.6 translation of a Chinese article. Please be mindful of potential translation errors.
Last time we analyzed how NCCL performs topology analysis of the machine’s PCI system, producing results in XML format. Next, NCCL will establish a graph based on this XML to facilitate subsequent path searching.
At the end of ncclTopoGetSystem, ncclTopoGetSystemFromXml is executed to convert the XML representation into a graph.
First, it gets the root node “system” from the XML, then traverses the “cpu” child nodes. For each CPU, graph construction is performed through ncclTopoAddCpu. Here, each CPU actually represents a NUMA node.
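The overall flow of ncclTopoGetSystemFromXml is roughly the following (a sketch based on the NCCL 2.7-era source; newer versions handle additional node types):

ncclResult_t ncclTopoGetSystemFromXml(struct ncclXml* xml, struct ncclTopoSystem** topoSystem) {
  NCCLCHECK(ncclCalloc(topoSystem, 1));
  struct ncclXmlNode* topNode;
  NCCLCHECK(xmlFindTag(xml, "system", &topNode));            // root node
  for (int s = 0; s < topNode->nSubs; s++) {
    struct ncclXmlNode* node = topNode->subs[s];
    // Each "cpu" child is one NUMA node
    if (strcmp(node->name, "cpu") == 0) NCCLCHECK(ncclTopoAddCpu(node, *topoSystem));
  }
  NCCLCHECK(ncclTopoAddNvLinks(topNode, *topoSystem, NULL)); // add NVLink edges (covered below)
  NCCLCHECK(ncclTopoConnectCpus(*topoSystem));               // connect CPU nodes pairwise (QPI etc.)
  NCCLCHECK(ncclTopoSortSystem(*topoSystem));                // sort each node's links for path search
  return ncclSuccess;
}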
Inside ncclTopoAddCpu, a CPU node is created with numaid as its ID; the CPU's affinity (the cores belonging to that NUMA node) is set, along with the CPU vendor and other details.
Next, it traverses the CPU node's child nodes and dispatches to different functions based on their type. If a child is a PCI node, it executes ncclTopoAddPci.
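A condensed sketch of ncclTopoAddCpu, assuming the NCCL 2.7-era structure (attribute parsing abridged):

ncclResult_t ncclTopoAddCpu(struct ncclXmlNode* xmlCpu, struct ncclTopoSystem* system) {
  int numaId;
  NCCLCHECK(xmlGetAttrInt(xmlCpu, "numaid", &numaId));
  struct ncclTopoNode* cpu;
  NCCLCHECK(ncclTopoCreateNode(system, &cpu, CPU, numaId));  // one CPU node per NUMA domain
  // Parse "affinity" (the cores of this NUMA node) into cpu->cpu.affinity,
  // plus "arch"/"vendor"/"model" via kvConvertToInt -- omitted here.
  for (int s = 0; s < xmlCpu->nSubs; s++) {
    struct ncclXmlNode* node = xmlCpu->subs[s];
    if (strcmp(node->name, "pci") == 0) NCCLCHECK(ncclTopoAddPci(node, system, cpu));
    if (strcmp(node->name, "nic") == 0) {
      // NICs hanging directly off the CPU are merged into a single NIC node with id 0
      struct ncclTopoNode* nic = NULL;
      NCCLCHECK(ncclTopoGetNode(system, &nic, NIC, 0));
      if (nic == NULL) {
        NCCLCHECK(ncclTopoCreateNode(system, &nic, NIC, 0));
        NCCLCHECK(ncclTopoConnectNodes(cpu, nic, LINK_PCI, LOC_WIDTH));
        NCCLCHECK(ncclTopoConnectNodes(nic, cpu, LINK_PCI, LOC_WIDTH));
      }
      NCCLCHECK(ncclTopoAddNic(node, system, nic));
    }
  }
  return ncclSuccess;
}

The ncclTopoAddPci it dispatches to is shown next.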
ncclResult_t ncclTopoAddPci(struct ncclXmlNode* xmlPci, struct ncclTopoSystem* system, struct ncclTopoNode* parent) {
  const char* str;

  int type;
  NCCLCHECK(xmlGetAttrStr(xmlPci, "class", &str));
  NCCLCHECK(kvConvertToInt(str, &type, kvDictPciClass));

  int64_t busId;
  NCCLCHECK(xmlGetAttrStr(xmlPci, "busid", &str));
  NCCLCHECK(busIdToInt64(str, &busId));

  struct ncclTopoNode* node = NULL;
  if (type == GPU) {
    struct ncclXmlNode* xmlGpu;
    NCCLCHECK(xmlGetSub(xmlPci, "gpu", &xmlGpu));
    if (xmlGpu == NULL) return ncclSuccess;
    int index;
    NCCLCHECK(xmlGetAttrIndex(xmlGpu, "rank", &index));
    if (index == -1) return ncclSuccess;
    NCCLCHECK(ncclTopoCreateNode(system, &node, type, busId));
    NCCLCHECK(ncclTopoAddGpu(xmlGpu, system, node));
  }
  if (type == NIC) {
    struct ncclXmlNode* xmlNic;
    NCCLCHECK(xmlGetSub(xmlPci, "nic", &xmlNic));
    if (xmlNic == NULL) return ncclSuccess;

    // Ignore sub device ID and merge multi-port NICs into one PCI device.
    busId &= 0xfffffffffffffff0;
    struct ncclTopoNode* nicNode = NULL;
    NCCLCHECK(ncclTopoGetNode(system, &nicNode, type, busId));
    if (nicNode == NULL) {
      NCCLCHECK(ncclTopoCreateNode(system, &nicNode, type, busId));
      node = nicNode; // Connect it to parent later on
    }
    NCCLCHECK(ncclTopoAddNic(xmlNic, system, nicNode));
  } else if (type == PCI) {
    NCCLCHECK(ncclTopoCreateNode(system, &node, type, busId));
    for (int s = 0; s < xmlPci->nSubs; s++) {
      struct ncclXmlNode* xmlSubPci = xmlPci->subs[s];
      NCCLCHECK(ncclTopoAddPci(xmlSubPci, system, node));
    }
  }

  if (node) {
    int width, speed;
    NCCLCHECK(xmlGetAttrInt(xmlPci, "link_width", &width));
    NCCLCHECK(xmlGetAttrStr(xmlPci, "link_speed", &str));

    // Manage cases where speed was not indicated in /sys
    if (width == 0) width = 16;
    NCCLCHECK(kvConvertToInt(str, &speed, kvDictPciGen)); // Values in 100Mbps, per lane (we want GB/s in the end)

    NCCLCHECK(ncclTopoConnectNodes(node, parent, LINK_PCI, width*speed/80.0));
    NCCLCHECK(ncclTopoConnectNodes(parent, node, LINK_PCI, width*speed/80.0));
  }
  return ncclSuccess;
}
First, it gets the PCI type and busId, then checks the type. If it’s PCI, it creates a PCI node and recursively executes ncclTopoAddPci until it encounters NIC or GPU XML nodes.
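As a concrete example of the bandwidth attached to these PCI edges: assuming kvDictPciGen maps the Gen3 rate "8 GT/s" to 60 (in units of 100 Mbps per lane), a Gen3 x16 link gets width * speed / 80.0 = 16 * 60 / 80.0 = 12 GB/s as the width of both directed edges between the node and its parent.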
If it encounters a NIC, it creates a NIC node and executes ncclTopoAddNic, which traverses the XML net nodes under the XML NIC, creates a net node for each one with dev as the ID, and sets speed, port, GDR, and other properties.
ncclResult_t ncclTopoAddNet(struct ncclXmlNode* xmlNet, struct ncclTopoSystem* system, struct ncclTopoNode* nic) {
  int dev;
  NCCLCHECK(xmlGetAttrInt(xmlNet, "dev", &dev));

  struct ncclTopoNode* net;
  NCCLCHECK(ncclTopoCreateNode(system, &net, NET, dev));
  const char* str;
  NCCLCHECK(xmlGetAttr(xmlNet, "guid", &str));
  if (str) sscanf(str, "0x%lx", &net->net.asic);
  else net->net.asic = dev;

  ncclDebugNoWarn = NCCL_GRAPH;
  int mbps;
  if (xmlGetAttrInt(xmlNet, "speed", &mbps) != ncclSuccess) mbps = 0;
  if (mbps <= 0) mbps = 10000; // Some NICs define speed = -1
  net->net.width = mbps / 8000.0;
  if (xmlGetAttrInt(xmlNet, "port", &net->net.port) != ncclSuccess) net->net.port = 0;
  if (xmlGetAttrInt(xmlNet, "gdr", &net->net.gdrSupport) != ncclSuccess) net->net.gdrSupport = 0;
  if (xmlGetAttrInt(xmlNet, "maxconn", &net->net.maxChannels) != ncclSuccess) net->net.maxChannels = MAXCHANNELS;
  if (xmlGetAttrInt(xmlNet, "coll", &net->net.collSupport) != ncclSuccess) net->net.collSupport = 0;
  ncclDebugNoWarn = 0;

  NCCLCHECK(ncclTopoConnectNodes(nic, net, LINK_NET, net->net.width));
  NCCLCHECK(ncclTopoConnectNodes(net, nic, LINK_NET, net->net.width));
  return ncclSuccess;
}

ncclResult_t ncclTopoAddNic(struct ncclXmlNode* xmlNic, struct ncclTopoSystem* system, struct ncclTopoNode* nic) {
  for (int s = 0; s < xmlNic->nSubs; s++) {
    struct ncclXmlNode* xmlNet = xmlNic->subs[s];
    if (strcmp(xmlNet->name, "net") != 0) continue;
    int index;
    NCCLCHECK(xmlGetAttrIndex(xmlNet, "dev", &index));
    if (index == -1) continue;
    NCCLCHECK(ncclTopoAddNet(xmlNet, system, nic));
  }
  return ncclSuccess;
}
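For example, a 100 Gbps NIC reports speed = 100000 (Mbps), so net->net.width = 100000 / 8000.0 = 12.5 GB/s; this value then becomes the width of the NIC <-> net edges created by ncclTopoConnectNodes below.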
Then it establishes bidirectional edges between the net node and the NIC node, sets the edge type, accumulates the bandwidth, and keeps the current node's edges sorted by bandwidth in descending order.
ncclResult_t ncclTopoConnectNodes(struct ncclTopoNode* node, struct ncclTopoNode* remNode, int type, float width) {
  // Aggregate links into higher width for NVLink
  struct ncclTopoLink* link;
  for (link = node->links; link->remNode; link++) {
    if (link->remNode == remNode && link->type == type) break;
  }
  if (link->remNode == NULL) node->nlinks++;
  link->type = type;
  link->remNode = remNode;
  link->width += width;

  // Sort links in BW descending order
  struct ncclTopoLink linkSave;
  memcpy(&linkSave, link, sizeof(struct ncclTopoLink));
  while (link != node->links) {
    if ((link-1)->width >= linkSave.width) break;
    memcpy(link, link-1, sizeof(struct ncclTopoLink));
    link--;
  }
  memcpy(link, &linkSave, sizeof(struct ncclTopoLink));
  return ncclSuccess;
}
At this point, NIC addition is complete. Back in ncclTopoAddPci, if it’s a GPU, it creates a GPU node and sets its rank, dev, GDR, and other properties. Finally, it establishes bidirectional edges between the current node and child nodes through ncclTopoConnectNodes.
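For reference, ncclTopoAddGpu itself is short; a sketch assuming the NCCL 2.7-era field names:

ncclResult_t ncclTopoAddGpu(struct ncclXmlNode* xmlGpu, struct ncclTopoSystem* system, struct ncclTopoNode* gpu) {
  NCCLCHECK(xmlGetAttrInt(xmlGpu, "sm", &gpu->gpu.cudaCompCap)); // compute capability, used later for NVLink speed
  NCCLCHECK(xmlGetAttrInt(xmlGpu, "rank", &gpu->gpu.rank));
  NCCLCHECK(xmlGetAttrInt(xmlGpu, "dev", &gpu->gpu.dev));
  NCCLCHECK(xmlGetAttrInt(xmlGpu, "gdr", &gpu->gpu.gdrSupport));
  // NVLinks are added in a separate pass (ncclTopoAddNvLinks below)
  return ncclSuccess;
}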
This completes the graph construction under each NUMA node. Next, NVLink and QPI connections are added; let's look at NVLink first.
ncclResult_t ncclTopoAddNvLinks(struct ncclXmlNode* node, struct ncclTopoSystem* system, const char* parentBusId) {
  if (strcmp(node->name, "nvlink") == 0) {
    struct ncclTopoNode* gpu = NULL;
    int64_t pBusId;
    NCCLCHECK(busIdToInt64(parentBusId, &pBusId));
    NCCLCHECK(ncclTopoGetNode(system, &gpu, GPU, pBusId));
    if (gpu == NULL) {
      WARN("Add NVLink error : could not find GPU %lx\n", pBusId);
      return ncclInternalError;
    }
    int count;
    NCCLCHECK(xmlGetAttrInt(node, "count", &count));
    const char* targetClass;
    NCCLCHECK(xmlGetAttrStr(node, "tclass", &targetClass));
    int targetType;
    NCCLCHECK(kvConvertToInt(targetClass, &targetType, kvDictPciClass));
    struct ncclTopoNode* remote = NULL;
    if (targetType == GPU) {
      // NVL P2P connection to another GPU
      const char* target;
      NCCLCHECK(xmlGetAttrStr(node, "target", &target));
      int64_t busId;
      NCCLCHECK(busIdToInt64(target, &busId));
      NCCLCHECK(ncclTopoGetNode(system, &remote, GPU, busId));
    } else if (targetType == CPU) {
      // NVL connection to the local CPU
      NCCLCHECK(findLocalCpu(gpu, &remote));
    } else {
      if (system->nodes[NVS].count == 0) {
        NCCLCHECK(ncclTopoCreateNode(system, &remote, NVS, 0));
      } else {
        remote = system->nodes[NVS].nodes;
      }
    }
    if (remote) {
      int nvlSpeed = gpu->gpu.cudaCompCap == 60 ? PASCAL_NVLINK_WIDTH : VOLTA_NVLINK_WIDTH;
      NCCLCHECK(ncclTopoConnectNodes(gpu, remote, LINK_NVL, count*nvlSpeed));
      if (remote->type != GPU) {
        NCCLCHECK(ncclTopoConnectNodes(remote, gpu, LINK_NVL, count*nvlSpeed));
      }
    }
  } else {
    const char* busId;
    NCCLCHECK(xmlGetAttr(node, "busid", &busId));
    for (int s = 0; s < node->nSubs; s++) {
      NCCLCHECK(ncclTopoAddNvLinks(node->subs[s], system, busId ? busId : parentBusId));
    }
  }
  return ncclSuccess;
}
It recursively traverses from the root node until it encounters an NVLink XML node, then looks up that NVLink's parent (a GPU node). Through tclass it gets the type of the peer PCI device. If the peer is a GPU or CPU, it looks up the peer node directly; if it's an NVSwitch, it first creates an NVSwitch node if none exists yet. It then creates edges between the GPU and the peer (for non-GPU peers both directions are added here; GPU-to-GPU links get the reverse edge when the peer GPU's own nvlink entries are processed). After that, ncclTopoConnectCpus connects the CPU nodes pairwise.
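ncclTopoConnectCpus is essentially a double loop over all CPU nodes; a sketch assuming the NCCL 2.7-era structure (the real code picks the width from constants like QPI_WIDTH / SKL_QPI_WIDTH / P9_WIDTH based on the CPU arch, vendor, and model):

ncclResult_t ncclTopoConnectCpus(struct ncclTopoSystem* system) {
  for (int n = 0; n < system->nodes[CPU].count; n++) {
    for (int p = 0; p < system->nodes[CPU].count; p++) {
      if (n == p) continue;
      float width = QPI_WIDTH;  // simplified: real code selects the width per CPU model
      NCCLCHECK(ncclTopoConnectNodes(system->nodes[CPU].nodes+n, system->nodes[CPU].nodes+p, LINK_SYS, width));
    }
  }
  return ncclSuccess;
}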
Finally, to facilitate subsequent channel searching, ncclTopoSort is used to recursively sort each node's edges into the order NVLink, downward PCI, upward PCI, QPI. Since the edges were already kept sorted by bandwidth as they were added, NVLink edges are already first and QPI last, so only the PCI edges in the middle need adjusting: the link back up to the parent is moved behind the downward links.
static ncclResult_t ncclTopoSort(struct ncclTopoNode* node, struct ncclTopoNode* upNode) {
  // Shift all links to have upLink as last link
  if (upNode) {
    int l = 0;
    while (node->links[l].remNode != upNode) l++;
    struct ncclTopoLink upLink;
    memcpy(&upLink, node->links+l, sizeof(struct ncclTopoLink));
    while (node->links[l+1].remNode) {
      memcpy(node->links+l, node->links+l+1, sizeof(struct ncclTopoLink));
      l++;
    }
    memcpy(node->links+l, &upLink, sizeof(struct ncclTopoLink));
  }

  // Recursively sort the PCI tree
  for (int l = 0; l < node->nlinks; l++) {
    struct ncclTopoLink* link = node->links+l;
    if (link->type == LINK_PCI && link->remNode != upNode) NCCLCHECK(ncclTopoSort(link->remNode, node));
  }
  return ncclSuccess;
}
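The entry point for this sort is ncclTopoSortSystem, which starts the recursion from every CPU (NUMA) node with upNode set to NULL:

ncclResult_t ncclTopoSortSystem(struct ncclTopoSystem* system) {
  // CPU nodes are the roots of the PCI trees, so they have no upward link
  for (int n = 0; n < system->nodes[CPU].count; n++)
    NCCLCHECK(ncclTopoSort(system->nodes[CPU].nodes+n, NULL));
  return ncclSuccess;
}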
This completes the entire graph construction process. To summarize, since the topology analysis output in XML format is not convenient for subsequent path searching, this section performed graph construction of the PCI system based on the XML.
NCCL Source Code Study - This article is part of a series.