This is a Sonnet 3.6 translation of a Chinese article. Please be mindful of potential translation errors.
Let’s continue from the last section where all nodes established bootstrap network connections, and now we’ll discuss topology analysis.
Since GPU machine architectures vary greatly, with a single machine potentially having multiple NICs and GPU cards with different interconnections, it’s necessary to analyze the device connection topology within machines to achieve optimal performance across various topological structures.
Continuing from before, let’s examine initTransportsRank.
```cpp
static ncclResult_t initTransportsRank(struct ncclComm* comm, ncclUniqueId* commId) {
  // We use 3 AllGathers
  // 1. { peerInfo, comm }
  // 2. ConnectTransport[nranks], ConnectValue[nranks]
  // 3. { nThreads, nrings, compCap, prev[MAXCHANNELS], next[MAXCHANNELS] }
  int rank = comm->rank;
  int nranks = comm->nRanks;
  uint64_t commHash = getHash(commId->internal, NCCL_UNIQUE_ID_BYTES);
  TRACE(NCCL_INIT, "comm %p, commHash %lx, rank %d nranks %d - BEGIN", comm, commHash, rank, nranks);
  NCCLCHECK(bootstrapInit(commId, rank, nranks, &comm->bootstrap));
  // AllGather1 - begin
  struct {
    struct ncclPeerInfo peerInfo;
    struct ncclComm* comm;
  } *allGather1Data;
  NCCLCHECK(ncclCalloc(&allGather1Data, nranks));
  allGather1Data[rank].comm = comm;
  struct ncclPeerInfo* myInfo = &allGather1Data[rank].peerInfo;
  NCCLCHECK(fillInfo(comm, myInfo, commHash));
  ...
}
```
This allocates nranks allGather1Data entries, then fills the current rank's peerInfo through fillInfo. ncclPeerInfo holds a rank's basic information, such as its rank number and which machine and process it belongs to.
```cpp
struct ncclPeerInfo {
  int rank;
  int cudaDev;
  int gdrSupport;
  uint64_t hostHash;
  uint64_t pidHash;
  dev_t shmDev;
  int64_t busId;
};

static ncclResult_t fillInfo(struct ncclComm* comm, struct ncclPeerInfo* info, uint64_t commHash) {
  info->rank = comm->rank;
  CUDACHECK(cudaGetDevice(&info->cudaDev));
  info->hostHash = getHostHash() + commHash;
  info->pidHash = getPidHash() + commHash;
  // Get the device MAJOR:MINOR of /dev/shm so we can use that
  // information to decide whether we can use SHM for inter-process
  // communication in a container environment
  struct stat statbuf;
  SYSCHECK(stat("/dev/shm", &statbuf), "stat");
  info->shmDev = statbuf.st_dev;
  info->busId = comm->busId;
  NCCLCHECK(ncclGpuGdrSupport(&info->gdrSupport));
  return ncclSuccess;
}
```
This fills ncclPeerInfo with the current card's rank, its PCIe busId, and the device number of /dev/shm, then checks GDR support through ncclGpuGdrSupport. RDMA requires memory to be registered before communication so that the NIC knows the virtual-to-physical address mapping. If data had to be copied from GPU memory to host memory for every communication, efficiency would be low. IB therefore provides a peer memory interface that lets the IB NIC access the PCIe space of other devices; NVIDIA implemented its own driver (nv_peermem) on top of this interface, so RDMA can register GPU memory directly and the IB NIC can DMA GPU memory without any host-device copy - this is GPUDirect RDMA (GDR).
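Conceptually, GDR boils down to whether a buffer allocated with cudaMalloc can be registered directly with the HCA. Below is a minimal standalone sketch at the ibverbs level - not NCCL code; it assumes an `ibv_pd* pd` has already been created, and the registration typically only succeeds when nv_peermem is loaded:

```cpp
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

// Hypothetical helper: try to register a GPU buffer directly with the HCA.
// With nv_peermem loaded, ibv_reg_mr() can pin device memory so the NIC can
// DMA it directly (GDR); without it, the call fails and data would have to be
// staged through host memory.
bool tryRegisterGpuBuffer(struct ibv_pd* pd, size_t bytes) {
  void* gpuPtr = nullptr;
  if (cudaMalloc(&gpuPtr, bytes) != cudaSuccess) return false;

  struct ibv_mr* mr = ibv_reg_mr(pd, gpuPtr, bytes,
                                 IBV_ACCESS_LOCAL_WRITE |
                                 IBV_ACCESS_REMOTE_WRITE |
                                 IBV_ACCESS_REMOTE_READ);
  bool ok = (mr != nullptr);
  if (mr) ibv_dereg_mr(mr);
  cudaFree(gpuPtr);
  return ok;  // true roughly corresponds to gdrSupport = 1
}
```

NCCL performs an equivalent check through its net plugin in ncclGpuGdrSupport: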
```cpp
static ncclResult_t ncclGpuGdrSupport(int* gdrSupport) {
  int netDevs;
  NCCLCHECK(ncclNetDevices(&netDevs));
  *gdrSupport = 0;
  for (int dev = 0; dev < netDevs; dev++) {
    // Find a net device which is GDR-capable
    ncclNetProperties_t props;
    NCCLCHECK(ncclNet->getProperties(dev, &props));
    if ((props.ptrSupport & NCCL_PTR_CUDA) == 0) continue;

    // Allocate memory on the GPU and try to register it on the NIC.
    void *lComm = NULL, *sComm = NULL, *rComm = NULL;
    ncclNetHandle_t handle;
    void* gpuPtr = NULL;
    void* mHandle = NULL;
    NCCLCHECK(ncclNetListen(dev, &handle, &lComm));
    NCCLCHECK(ncclNetConnect(dev, &handle, &sComm));
    NCCLCHECK(ncclNetAccept(lComm, &rComm));
    CUDACHECK(cudaMalloc(&gpuPtr, GPU_BUF_SIZE));
    ncclDebugNoWarn = NCCL_NET;
    if (ncclNetRegMr(sComm, gpuPtr, GPU_BUF_SIZE, NCCL_PTR_CUDA, &mHandle) == ncclSuccess) {
      NCCLCHECK(ncclNetDeregMr(sComm, mHandle));
      NCCLCHECK(ncclNetRegMr(rComm, gpuPtr, GPU_BUF_SIZE, NCCL_PTR_CUDA, &mHandle));
      NCCLCHECK(ncclNetDeregMr(rComm, mHandle));
      *gdrSupport = 1;
    }
    ncclDebugNoWarn = 0;
    CUDACHECK(cudaFree(gpuPtr));
    NCCLCHECK(ncclNetCloseRecv(rComm));
    NCCLCHECK(ncclNetCloseSend(sComm));
    NCCLCHECK(ncclNetCloseListen(lComm));
    break;
  }
  return ncclSuccess;
}
```
This iterates through each NIC to get its information. From the first section we know ncclNet here refers to ncclNetIb.
```cpp
ncclResult_t ncclIbGdrSupport(int ibDev) {
  static int moduleLoaded = -1;
  if (moduleLoaded == -1) {
    moduleLoaded = (access("/sys/kernel/mm/memory_peers/nv_mem/version", F_OK) == -1) ? 0 : 1;
  }
  if (moduleLoaded == 0) return ncclSystemError;
  return ncclSuccess;
}

ncclResult_t ncclIbGetProperties(int dev, ncclNetProperties_t* props) {
  props->name = ncclIbDevs[dev].devName;
  props->pciPath = ncclIbDevs[dev].pciPath;
  props->guid = ncclIbDevs[dev].guid;
  props->ptrSupport = NCCL_PTR_HOST;
  if (ncclIbGdrSupport(dev) != ncclSuccess) {
    INFO(NCCL_NET, "NET/IB : GPU Direct RDMA Disabled for HCA %d '%s' (no module)", dev, ncclIbDevs[dev].devName);
  } else {
    props->ptrSupport |= NCCL_PTR_CUDA;
  }
  props->speed = ncclIbDevs[dev].speed;
  props->port = ncclIbDevs[dev].port + ncclIbDevs[dev].realPort;
  props->maxComms = ncclIbDevs[dev].maxQp;
  return ncclSuccess;
}
```
This primarily gets the NIC name, PCIe path, GUID and other information, then checks whether /sys/kernel/mm/memory_peers/nv_mem/version exists to determine whether nv_peermem (NVIDIA's peer memory driver) is loaded. If it is, props->ptrSupport |= NCCL_PTR_CUDA is set, indicating that GPU memory can be registered.
Then attempts to register GPU memory. If registration succeeds, sets gdrSupport to 1. This actually creates RDMA connections, which we’ll cover separately later.
```cpp
static ncclResult_t initTransportsRank(struct ncclComm* comm, ncclUniqueId* commId) {
  ...
  NCCLCHECK(bootstrapAllGather(comm->bootstrap, allGather1Data, sizeof(*allGather1Data)));
  NCCLCHECK(ncclCalloc(&comm->peerInfo, nranks+1)); // Extra rank to represent CollNet root
  for (int i = 0; i < nranks; i++) {
    memcpy(comm->peerInfo+i, &allGather1Data[i].peerInfo, sizeof(struct ncclPeerInfo));
    if ((i != rank) && (comm->peerInfo[i].hostHash == myInfo->hostHash) && (comm->peerInfo[i].busId == myInfo->busId)) {
      WARN("Duplicate GPU detected : rank %d and rank %d both on CUDA device %x", rank, i, myInfo->busId);
      return ncclInvalidUsage;
    }
  }
  // AllGather1 data is used again below
  // AllGather1 - end

  // Topo detection / System graph creation
  NCCLCHECK(ncclTopoGetSystem(comm, &comm->topo));
  ...
}
```
Then bootstrapAllGather exchanges allGather1Data across all ranks, and the peer info obtained from the other ranks is copied into comm->peerInfo.
Before examining the specific topology analysis process, let’s briefly understand some PCIe concepts. Here’s a simple PCIe system example:
Each CPU has its own root complex (RC). The RC connects the CPU to the rest of the system, such as memory and the PCIe hierarchy: when the CPU issues an access to a physical address that falls into PCIe space, the RC converts it into PCIe requests.
Switches fan out PCIe ports and can connect to endpoint devices or to other switches; requests from downstream devices are forwarded upstream through them. PCIe devices can attach either to the RC or to a switch. Here's a switch's internal structure:
Internally, a switch contains a PCIe bus with several bridges, each leading to a port. The topmost port is the upstream port; the others are downstream ports.
The busId mentioned earlier in NCCL (for GPUs and IB NICs) is not really a bus number - it is the BDF (bus + device + function) ID used to locate a PCIe device. A bus can host multiple devices and each device can expose multiple functions, so a BDF uniquely identifies a device function. After the machine completes PCIe configuration, the related information is exposed to user space through sysfs, which is what NCCL relies on for topology detection. The busId string can be decomposed as shown below.
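For illustration, a small self-contained snippet splitting an extended BDF string of the same form NCCL stores (the value used here is made up):

```cpp
#include <cstdio>

// Parse an extended BDF string "DDDD:BB:dd.f" (domain:bus:device.function)
// into its components.
int main() {
  const char* busId = "0000:17:00.0";  // example value, same format as the sysfs busId
  unsigned domain, bus, device, function;
  if (sscanf(busId, "%x:%x:%x.%x", &domain, &bus, &device, &function) == 4) {
    printf("domain=%04x bus=%02x device=%02x function=%x\n", domain, bus, device, function);
    // prints: domain=0000 bus=17 device=00 function=0
  }
  return 0;
}
```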
Let's now examine ncclTopoGetSystem, the focus of this section. It builds the PCI tree for the current rank's machine in two steps: first it represents the entire PCI tree as XML, then it converts the XML into ncclTopoNode structures. The XML layer is defined as follows, with ncclXmlNode representing one node of the PCI tree:
```cpp
struct ncclXmlNode {
  char name[MAX_STR_LEN];
  struct {
    char key[MAX_STR_LEN];
    char value[MAX_STR_LEN];
  } attrs[MAX_ATTR_COUNT+1]; // Need an extra one to consume extra params
  int nAttrs;
  int type;
  struct ncclXmlNode* parent;
  struct ncclXmlNode* subs[MAX_SUBS];
  int nSubs;
};

struct ncclXml {
  struct ncclXmlNode nodes[MAX_NODES];
  int maxIndex;
};
```
ncclXmlNode represents a single node, recording its parent and all of its children; its name and attributes are set through xmlSetAttr and related helpers. ncclXml pre-allocates all nodes, with maxIndex indicating how many have been used so far. Here are some basic XML APIs:
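To make the attribute handling concrete, here is a simplified sketch of how such helpers can operate on the attrs array of the ncclXmlNode struct above (names, bounds checks and error handling differ from NCCL's real helpers; this is illustrative only):

```cpp
#include <cstring>

// Simplified sketches of attribute helpers over the ncclXmlNode struct above.
static int xmlAttrIndexSketch(struct ncclXmlNode* node, const char* key) {
  for (int a = 0; a < node->nAttrs; a++) {
    if (strncmp(node->attrs[a].key, key, MAX_STR_LEN) == 0) return a;
  }
  return -1;  // attribute not present
}

static const char* xmlAttrGetSketch(struct ncclXmlNode* node, const char* key) {
  int a = xmlAttrIndexSketch(node, key);
  return (a == -1) ? NULL : node->attrs[a].value;
}

static void xmlAttrSetSketch(struct ncclXmlNode* node, const char* key, const char* value) {
  int a = xmlAttrIndexSketch(node, key);
  if (a == -1) {               // append a new key/value pair (no bounds checking here)
    a = node->nAttrs++;
    strncpy(node->attrs[a].key, key, MAX_STR_LEN);
  }
  strncpy(node->attrs[a].value, value, MAX_STR_LEN);
}
```

With that in mind, back to the main entry point, ncclTopoGetSystem: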
```cpp
ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** system) {
  struct ncclXml* xml;
  NCCLCHECK(ncclCalloc(&xml, 1));
  char* xmlTopoFile = getenv("NCCL_TOPO_FILE");
  if (xmlTopoFile) {
    INFO(NCCL_ENV, "NCCL_TOPO_FILE set by environment to %s", xmlTopoFile);
    NCCLCHECK(ncclTopoGetXmlFromFile(xmlTopoFile, xml));
  }
  if (xml->maxIndex == 0) {
    // Create top tag
    struct ncclXmlNode* top;
    NCCLCHECK(xmlAddNode(xml, NULL, "system", &top));
    NCCLCHECK(xmlSetAttrInt(top, "version", NCCL_TOPO_XML_VERSION));
  }

  // Auto-detect GPUs if needed
  for (int r = 0; r < comm->nRanks; r++) {
    if (comm->peerInfo[r].hostHash == comm->peerInfo[comm->rank].hostHash) {
      char busId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
      NCCLCHECK(int64ToBusId(comm->peerInfo[r].busId, busId));
      struct ncclXmlNode* node;
      NCCLCHECK(ncclTopoFillGpu(xml, busId, &node));
      if (node == NULL) continue;
      NCCLCHECK(xmlSetAttrInt(node, "rank", r));
      NCCLCHECK(xmlInitAttrInt(node, "gdr", comm->peerInfo[r].gdrSupport));
    }
  }
  ...
}
```
This first creates the root node "system" via xmlAddNode and sets its attribute "system"["version"] = NCCL_TOPO_XML_VERSION. It then iterates over all ranks and compares each rank's hostHash with its own; an equal hash means the rank is on the same machine, so ncclTopoFillGpu is executed to add that GPU to the XML tree.
ncclTopoFillGpu uses ncclTopoGetPciNode to check whether an XML node for the current card already exists; if not, it creates a new "pci" node representing the GPU and sets "pci"["busid"] = busId, as sketched below.
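A sketch of that lookup-or-create step, based on the description above and reusing the helpers that appear later in ncclTopoGetXmlFromSys (the real ncclTopoGetPciNode may differ in details):

```cpp
// Sketch: find the "pci" node whose "busid" matches this GPU,
// or create it if this is the first time we see the card.
ncclResult_t topoGetPciNodeSketch(struct ncclXml* xml, const char* busId, struct ncclXmlNode** pciNode) {
  NCCLCHECK(xmlFindTagKv(xml, "pci", pciNode, "busid", busId));
  if (*pciNode == NULL) {
    NCCLCHECK(xmlAddNode(xml, NULL, "pci", pciNode));
    NCCLCHECK(xmlSetAttr(*pciNode, "busid", busId));
  }
  return ncclSuccess;
}
```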
It then executes ncclTopoGetXmlFromSys, which reads the GPU-to-CPU path from sysfs, converts that path into XML tree nodes, and fills in attributes read along the path.
```cpp
ncclResult_t ncclTopoGetXmlFromSys(struct ncclXmlNode* pciNode, struct ncclXml* xml) {
  // Fill info, then parent
  const char* busId;
  NCCLCHECK(xmlGetAttr(pciNode, "busid", &busId));
  char* path = NULL;
  int index;
  NCCLCHECK(xmlGetAttrIndex(pciNode, "class", &index));
  if (index == -1) {
    if (path == NULL) NCCLCHECK(getPciPath(busId, &path));
    NCCLCHECK(ncclTopoSetAttrFromSys(pciNode, path, "class", "class"));
  }
  NCCLCHECK(xmlGetAttrIndex(pciNode, "link_speed", &index));
  if (index == -1) {
    if (path == NULL) NCCLCHECK(getPciPath(busId, &path));
    char deviceSpeedStr[MAX_STR_LEN];
    float deviceSpeed;
    NCCLCHECK(ncclTopoGetStrFromSys(path, "max_link_speed", deviceSpeedStr));
    sscanf(deviceSpeedStr, "%f GT/s", &deviceSpeed);
    char portSpeedStr[MAX_STR_LEN];
    float portSpeed;
    NCCLCHECK(ncclTopoGetStrFromSys(path, "../max_link_speed", portSpeedStr));
    sscanf(portSpeedStr, "%f GT/s", &portSpeed);
    NCCLCHECK(xmlSetAttr(pciNode, "link_speed", portSpeed < deviceSpeed ? portSpeedStr : deviceSpeedStr));
  }
  NCCLCHECK(xmlGetAttrIndex(pciNode, "link_width", &index));
  if (index == -1) {
    if (path == NULL) NCCLCHECK(getPciPath(busId, &path));
    char strValue[MAX_STR_LEN];
    NCCLCHECK(ncclTopoGetStrFromSys(path, "max_link_width", strValue));
    int deviceWidth = strtol(strValue, NULL, 0);
    NCCLCHECK(ncclTopoGetStrFromSys(path, "../max_link_width", strValue));
    int portWidth = strtol(strValue, NULL, 0);
    NCCLCHECK(xmlSetAttrInt(pciNode, "link_width", std::min(deviceWidth, portWidth)));
  }
  ...
}
```
This first sets various attributes on pciNode. The sysfs path corresponding to busId is obtained through getPciPath; this path represents the route from the root complex down to the leaf node in the PCI tree.
```cpp
static ncclResult_t getPciPath(const char* busId, char** path) {
  char busPath[] = "/sys/class/pci_bus/0000:00/../../0000:00:00.0";
  memcpylower(busPath+sizeof("/sys/class/pci_bus/")-1, busId, BUSID_REDUCED_SIZE-1);
  memcpylower(busPath+sizeof("/sys/class/pci_bus/0000:00/../../")-1, busId, BUSID_SIZE-1);
  *path = realpath(busPath, NULL);
  if (*path == NULL) {
    WARN("Could not find real path of %s", busPath);
    return ncclSystemError;
  }
  return ncclSuccess;
}
```
For example, if path is /sys/devices/pci0000:10/0000:10:00.0/0000:11:00.0/0000:12:00.0/0000:13:00.0/0000:14:00.0/0000:15:00.0/0000:16:00.0/0000:17:00.0, where GPU’s busId is 0000:17:00.0, it corresponds to this diagram (note: switch corresponding to 15:00.0 is omitted):
It then reads properties along that path - class (the PCI device type), link_speed, link_width, etc. - and stores them as attributes of the XML pciNode. ncclTopoGetStrFromSys simply reads the named file under path into strValue.
```cpp
ncclResult_t ncclTopoGetStrFromSys(const char* path, const char* fileName, char* strValue) {
  char filePath[PATH_MAX];
  sprintf(filePath, "%s/%s", path, fileName);
  int offset = 0;
  FILE* file;
  if ((file = fopen(filePath, "r")) != NULL) {
    while (feof(file) == 0 && ferror(file) == 0 && offset < MAX_STR_LEN) {
      int len = fread(strValue+offset, 1, MAX_STR_LEN-offset, file);
      offset += len;
    }
    fclose(file);
  }
  if (offset == 0) {
    strValue[0] = '\0';
    INFO(NCCL_GRAPH, "Topology detection : could not read %s, ignoring", filePath);
  } else {
    strValue[offset-1] = '\0';
  }
  return ncclSuccess;
}
```
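As an aside on what link_speed and link_width imply, here is a tiny illustrative computation of the usable per-direction bandwidth of a PCIe Gen3 x16 link, assuming the 128b/130b encoding used by Gen3 and later; these values later feed into NCCL's estimate of each link's bandwidth:

```cpp
#include <cstdio>

int main() {
  // Example values as read from sysfs: "8.0 GT/s" and "16" (PCIe Gen3 x16).
  float linkSpeedGTs = 8.0f;   // per-lane raw signalling rate
  int   linkWidth    = 16;     // number of lanes
  // Gen3+ uses 128b/130b encoding, so usable bytes/s per lane = GT/s * 128/130 / 8.
  float gbPerSec = linkSpeedGTs * (128.0f / 130.0f) / 8.0f * linkWidth;
  printf("~%.2f GB/s per direction\n", gbPerSec);  // ~15.75 GB/s
  return 0;
}
```

Continuing in ncclTopoGetXmlFromSys, the function then walks upward toward the root complex: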
```cpp
ncclResult_t ncclTopoGetXmlFromSys(struct ncclXmlNode* pciNode, struct ncclXml* xml) {
  // Fill info, then parent
  ...
  struct ncclXmlNode* parent = pciNode->parent;
  if (parent == NULL) {
    if (path == NULL) NCCLCHECK(getPciPath(busId, &path));

    // Save that for later in case next step is a CPU
    char numaIdStr[MAX_STR_LEN];
    NCCLCHECK(ncclTopoGetStrFromSys(path, "numa_node", numaIdStr));

    // Go up one level in the PCI tree. Rewind two "/" and follow the upper PCI
    // switch, or stop if we reach a CPU root complex.
    int slashCount = 0;
    int parentOffset;
    for (parentOffset = strlen(path)-1; parentOffset > 0; parentOffset--) {
      if (path[parentOffset] == '/') {
        slashCount++;
        path[parentOffset] = '\0';
        int start = parentOffset - 1;
        while (start > 0 && path[start] != '/') start--;
        // Check whether the parent path looks like "BBBB:BB:DD.F" or not.
        if (checkBDFFormat(path+start+1) == 0) {
          // This a CPU root complex. Create a CPU tag and stop there.
          struct ncclXmlNode* topNode;
          NCCLCHECK(xmlFindTag(xml, "system", &topNode));
          NCCLCHECK(xmlGetSubKv(topNode, "cpu", &parent, "numaid", numaIdStr));
          if (parent == NULL) {
            NCCLCHECK(xmlAddNode(xml, topNode, "cpu", &parent));
            NCCLCHECK(xmlSetAttr(parent, "numaid", numaIdStr));
          }
        } else if (slashCount == 2) {
          // Continue on the upper PCI switch
          for (int i = strlen(path)-1; i > 0; i--) {
            if (path[i] == '/') {
              NCCLCHECK(xmlFindTagKv(xml, "pci", &parent, "busid", path+i+1));
              if (parent == NULL) {
                NCCLCHECK(xmlAddNode(xml, NULL, "pci", &parent));
                NCCLCHECK(xmlSetAttr(parent, "busid", path+i+1));
              }
              break;
            }
          }
        }
      }
      if (parent) break;
    }
    pciNode->parent = parent;
    parent->subs[parent->nSubs++] = pciNode;
  }
  if (strcmp(parent->name, "pci") == 0) {
    NCCLCHECK(ncclTopoGetXmlFromSys(parent, xml));
  } else if (strcmp(parent->name, "cpu") == 0) {
    NCCLCHECK(ncclTopoGetXmlFromCpu(parent, xml));
  }
  free(path);
  return ncclSuccess;
}
```
Then it moves upward from pciNode. Since a switch's upstream and downstream ports each correspond to a bridge, NCCL uses the busId of the upstream port's bridge to represent the switch, so it jumps up two levels before creating a new XML node for the switch. slashCount is incremented for every PCI bridge found on the way up; when slashCount == 2, the upstream port of a switch has been reached, so a new "pci" XML node parent is created to represent the switch and the current pciNode is linked under it. Since parent is itself a "pci" node, ncclTopoGetXmlFromSys recurses on it until the root complex is reached. At that point a "cpu" child node is created under "system", the recursion stops, and ncclTopoGetXmlFromCpu fills in the "cpu" attributes, such as arch (e.g., x86 or arm), affinity (which CPU cores belong to this NUMA node), numaid, etc.
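As a worked illustration of that upward walk, here is a standalone sketch (not NCCL code) operating on the example sysfs path from earlier:

```cpp
#include <cstdio>
#include <string>

// Does this path component look like "BBBB:BB:DD.F"? (simplified check)
static bool looksLikeBDF(const std::string& s) {
  return s.size() == 12 && s[4] == ':' && s[7] == ':' && s[10] == '.';
}

int main() {
  // Example sysfs path from the article; 0000:17:00.0 is the GPU itself.
  std::string path =
      "/sys/devices/pci0000:10/0000:10:00.0/0000:11:00.0/0000:12:00.0/"
      "0000:13:00.0/0000:14:00.0/0000:15:00.0/0000:16:00.0/0000:17:00.0";
  int slashCount = 0;
  while (true) {
    path.erase(path.rfind('/'));                       // go up one level
    std::string parent = path.substr(path.rfind('/') + 1);
    if (!looksLikeBDF(parent)) {                       // reached the root complex ("pci0000:10")
      printf("stop here: create or reuse the \"cpu\" node\n");
      break;
    }
    if (++slashCount == 2) {                           // every second bridge is a switch upstream port
      printf("switch represented by busid %s\n", parent.c_str());
      slashCount = 0;
    }
  }
  return 0;
}
```

Running this prints switch nodes for 0000:15:00.0, 0000:13:00.0 and 0000:11:00.0, then stops at the root complex pci0000:10 - matching the recursion described above.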
This completes ncclTopoGetXmlFromSys. Back to ncclTopoFillGpu:
```cpp
ncclResult_t ncclTopoGetXmlFromGpu(struct ncclXmlNode* pciNode, nvmlDevice_t nvmlDev, struct ncclXml* xml, struct ncclXmlNode** gpuNodeRet) {
  struct ncclXmlNode* gpuNode = NULL;
  NCCLCHECK(xmlGetSub(pciNode, "gpu", &gpuNode));
  if (gpuNode == NULL) NCCLCHECK(xmlAddNode(xml, pciNode, "gpu", &gpuNode));

  int index = -1;
  int dev = -1;
  NCCLCHECK(xmlGetAttrIndex(gpuNode, "dev", &index));
  if (index == -1) {
    if (nvmlDev == NULL) {
      WARN("No NVML, trying to use CUDA instead");
      const char* busId;
      NCCLCHECK(xmlGetAttr(pciNode, "busid", &busId));
      if (busId == NULL || cudaDeviceGetByPCIBusId(&dev, busId) != cudaSuccess) dev = -1;
    } else {
      NCCLCHECK(wrapNvmlDeviceGetIndex(nvmlDev, (unsigned int*)&dev));
    }
    NCCLCHECK(xmlSetAttrInt(gpuNode, "dev", dev));
  }
  NCCLCHECK(xmlGetAttrInt(gpuNode, "dev", &dev));
  if (dev == -1) { *gpuNodeRet = NULL; return ncclSuccess; }

  NCCLCHECK(xmlGetAttrIndex(gpuNode, "sm", &index));
  if (index == -1) {
    int cudaMajor, cudaMinor;
    if (nvmlDev == NULL) {
      cudaDeviceProp devProp;
      CUDACHECK(cudaGetDeviceProperties(&devProp, dev));
      cudaMajor = devProp.major; cudaMinor = devProp.minor;
    } else {
      NCCLCHECK(wrapNvmlDeviceGetCudaComputeCapability(nvmlDev, &cudaMajor, &cudaMinor));
    }
    NCCLCHECK(xmlSetAttrInt(gpuNode, "sm", cudaMajor*10+cudaMinor));
  }
  int sm;
  NCCLCHECK(xmlGetAttrInt(gpuNode, "sm", &sm));

  struct ncclXmlNode* nvlNode = NULL;
  NCCLCHECK(xmlGetSub(pciNode, "nvlink", &nvlNode));
  if (nvlNode == NULL) {
    // NVML NVLink detection
    int maxNvLinks = (sm < 60) ? 0 : (sm < 70) ? 4 : (sm < 80) ? 6 : 12;
    if (maxNvLinks > 0 && nvmlDev == NULL) {
      WARN("No NVML device handle. Skipping nvlink detection.\n");
      maxNvLinks = 0;
    }
    for (int l = 0; l < maxNvLinks; ++l) {
      // Check whether we can use this NVLink for P2P
      unsigned canP2P;
      if ((wrapNvmlDeviceGetNvLinkCapability(nvmlDev, l, NVML_NVLINK_CAP_P2P_SUPPORTED, &canP2P) != ncclSuccess) || !canP2P) continue;

      // Make sure the Nvlink is up. The previous call should have trained the link.
      nvmlEnableState_t isActive;
      if ((wrapNvmlDeviceGetNvLinkState(nvmlDev, l, &isActive) != ncclSuccess) || (isActive != NVML_FEATURE_ENABLED)) continue;

      // Try to figure out what's on the other side of the NVLink
      nvmlPciInfo_t remoteProc;
      if (wrapNvmlDeviceGetNvLinkRemotePciInfo(nvmlDev, l, &remoteProc) != ncclSuccess) continue;

      // Make a lower case copy of the bus ID for calling ncclDeviceType
      // PCI system path is in lower case
      char* p = remoteProc.busId;
      char lowerId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
      for (int c = 0; c < NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE; c++) {
        lowerId[c] = tolower(p[c]);
        if (p[c] == 0) break;
      }

      NCCLCHECK(xmlGetSubKv(gpuNode, "nvlink", &nvlNode, "target", lowerId));
      if (nvlNode == NULL) {
        NCCLCHECK(xmlAddNode(xml, gpuNode, "nvlink", &nvlNode));
        NCCLCHECK(xmlSetAttr(nvlNode, "target", lowerId));
        NCCLCHECK(xmlSetAttrInt(nvlNode, "count", 1));
      } else {
        int count;
        NCCLCHECK(xmlGetAttrInt(nvlNode, "count", &count));
        NCCLCHECK(xmlSetAttrInt(nvlNode, "count", count+1));
      }
    }
  }
  // Fill target classes
  for (int s = 0; s < gpuNode->nSubs; s++) {
    struct ncclXmlNode* sub = gpuNode->subs[s];
    if (strcmp(sub->name, "nvlink") != 0) continue;
    int index;
    NCCLCHECK(xmlGetAttrIndex(sub, "tclass", &index));
    if (index == -1) {
      const char* busId;
      NCCLCHECK(xmlGetAttr(sub, "target", &busId));
      if (strcmp(busId, "fffffff:ffff:ff") == 0) {
        // Remote NVLink device is not visible inside this VM. Assume NVSwitch.
        NCCLCHECK(xmlSetAttr(sub, "tclass", "0x068000"));
      } else {
        char* path;
        NCCLCHECK(getPciPath(busId, &path));
        NCCLCHECK(ncclTopoSetAttrFromSys(sub, path, "class", "tclass"));
      }
    }
  }
  *gpuNodeRet = gpuNode;
  return ncclSuccess;
}
```
This first creates a "gpu" node under the GPU's "pci" node and sets attributes such as dev and the compute capability sm, then queries NVLink information. It iterates over all possible NVLinks and queries each one through nvmlDeviceGetNvLinkCapability. If an NVLink is up, a new "nvlink" node is created under the "gpu" node with a "target" attribute giving the PCIe busId of the NVLink peer. "nvlink" nodes with the same "target" are merged into one, with "count" recording the number of NVLinks between the two endpoints and "tclass" recording the PCI device type of the "target".
This completes ncclTopoFillGpu. The XML now looks like this (showing the single-GPU case, where the "gpu" node and its parent "pci" node refer to the same GPU):
Back in ncclTopoGetSystem, the "gpu" node's rank and gdr attributes are then set.
Then, for every NIC, a process similar to the GPU case builds up the XML tree through ncclTopoGetXmlFromSys, as shown (single-NIC case, where the "net" node, the "nic" node and "nic"'s parent "pci" node all represent the same NIC):
Finally, here’s what the corresponding XML looks like:
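For instance, on a machine with one GPU and one NIC, the tree can take roughly the following shape. This is purely illustrative: the attribute values are made up, and only attributes discussed above are shown.

```xml
<system version="1">
  <cpu numaid="0" arch="x86_64" affinity="0000ffff">
    <pci busid="0000:11:00.0" class="0x060400" link_speed="8 GT/s" link_width="16">
      <pci busid="0000:13:00.0" class="0x060400" link_speed="8 GT/s" link_width="16">
        <pci busid="0000:15:00.0" class="0x060400" link_speed="8 GT/s" link_width="16">
          <pci busid="0000:17:00.0" class="0x030200" link_speed="16 GT/s" link_width="16">
            <gpu dev="0" sm="70" rank="0" gdr="1">
              <nvlink target="0000:27:00.0" count="2" tclass="0x030200"/>
            </gpu>
          </pci>
          <pci busid="0000:18:00.0" class="0x020700" link_speed="8 GT/s" link_width="16">
            <nic>
              <net name="mlx5_0" dev="0" gdr="1"/>
            </nic>
          </pci>
        </pci>
      </pci>
    </pci>
  </cpu>
</system>
```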
To summarize, this section covered NCCL's topology analysis: using sysfs, it builds an XML tree representation of the PCI tree containing the machine's GPUs and NICs.
NCCL Source Code Study - This article is part of a series.