Researchers solve silent packet drop problem in multi-node chip interconnects
A group of researchers from South Korea has proposed a method to more reliably scale chip-to-chip interconnects. The ever-increasing scale of AI models, such as Meta’s mammoth Llama 3.1 405B and Anthropic’s recently released Claude 4 family of models, requires massive multi-node computing clusters where reliable chip interconnects are crucial.
Several methods exist to support chip-to-chip interconnects, but researchers from Sungkyunkwan University and the Korea Advanced Institute of Science and Technology (KAIST) took issue with existing protocols, such as Compute Express Link (CXL), Nvidia’s NVLink, and the recently launched UALink.
They argue that while such protocols might work well for point-to-point links, they become unreliable in multi-node, switched environments (e.g., AI training loads distributed across multiple interconnected systems via a high-speed network switch) due to silent packet drops – where data packets are randomly lost within a network without generating any error messages.
To address this, they introduce Implicit Sequence Number (ISN), a technique that embeds sequence tracking into the Cyclic Redundancy Check (CRC) mechanism to detect when silent data drops occur, without hardware add-ons further complicating matters.
The concept is underpinned by Reliability Extended Link (RXL), a backward-compatible extension of the existing CXL protocols that, they contend, enables more scalable and robust multi-node communication. Effectively, the researchers have found a way to track data packets and detect drops without adding extra bits, simply by folding sequence information into existing mechanisms like the CRC.
“By embedding sequence tracking into CRC, ISN eliminates the need for explicit sequence numbers in flit headers, reducing header overhead without compromising sequence integrity,” the paper reads. “The proposed RXL enhances the robustness of the chip interconnect by ensuring end-to-end data and sequence validation while maintaining compatibility with existing flit structures.”
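In practice, the idea looks roughly like the sketch below: both ends of the link keep their own sequence counter and mix it into the CRC calculation, so the number itself never has to travel in the flit header. The flit framing, counter width, and use of CRC-32 here are illustrative assumptions for the sake of the sketch, not the paper's exact encoding.

import zlib

def isn_crc(payload: bytes, implicit_seq: int) -> int:
    # Fold the untransmitted sequence number into the CRC input; sender and
    # receiver each keep their own copy of the counter, so no header bits
    # are spent carrying it on the wire.
    seq_bytes = implicit_seq.to_bytes(4, "big")
    return zlib.crc32(seq_bytes + payload) & 0xFFFFFFFF

def make_flit(payload: bytes, tx_seq: int) -> tuple[bytes, int]:
    # The flit carries only the payload and the CRC, as a normal flit would;
    # the sequence number stays implicit on both sides of the link.
    return payload, isn_crc(payload, tx_seq)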
The authors argue that existing protocols like CXL have limited header fields that must juggle both sequencing and acknowledgment data, which ultimately obscures a packet's original order and makes it hard to detect when something goes missing midstream.
Their proposed ISN concept avoids this ambiguity. Because the CRC is calculated over the flit together with the sequence number the receiver expects next, a missing packet leaves the two sides out of step, and the mismatch surfaces as a CRC error that triggers a retry.
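Continuing the sketch above, the snippet below drops one flit from a short stream: the receiver's counter no longer lines up with the sender's, so the very next CRC check fails and a retransmission can be requested. The replay logic is reduced to a single print statement, and all names are illustrative.

def receive(payload: bytes, crc: int, rx_seq: int) -> bool:
    # The receiver recomputes the CRC with the sequence number it expects
    # next; a silently dropped flit makes the two counters disagree.
    return crc == isn_crc(payload, rx_seq)

flits = [make_flit(b"flit-%d" % i, i) for i in range(3)]
del flits[1]  # simulate a silent drop inside the fabric

rx_seq = 0
for payload, crc in flits:
    if receive(payload, crc, rx_seq):
        rx_seq += 1  # in-order flit accepted
    else:
        print("CRC mismatch at seq", rx_seq, "- requesting retransmission")
        break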
RXL then builds on this by shifting CRC checks from the link layer to the transport layer, enabling end-to-end sequence validation. At the same time, it leaves link-layer mechanisms such as FEC (Forward Error Correction) untouched, preserving error correction on each hop.
“While FEC is primarily designed for error correction, it can also detect most uncorrectable errors,” the researchers wrote. “Although FEC’s error detection capability is not exhaustive, it provides early identification of uncorrectable errors, reducing unnecessary forwarding of erroneous flits.”
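That division of labour can be pictured with one last addition to the sketch: the CRC computed at the source is only verified at the destination, so corruption introduced inside a switch still trips the end-to-end check, while per-hop FEC handles correction on each link. The link_hop helper below is a hypothetical stand-in for a link traversal, not part of RXL itself.

def link_hop(flit, corrupt=False):
    # Stand-in for one link traversal. Real hardware would apply FEC here
    # and flag uncorrectable symbols early; this sketch simply forwards the
    # flit, optionally corrupting it to mimic a fault inside a switch.
    payload, crc = flit
    if corrupt:
        payload = b"X" + payload[1:]
    return payload, crc

flit = make_flit(b"hello", 0)          # CRC attached once, at the source
flit = link_hop(flit, corrupt=True)    # error introduced mid-fabric
ok = receive(*flit, 0)                 # verified once, at the destination
print("end-to-end check:", "passed" if ok else "failed, retry requested")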
The result, the authors claim, is a far more reliable interconnect. Their paper suggests that RXL cuts the rate of undetected ordering failures by a factor of more than 10¹⁸ in simulated switched environments compared with standard CXL, without any measurable increase in performance overhead or hardware complexity.
“Our evaluation demonstrates that RXL achieves the same level of performance as existing method[s] while addressing the critical reliability vulnerability of ordering failures,” the paper reads. “Additionally, RXL delivers strong end-to-end protection against data corruption, ensuring even errors internal to switches are effectively detected and mitigated.”
The paper follows recent research from Chinese scientists into a new method for sharing GPUs more efficiently across jobs in campus data centers. Dubbed gPooling, the pooling solution was able to improve utilization by up to 2x on a cluster powered by Nvidia A100s.