
Written by Mark Harris

Published on July 30, 2025

The relentless pursuit of more intelligent AI models, particularly large language models (LLMs) and deep learning algorithms, has driven an unprecedented demand for computational power. At the heart of this revolution are Graphics Processing Units (GPUs), which, with their parallel processing capabilities, are perfectly suited for the intensive, iterative calculations that define AI training. However, unleashing the full potential of thousands of interconnected GPUs in a data center environment isn’t simply a matter of plugging them in. It requires a sophisticated networking infrastructure capable of handling massive data flows with minimal latency and, critically, zero packet loss. This is where Data Center Bridging (DCB), coupled with advanced flow control mechanisms like Priority-based Flow Control (PFC) and Explicit Congestion Notification (ECN), becomes absolutely essential.

Network Congestion in Large-Scale GPU Clusters: An Overview

Imagine an AI training job spread across hundreds or even thousands of GPUs. These GPUs constantly exchange colossal amounts of data (parameters, gradients, activation values), often in highly synchronized bursts. This “all-to-one” or “many-to-one” communication pattern, known as incast, can quickly overwhelm traditional Ethernet networks and trigger what is commonly referred to as throughput collapse. Without proper mechanisms, network buffers overflow, leading to packet drops. In the context of AI, packet loss is not just an inconvenience; it can severely degrade training efficiency, lengthen training times, and even lead to model convergence issues. Retransmissions due to packet loss introduce significant latency, effectively nullifying the high-speed processing capabilities of the GPUs.
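To put rough numbers on incast, the following minimal Python sketch (with an illustrative link speed and buffer size, not figures from any particular switch) estimates how quickly a shared egress buffer fills when many senders converge on a single receiver.

```python
def incast_overflow_time_us(num_senders: int,
                            link_rate_gbps: float,
                            buffer_mb: float) -> float:
    """Rough time until an egress buffer overflows during an incast burst.

    Every sender transmits at line rate toward one egress port, which can
    only drain at line rate, so the buffer fills at roughly
    (num_senders - 1) * link_rate.
    """
    fill_rate_bytes_per_us = (num_senders - 1) * link_rate_gbps * 1e9 / 8 / 1e6
    return (buffer_mb * 1e6) / fill_rate_bytes_per_us


# Illustrative example: 32 GPUs sending gradients to one peer over 100 GbE
# through a switch with 16 MB of shared buffer.
print(f"buffer overflows after ~{incast_overflow_time_us(32, 100, 16):.0f} us")
```

Even with generous buffering, the window before drops begin is measured in tens of microseconds, which is why reactive retransmission alone cannot save the workload.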

Network Connectivity, Congestion and GPU Utilization

Network performance therefore becomes the key to high utilization of these capital-intensive GPUs, for both training and inference:

  • LLM Inference (Model Parallelism and Batching): For very large LLMs that require model parallelism (where different layers or parts of the model reside on different GPUs, potentially on different servers), each inference request involves sequential data transfers between GPUs as the prompt flows through the model’s layers. If the network path between these GPUs suffers from even brief congestion, the entire inference pipeline stalls. Similarly, when using batching to maximize GPU utilization, a small amount of packet loss or delay due to congestion for just one part of a batch can delay the completion of the entire batch, causing a ripple effect on subsequent batches. This directly translates to a higher inference latency for end-users and significantly lower inference throughput (queries per second) for the overall system. DCB’s lossless capabilities ensure these critical inter-GPU transfers are never interrupted, maintaining a smooth, high-throughput inference pipeline and maximizing the ROI on your GPU investment.
  • LLM Fine-Tuning (Distributed Training): In a distributed fine-tuning job across hundreds of GPUs, the process involves frequent and massive exchanges of gradient updates and model parameters (e.g., All-Reduce operations). If the network experiences congestion, these collective communication operations slow down significantly. GPUs, being parallel processors, become idle, waiting for data from other GPUs to complete the current iteration before they can start the next. A GPU showing 100% compute utilization might be stalling for network I/O, meaning the effective work done is far less, leading to hours or even days of wasted compute time and increased cloud/electricity costs. PFC ensures these critical All-Reduce packets are never dropped, preventing catastrophic slowdowns, while ECN works to proactively manage the flow to minimize these idle waits.
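Both scenarios above reduce to the same arithmetic: any communication time that cannot be overlapped with compute leaves the GPU idle. The sketch below is a back-of-the-envelope model with made-up step times, not a measurement, but it shows how quickly congestion-induced stalls erode effective utilization.

```python
def effective_gpu_utilization(compute_ms_per_step: float,
                              exposed_comm_ms_per_step: float) -> float:
    """Fraction of wall-clock time the GPU spends doing useful compute.

    exposed_comm_ms_per_step is the portion of collective communication
    (e.g., All-Reduce) that cannot be overlapped with compute and leaves
    the GPU idle.
    """
    return compute_ms_per_step / (compute_ms_per_step + exposed_comm_ms_per_step)


# Illustrative numbers: a 120 ms compute step with 15 ms of exposed
# communication on a healthy network, versus 90 ms of exposed stall time
# when congestion causes retransmissions and pauses.
print(f"healthy network:   {effective_gpu_utilization(120, 15):.0%}")
print(f"congested network: {effective_gpu_utilization(120, 90):.0%}")
```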

Data Center Bridging: Creating Lossless AI Networks

Data Center Bridging (DCB) is a set of IEEE 802.1 standards (including 802.1Qbb for Priority-based Flow Control and 802.1Qaz for Enhanced Transmission Selection and DCBX) designed to enhance Ethernet for data center environments, specifically to support converged networks where various traffic types (storage, management, and high-performance computing) coexist. Key to DCB’s role in AI is its ability to create a low-latency, lossless Ethernet fabric, ensuring that critical AI traffic experiences no packet loss. Two pivotal components that make this possible are Priority-based Flow Control (PFC) and Explicit Congestion Notification (ECN).

Priority-based Flow Control (PFC): Preventing Packet Loss at the Link Level

PFC (IEEE 802.1Qbb) is a link-level flow control mechanism that extends the traditional Ethernet PAUSE frame. Unlike the standard PAUSE frame which halts all traffic on a link, PFC allows for selective pausing of traffic based on its Class of Service (CoS) priority.

Here’s a simplified explanation of how PFC works in a GPU-dense environment:

  • Traffic Classification: AI training traffic, often utilizing Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCEv2), is assigned a specific, high-priority CoS. This ensures it’s treated as critical data by the network.
  • Congestion Detection: When a switch’s egress buffer for a particular CoS queue (e.g., the one dedicated to RoCEv2 traffic) reaches a pre-defined threshold, it signifies impending congestion.
  • PFC Pause Frame: The congested switch sends a PFC pause frame back to the upstream transmitting device (another switch or a GPU’s Network Interface Card – NIC). This pause frame is specific to the congested CoS priority.
  • Selective Halting: Upon receiving the PFC pause frame, the upstream device temporarily stops transmitting traffic only for that specific CoS priority. Other traffic classes on the same link remain unaffected.
  • Buffer Recovery and Resume: As the congested buffer drains and its occupancy falls below a resume threshold, the switch sends a PFC resume frame, signaling the upstream device to restart transmission for that priority.
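The pause/resume behavior described above can be visualized with a small toy model. The Python sketch below simulates a single PFC-protected queue with assumed xoff/xon thresholds (the byte values are illustrative, not vendor defaults); in a real switch each CoS priority has its own queue and thresholds.

```python
from dataclasses import dataclass


@dataclass
class PfcQueue:
    """Toy model of one per-priority (CoS) egress queue with PFC thresholds."""
    xoff_threshold: int          # occupancy (bytes) that triggers a PFC pause frame
    xon_threshold: int           # occupancy (bytes) that triggers a PFC resume frame
    occupancy: int = 0
    paused_upstream: bool = False

    def enqueue(self, nbytes: int) -> None:
        """A frame arrives on this priority; pause upstream if the buffer is filling."""
        self.occupancy += nbytes
        if not self.paused_upstream and self.occupancy >= self.xoff_threshold:
            self.paused_upstream = True
            print(f"send PFC pause  (occupancy={self.occupancy} B)")

    def drain(self, nbytes: int) -> None:
        """The egress port transmits; resume upstream once the buffer has drained."""
        self.occupancy = max(0, self.occupancy - nbytes)
        if self.paused_upstream and self.occupancy <= self.xon_threshold:
            self.paused_upstream = False
            print(f"send PFC resume (occupancy={self.occupancy} B)")


# Only this RoCEv2 priority queue is paused; other CoS queues on the same link
# would be separate PfcQueue instances and keep transmitting unaffected.
roce_queue = PfcQueue(xoff_threshold=100_000, xon_threshold=40_000)
for _ in range(60):
    if not roce_queue.paused_upstream:   # the upstream device honors the pause
        roce_queue.enqueue(9_000)        # one ~9 KB jumbo frame arrives
    roce_queue.drain(4_500)              # the egress drains at half the arrival rate
```

In practice, switches also reserve headroom above the xoff threshold to absorb frames that are already in flight when the pause frame is issued.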

Benefits of PFC for GPUs:

  • Zero Packet Loss: By pausing traffic before buffers overflow, PFC guarantees lossless delivery for critical AI data, which is paramount for the integrity and efficiency of distributed GPU computations.
  • Isolation of Traffic: It prevents a burst of high-priority AI traffic from impacting other, less time-sensitive traffic types on the same link, maintaining overall network stability.
  • Predictable Performance: By eliminating packet loss, PFC contributes to more predictable and consistent performance for GPU communication, reducing jitter and improving job completion times.

However, PFC has limitations. If the network is not carefully designed and configured, a “PFC storm” can occur, in which pause frames propagate extensively and cause network-wide slowdowns or even deadlocks, especially in multi-hop environments. This is why another technology, ECN, is added to complement PFC.

Explicit Congestion Notification (ECN): Proactive Congestion Avoidance

ECN (RFC 3168) is a mechanism that allows network devices to signal incipient congestion to endpoints before packet loss occurs. Instead of dropping packets, ECN-capable devices mark packets in the IP header to indicate congestion. It provides the active queue management needed for reliable GPU-to-GPU traffic.

The ECN traffic management process typically unfolds as follows:

  • ECN-Capable Negotiation: During connection establishment (e.g., TCP handshake), the sender and receiver negotiate their ECN capabilities.
  • Congestion Marking: When a network device’s queue utilization reaches an ECN threshold (a lower threshold than the one that would trigger PFC), the device marks incoming ECN-capable packets as “Congestion Experienced” (CE).
  • Receiver Notification: The marked packet reaches the ECN-capable receiver.
  • Sender Feedback: The receiver then echoes this congestion notification back to the sender (e.g., by setting the ECN-Echo (ECE) bit in the TCP header or sending a Congestion Notification Packet (CNP) in RoCEv2).
  • Rate Reduction: Upon receiving the congestion feedback, the sender proactively reduces its transmission rate, thereby alleviating the congestion before buffer overflows and packet drops.
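To make the feedback loop concrete, here is a minimal sketch of ECN-style marking on the switch and a DCQCN-like reaction on the sending NIC. The threshold, decrease factor, and recovery step are illustrative assumptions; real DCQCN implementations use a more elaborate rate-increase state machine.

```python
ECN_MARK_THRESHOLD_BYTES = 50_000   # marking starts well below any PFC threshold
RATE_DECREASE_FACTOR = 0.5          # DCQCN-style multiplicative decrease (illustrative)
RATE_INCREASE_GBPS = 5.0            # additive recovery per interval (illustrative)
LINE_RATE_GBPS = 100.0


def maybe_mark_ce(queue_occupancy_bytes: int) -> bool:
    """Switch side: mark the packet Congestion Experienced (CE) once the queue
    crosses the ECN threshold, instead of waiting for it to fill and drop."""
    return queue_occupancy_bytes >= ECN_MARK_THRESHOLD_BYTES


def sender_adjust_rate(current_rate_gbps: float, cnp_received: bool) -> float:
    """Sender-side NIC reaction: cut the rate on a Congestion Notification
    Packet, otherwise recover additively toward line rate."""
    if cnp_received:
        return current_rate_gbps * RATE_DECREASE_FACTOR
    return min(LINE_RATE_GBPS, current_rate_gbps + RATE_INCREASE_GBPS)


# A marked packet travels to the receiver, which echoes a CNP to the sender;
# the sender then backs off before the switch buffer ever overflows.
rate = LINE_RATE_GBPS
for occupancy in (10_000, 60_000, 80_000, 40_000, 20_000):
    cnp = maybe_mark_ce(occupancy)          # receiver echoes CE back as a CNP
    rate = sender_adjust_rate(rate, cnp)
    print(f"queue={occupancy:>6} B  marked={cnp}  new rate={rate:5.1f} Gbps")
```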

Synergy of PFC and ECN for Large GPU Deployments:

In large-scale AI clusters consisting of hundreds or thousands of costly GPUs, PFC and ECN work in tandem to provide a robust and efficient lossless network, which in turn increases the delivered value of the GPUs themselves:

  • ECN as First Line of Defense: ECN acts as a proactive mechanism, providing early warnings of congestion. By allowing senders to reduce their rate preemptively, it minimizes the likelihood of reaching PFC thresholds and avoids the more drastic measure of pausing traffic. This “soft” rate adaptation is crucial for maintaining continuous data flow.
  • PFC as Last Resort: If ECN’s proactive measures aren’t sufficient to prevent congestion, or if sudden, massive bursts of traffic occur, PFC steps in as a reactive, hard-stop mechanism to prevent any packet loss whatsoever for the most critical AI traffic.
  • Optimizing RoCEv2 Performance: RoCEv2, widely adopted for GPU interconnections, relies heavily on these mechanisms. ECN signals trigger congestion control algorithms (like Data Center Quantized Congestion Notification – DCQCN) within the NICs, dynamically adjusting transmit rates. PFC ensures that even under extreme load, no RoCEv2 packets are dropped, preserving the integrity of RDMA operations.
  • Balancing Latency and Throughput: By combining ECN’s proactive rate limiting with PFC’s lossless guarantee, network architects can finely tune the network to balance low latency for delay-sensitive “mice flows” (small, interactive messages) and high throughput for large “elephant flows” (bulk data transfers during training).
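One practical consequence of this division of labor is a simple configuration rule: the ECN marking threshold for a queue should sit well below its PFC xoff threshold, so senders are asked to slow down before any link is paused. The small sketch below encodes that sanity check; the byte values are placeholders, not recommendations.

```python
def validate_thresholds(ecn_mark_bytes: int, pfc_xoff_bytes: int,
                        headroom_bytes: int) -> None:
    """Sanity-check the 'ECN first, PFC last resort' ordering for one queue.

    ECN marking should engage well before the PFC xoff threshold so senders
    slow down before any link is paused; headroom above xoff must absorb
    frames already in flight when the pause frame is sent.
    """
    if ecn_mark_bytes >= pfc_xoff_bytes:
        raise ValueError("ECN threshold must sit below the PFC xoff threshold")
    print(f"ok: ECN marks at {ecn_mark_bytes} B, PFC pauses at {pfc_xoff_bytes} B, "
          f"{headroom_bytes} B of headroom reserved above xoff")


validate_thresholds(ecn_mark_bytes=50_000, pfc_xoff_bytes=200_000,
                    headroom_bytes=100_000)
```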

Real-World Enterprise Example: Multi-Tenant AI Workloads with Diverse QoS and SLA Requirements

In a large enterprise, an AI cluster is rarely dedicated to a single task. Consider a multi-tenant environment where the same GPU cluster supports several distinct AI workloads:

  • Research & Development (R&D) Teams: Running experimental LLM fine-tuning jobs for new product features. These jobs are often large, long-running, and can tolerate slightly higher initial latency but demand guaranteed bandwidth to complete within a specific time window (e.g., overnight, with an SLA for completion by morning). For them, consistent throughput to avoid training delays is paramount, even if it means momentarily slowing down other lower-priority traffic.
  • Manufacturing Quality Control (QC) Department: Requires real-time defect detection on assembly lines, demanding extremely low latency (<10 ms) and near-zero packet loss (e.g., an SLA of 99.999% successful inferences) to prevent production line stoppages.
  • Financial Risk Analysis team: Running batch inference jobs for fraud detection or market prediction. These are critical but less interactive, requiring high throughput for large datasets to be processed within a specific compliance window (e.g., by end of day).

DCB, with PFC and ECN, allows network administrators to classify and prioritize these diverse traffic types. RoCEv2 traffic for latency-sensitive inference queries can be assigned the highest priority CoS with aggressive ECN thresholds to ensure proactive rate reduction, while training traffic gets a high-priority CoS for lossless delivery, and other background traffic uses a standard class. This sophisticated traffic management guarantees that high-priority inference queries are not starved by a massive training job, ensuring that each tenant’s SLA requirements are met on the shared GPU infrastructure.
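As a rough illustration of how such a policy might be expressed, the sketch below maps the three tenants to traffic classes with different CoS priorities, lossless treatment, and ECN aggressiveness. The class names, priority values, and thresholds are hypothetical and would need to be aligned with the actual switch and NIC QoS configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficClass:
    cos_priority: int        # 802.1p / CoS priority carried on the wire
    pfc_enabled: bool        # lossless (PFC-protected) or best effort
    ecn_mark_bytes: int      # lower value = more aggressive early marking


# One possible mapping for the three tenants described above (illustrative only).
TENANT_CLASSES = {
    "qc_realtime_inference": TrafficClass(cos_priority=5, pfc_enabled=True,
                                          ecn_mark_bytes=30_000),
    "rnd_finetuning":        TrafficClass(cos_priority=3, pfc_enabled=True,
                                          ecn_mark_bytes=100_000),
    "risk_batch_inference":  TrafficClass(cos_priority=1, pfc_enabled=False,
                                          ecn_mark_bytes=200_000),
}

for tenant, tc in TENANT_CLASSES.items():
    print(f"{tenant:<24} CoS={tc.cos_priority}  lossless={tc.pfc_enabled}  "
          f"ECN mark at {tc.ecn_mark_bytes} B")
```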

Conclusion

As AI models continue to grow in complexity and scale, the underlying scale-out networking infrastructure becomes an increasingly critical component of overall system performance. Data Center Bridging, with its cornerstone features of Priority-based Flow Control (PFC) and Explicit Congestion Notification (ECN), is not merely a nice-to-have optimization; it is a fundamental enabler for building efficient, reliable, and scalable AI computing platforms around large investments in GPUs. By selecting network connectivity solutions that ensure lossless communication, proactively manage congestion, and provide granular control over traffic, enterprises can build on these industry-standard DCB technologies to push the boundaries of what is possible and accelerate the pace of innovation in the era of artificial intelligence.
