Efficient Cross-GPU Communication for Disaggregated LLM Serving

Jun 12, 2026 · Dezhi Yu

Abstract

Large Language Model (LLM) serving increasingly relies on disaggregated architectures to improve resource utilization and scalability. However, existing LLM systems often depend on hardware-specific communication libraries tightly coupled to particular RDMA transports, resulting in fragmented implementations, limited portability, and significant engineering overhead when deploying across heterogeneous cloud environments. In this paper, we present CommBridge, a portable communication runtime for distributed LLM serving. CommBridge introduces a unified abstraction layer that decouples LLM communication primitives from underlying RDMA implementations, enabling seamless deployment across diverse networking backends including InfiniBand, RoCE, and AWS Elastic Fabric Adapter (EFA). The system exposes a minimal set of communication primitives optimized for key LLM workloads such as KV-cache migration, Mixture-of-Experts (MoE) dispatch and aggregation, model weight synchronization, and distributed inference scheduling. We implement CommBridge in production-scale LLM serving environments and evaluate it across clusters ranging from 64 to 2,048 GPUs. Experimental results demonstrate that CommBridge achieves performance comparable to highly optimized vendor-specific implementations while significantly reducing system complexity and improving deployment portability. Across representative LLM inference and training workloads, CommBridge improves end-to-end throughput by up to 2.3x and reduces communication latency by up to 47% compared with existing framework-integrated approaches. Our results suggest that communication portability can be achieved without sacrificing performance, providing a practical foundation for next-generation cloud-native LLM infrastructure.
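The abstract does not specify CommBridge's actual API, so the following is only a minimal sketch of the general idea it describes: a workload-level primitive (here, KV-cache migration) written against a backend-neutral interface, with the transport chosen by name. Every name below (`Transport`, `LoopbackTransport`, `BACKENDS`, `make_transport`, `migrate_kv_cache`, `fetch_kv_cache`) is hypothetical, and the in-memory loopback backend merely stands in for a real InfiniBand, RoCE, or EFA implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class Transport(ABC):
    """Hypothetical backend-neutral interface (not CommBridge's real API)."""

    @abstractmethod
    def write(self, dst_rank: int, key: str, payload: bytes) -> None:
        """Place payload under key in the memory of dst_rank."""

    @abstractmethod
    def read(self, rank: int, key: str) -> bytes:
        """Return the payload previously written under key on rank."""


class LoopbackTransport(Transport):
    """In-memory stand-in for a real RDMA backend; assumption for illustration."""

    def __init__(self) -> None:
        self._store: Dict[int, Dict[str, bytes]] = {}

    def write(self, dst_rank: int, key: str, payload: bytes) -> None:
        self._store.setdefault(dst_rank, {})[key] = bytes(payload)

    def read(self, rank: int, key: str) -> bytes:
        return self._store[rank][key]


# Hypothetical registry: a real system might register one class per fabric.
BACKENDS: Dict[str, Type[Transport]] = {"loopback": LoopbackTransport}


def make_transport(name: str) -> Transport:
    """Instantiate a backend by name, failing loudly on unknown names."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown backend: {name!r}") from None


def migrate_kv_cache(t: Transport, dst_rank: int, request_id: str,
                     kv: bytes, chunk_size: int = 4) -> int:
    """Send kv to dst_rank in fixed-size chunks; return the chunk count."""
    chunks = [kv[i:i + chunk_size] for i in range(0, len(kv), chunk_size)]
    for idx, chunk in enumerate(chunks):
        t.write(dst_rank, f"{request_id}/{idx}", chunk)
    return len(chunks)


def fetch_kv_cache(t: Transport, rank: int, request_id: str,
                   num_chunks: int) -> bytes:
    """Reassemble the chunks written by migrate_kv_cache, in order."""
    return b"".join(t.read(rank, f"{request_id}/{i}") for i in range(num_chunks))
```

In this sketch, `migrate_kv_cache` depends only on `Transport`, so supporting another fabric would mean registering one more class in `BACKENDS` rather than changing the primitive. Whether CommBridge is structured this way is an assumption.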

Type

Manuscript

Publication

Manuscript in preparation

Last updated on Jun 12, 2026

LLM Serving, Distributed Systems, GPU Communication, RDMA, Machine Learning Systems, Cloud Infrastructure

Authors

Dezhi Yu

Senior ML Engineer

I am a research-oriented machine learning systems engineer working on foundation model infrastructure, closed-loop evaluation and optimization systems, and scalable AI platforms. My work focuses on building reliable Model-as-a-Service and Harness-as-a-Service platforms that connect data, training, inference, evaluation, and feedback loops into measurable, continuously improving AI products.

My recent work centers on Model-as-a-Service platforms and high-performance LLM inference. I develop serving infrastructure with vLLM and SGLang across model runtime integration, scheduling and continuous batching, KV-cache and memory management, distributed execution, observability, and reliability. This systems work is closely connected to my research on distributed disaggregated inference, preference optimization, instruction-tuning data selection, and multimodal evaluation.

My broader research centers on reinforcement learning infrastructure and reinforcement learning optimization algorithms for scalable AI systems. I am interested in how policy optimization, reward modeling, preference learning, offline RL, simulation environments, distributed rollout systems, and automated evaluation harnesses can be engineered together to improve model behavior. My goal is to build frontier AI systems that learn from feedback efficiently, evaluate progress rigorously, and remain dependable when deployed at scale.
