Efficient Cross-GPU Communication for Disaggregated LLM Serving

Jun 12, 2026
Dezhi Yu
Abstract
Large Language Model (LLM) serving increasingly relies on disaggregated architectures to improve resource utilization and scalability. However, existing LLM systems often depend on hardware-specific communication libraries tightly coupled to particular RDMA transports, resulting in fragmented implementations, limited portability, and significant engineering overhead when deploying across heterogeneous cloud environments. In this paper, we present CommBridge, a portable communication runtime for distributed LLM serving. CommBridge introduces a unified abstraction layer that decouples LLM communication primitives from underlying RDMA implementations, enabling seamless deployment across diverse networking backends including InfiniBand, RoCE, and AWS Elastic Fabric Adapter (EFA). The system exposes a minimal set of communication primitives optimized for key LLM workloads such as KV-cache migration, Mixture-of-Experts (MoE) dispatch and aggregation, model weight synchronization, and distributed inference scheduling. We implement CommBridge in production-scale LLM serving environments and evaluate it across clusters ranging from 64 to 2,048 GPUs. Experimental results demonstrate that CommBridge achieves performance comparable to highly optimized vendor-specific implementations while significantly reducing system complexity and improving deployment portability. Across representative LLM inference and training workloads, CommBridge improves end-to-end throughput by up to 2.3x and reduces communication latency by up to 47% compared with existing framework-integrated approaches. Our results suggest that communication portability can be achieved without sacrificing performance, providing a practical foundation for next-generation cloud-native LLM infrastructure.
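The decoupling described in the abstract can be illustrated with a minimal sketch. This is not CommBridge's actual API, which the abstract does not specify. Every name below (`Transport`, `LoopbackTransport`, `BACKENDS`, `migrate_kv_cache`) is hypothetical, and the backends are in-process stand-ins rather than real InfiniBand, RoCE, or EFA bindings. The sketch only shows how a workload-level primitive such as KV-cache migration might be written once against a narrow transport interface and then run unchanged on any registered backend.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical backend interface: one subclass per RDMA flavor."""

    @abstractmethod
    def write(self, peer: int, key: str, payload: bytes) -> None:
        """Place payload in the peer's memory under key (one-sided write)."""

    @abstractmethod
    def read(self, peer: int, key: str) -> bytes:
        """Fetch the bytes stored under key on the peer."""


class LoopbackTransport(Transport):
    """In-process stand-in for a real backend; it only copies bytes locally."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._memory: dict[tuple[int, str], bytes] = {}

    def write(self, peer: int, key: str, payload: bytes) -> None:
        self._memory[(peer, key)] = bytes(payload)

    def read(self, peer: int, key: str) -> bytes:
        return self._memory[(peer, key)]


# Assumption: backends are chosen by name at deployment time. A real system
# would register distinct classes here; this sketch reuses the loopback one.
BACKENDS = {
    "infiniband": lambda: LoopbackTransport("infiniband"),
    "roce": lambda: LoopbackTransport("roce"),
    "efa": lambda: LoopbackTransport("efa"),
}


def migrate_kv_cache(transport: Transport, blocks: dict[str, bytes], dst_rank: int) -> int:
    """Send every KV-cache block to dst_rank and return the bytes transferred.

    The function depends only on the Transport interface, so it is the same
    code path regardless of which backend was selected.
    """
    sent = 0
    for block_id, data in blocks.items():
        transport.write(dst_rank, block_id, data)
        sent += len(data)
    return sent


if __name__ == "__main__":
    kv_blocks = {"layer0/block0": b"\x00" * 4, "layer0/block1": b"\x01" * 4}
    for backend_name, make_transport in BACKENDS.items():
        transport = make_transport()
        moved = migrate_kv_cache(transport, kv_blocks, dst_rank=1)
        print(backend_name, moved)
```

In a design along these lines, supporting a new network would mean adding one `Transport` subclass and one registry entry, while primitives such as `migrate_kv_cache` stay untouched. Whether CommBridge is structured this way is an assumption.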
Type
Publication
Manuscript in preparation
Authors
Dezhi Yu
Senior ML Engineer

I am a research-oriented machine learning systems engineer working on foundation model infrastructure, alignment, and evaluation. My work focuses on building efficient, reliable systems for large language models while studying the algorithms and data choices that make these models more useful, controllable, and cost-effective in real applications.

At TikTok, my recent work centers on Model-as-a-Service platforms and high-performance LLM inference. I develop serving infrastructure with vLLM and SGLang across model runtime integration, scheduling and continuous batching, KV-cache and memory management, distributed execution, observability, and reliability. This systems work is closely connected to my research on distributed disaggregated inference, preference optimization, instruction-tuning data selection, multimodal evaluation, and retrieval-augmented biomedical summarization.

My broader research spans reinforcement learning for robotics, healthcare sequence modeling, privacy-preserving machine learning, and motion planning. I am especially interested in model-system co-design: how model architecture, inference algorithms, data curation, hardware utilization, scheduling, and distributed runtimes interact. My goal is to advance frontier AI systems that are faster to experiment with, more rigorous to evaluate, and dependable enough to serve at scale.