College of Science & Engineering
Twin Cities
This project focuses on optimizing multi-GPU tensor and pipeline parallelism for LLM and VLM serving by building a system that extends established single-GPU techniques (KV caching, prefix sharing, and radix-tree structures) to multiple GPUs. The novelty of the proposal lies in adapting these methods to the multi-GPU setting, with particular emphasis on load balancing and scheduling strategies. The approach includes developing a multi-GPU radix tree that tracks which prefixes are cached on which GPU, so that user requests can be routed to the GPU most likely to reuse cached state, and refining the cache-aware scheduler to weigh cache reuse against per-GPU load. The goal is to improve throughput and reduce both average and tail latency, strengthening the real-time serving capabilities of large-scale models.
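The cache-aware routing idea described above can be sketched in a few lines. This is a minimal illustrative assumption, not the project's actual design: a flat per-GPU prefix list stands in for the multi-GPU radix tree, and the `Router`, `GpuState`, and `route` names, along with the load-penalty weight of 0.1, are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class GpuState:
    # Token prefixes currently resident in this GPU's KV cache.
    # (A real system would use a radix tree instead of a flat list.)
    cached_prefixes: list = field(default_factory=list)
    queued_tokens: int = 0  # rough proxy for this GPU's current load

class Router:
    """Hypothetical cache-aware router: sends each request to the GPU
    that maximizes cached-prefix reuse minus a load penalty."""

    def __init__(self, num_gpus: int):
        self.gpus = [GpuState() for _ in range(num_gpus)]

    def _match_len(self, gpu: GpuState, tokens: list) -> int:
        # Longest shared prefix between the request and any cached entry.
        best = 0
        for prefix in gpu.cached_prefixes:
            n = 0
            for a, b in zip(prefix, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def route(self, tokens: list) -> int:
        # Score = tokens reusable from cache minus a (hypothetical)
        # load penalty, so hot GPUs do not absorb every request.
        def score(i: int) -> float:
            g = self.gpus[i]
            return self._match_len(g, tokens) - 0.1 * g.queued_tokens

        best = max(range(len(self.gpus)), key=score)
        g = self.gpus[best]
        # Only the non-cached suffix adds new work to the queue.
        g.queued_tokens += len(tokens) - self._match_len(g, tokens)
        g.cached_prefixes.append(tokens)
        return best
```

For example, if GPU 1 already holds the prefix `[1, 2, 3, 4]`, a request for `[1, 2, 3, 4, 5]` is routed to GPU 1 for cache reuse, while an unrelated request falls back to the less loaded GPU 0.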