College of Science & Engineering
Twin Cities
This project focuses on optimizing multi-GPU tensor and pipeline parallelism for LLM and VLM serving by building a system that extends established single-GPU techniques (KV caching, prefix sharing, and radix-tree structures) to multiple GPUs. The novelty of the proposal lies in adapting these methods to the multi-GPU setting, with particular emphasis on load balancing and scheduling strategies. The approach includes developing a multi-GPU radix tree that tracks which prefixes are cached on which GPU, so that user requests can be routed to the GPU most likely to reuse cached state, and refining the cache-aware scheduler to weigh cache reuse against per-GPU load. The goal is to improve throughput and reduce both average and tail latency, strengthening the real-time serving capabilities of large-scale models.
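The cache-aware routing idea described above can be sketched in a few lines. This is a minimal illustrative assumption, not the project's actual design: a flat per-GPU prefix list stands in for the multi-GPU radix tree, and the `Router`, `GpuState`, and `route` names, along with the load-penalty weight of 0.1, are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class GpuState:
    # Token prefixes currently resident in this GPU's KV cache.
    # (A real system would use a radix tree instead of a flat list.)
    cached_prefixes: list = field(default_factory=list)
    queued_tokens: int = 0  # rough proxy for this GPU's current load

class Router:
    """Hypothetical cache-aware router: sends each request to the GPU
    that maximizes cached-prefix reuse minus a load penalty."""

    def __init__(self, num_gpus: int):
        self.gpus = [GpuState() for _ in range(num_gpus)]

    def _match_len(self, gpu: GpuState, tokens: list) -> int:
        # Longest shared prefix between the request and any cached entry.
        best = 0
        for prefix in gpu.cached_prefixes:
            n = 0
            for a, b in zip(prefix, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def route(self, tokens: list) -> int:
        # Score = tokens reusable from cache minus a (hypothetical)
        # load penalty, so hot GPUs do not absorb every request.
        def score(i: int) -> float:
            g = self.gpus[i]
            return self._match_len(g, tokens) - 0.1 * g.queued_tokens

        best = max(range(len(self.gpus)), key=score)
        g = self.gpus[best]
        # Only the non-cached suffix adds new work to the queue.
        g.queued_tokens += len(tokens) - self._match_len(g, tokens)
        g.cached_prefixes.append(tokens)
        return best
```

For example, if GPU 1 already holds the prefix `[1, 2, 3, 4]`, a request for `[1, 2, 3, 4, 5]` is routed to GPU 1 for cache reuse, while an unrelated request falls back to the less loaded GPU 0.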