CV

The cutoff date for this data is Dec 14, 2024.

Basics

Name Zonghang Li
Degree Ph.D.
Email lizhuestc@gmail.com
Wechat lizh_uestc
Homepage https://lizonghang.github.io
Github https://github.com/Lizonghang
Google scholar https://scholar.google.com/citations?hl=en&user=1IA-XokAAAAJ
Summary A young geek and scholar who loves coding and exploring new technologies to bring fantastic ideas to life.

Work

  • 2024 - Present

    Abu Dhabi, UAE

  • 2020 - 2023

    Chengdu, CN

    Academic Instructor
    Yingcai Honors College of UESTC
    Guiding undergraduate students at the Yingcai Honors College of UESTC in conducting academic research and publishing high-quality academic papers.
    • My student Shenglai Zeng was selected as an outstanding student of UESTC and is now pursuing a Ph.D. at Michigan State University. Our paper won the 2023 Best Paper Award of IEEE Transactions on Cloud Computing.
  • 2019 - 2020

    Shenzhen, CN

    Invited Technical Instructor
    Peng Cheng Laboratory (PCL)
    Guiding PCL researchers in developing a communication-efficient geo-distributed machine learning system.
    • The developed system was adopted by PCL.

Education

  • 2021 - 2022

    Singapore

    Visiting Scholar
    Nanyang Technological University
    School of Computer Science and Engineering
  • 2018 - 2018

    Oxford, UK

    Visiting Scholar
    University of Oxford
    Lady Margaret Hall
  • 2014 - 2024

    Chengdu, CN

    Bachelor's and Ph.D.
    University of Electronic Science and Technology of China
    School of Information and Communication Engineering

Awards

Talks

Projects

  • 2024 - Present
    Prima.cpp - A distributed inference system serving 70B-scale LLMs on mobile devices with piped ring parallelism and automatic layer assignment
    Prima.cpp is a distributed inference system built on llama.cpp. Unlike existing on-device inference systems, which assume sufficient memory, user devices often lack the total memory required to run 70B-scale models. While llama.cpp uses mmap to avoid OOM errors, this approach incurs significant disk I/O latency. To address this, prima.cpp introduces a piped-ring parallel architecture that runs model layers in a ring across devices and overlaps disk loading with computation on other devices. However, assigning model layers to heterogeneous devices is challenging: heterogeneity in computing hardware, memory, disk, and OS makes inference latency hard to predict. Prima.cpp therefore implements a layer-to-device scheduler that models these factors and minimizes overall inference latency subject to memory constraints and disk loading delays. In a setup with 4 user devices (a laptop, a tablet, a smartphone, and a desktop, with a combined 30 GB of memory), prima.cpp achieves an inference latency of 1 second per token for Llama-3 70B. Further optimizations are underway.
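The overlap of disk loading with computation described above can be sketched as a simple producer/consumer pipeline. This is a minimal illustration, not the actual prima.cpp implementation; all names (LAYERS, load_from_disk, compute) are stand-ins.

```python
import threading
from queue import Queue

# Hypothetical layer shards assigned to this device; in prima.cpp a
# scheduler decides which layers each device actually holds.
LAYERS = ["layer0", "layer1", "layer2", "layer3"]

def load_from_disk(name):
    # Stand-in for mmap/disk I/O of one layer's weights.
    return f"weights({name})"

def compute(weights, x):
    # Stand-in for running one transformer layer.
    return x + [weights]

def piped_ring_forward(x):
    """Overlap disk loading of layer k+1 with computation of layer k."""
    prefetched = Queue(maxsize=1)

    def prefetcher():
        for name in LAYERS:
            prefetched.put(load_from_disk(name))

    t = threading.Thread(target=prefetcher)
    t.start()
    for _ in LAYERS:
        weights = prefetched.get()  # already loaded while the previous layer computed
        x = compute(weights, x)
    t.join()
    return x
```

With `maxsize=1`, at most one extra layer is resident beyond the one being computed, which mirrors why disk latency can hide behind compute without blowing up memory.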
  • 2024 - 2024
    TPI-LLM - A distributed inference system serving 70B-scale LLMs on mobile devices with tensor parallelism and sliding window memory scheduling
    TPI-LLM is an LLM serving system designed to bring LLM capabilities to low-resource mobile devices. While cloud LLM services have achieved great success, they raise privacy concerns: users do not want their conversations uploaded to the cloud, as these conversations may contain sensitive personal information. TPI-LLM addresses this by enabling LLM inference on mobile devices with limited computing and memory resources. The system leverages multiple mobile devices to perform inference through tensor parallelism, combined with a sliding-window memory scheduler that reduces the peak memory footprint. Currently, TPI-LLM can run Yi-34B in full precision on 4 laptops with 5 GB of memory each, and Llama 2-70B on 8 devices with 3 GB of memory each. Furthermore, TPI-LLM achieves 80%-90% lower TTFT and token latency than Transformers, Accelerate, and Galaxy, and 43%-55% lower than llama.cpp on larger models (>13B).
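The sliding-window idea above, keeping only a few layers resident at a time, can be sketched as follows. This is an illustrative toy, not the TPI-LLM API; the class and method names are invented for the example.

```python
from collections import OrderedDict

class SlidingWindowScheduler:
    """Keep at most `window` layers resident in memory, evicting the
    oldest as inference advances through the model."""

    def __init__(self, num_layers, window):
        self.num_layers = num_layers
        self.window = window
        self.resident = OrderedDict()  # layer_id -> weights

    def _load(self, k):
        return f"weights[{k}]"  # stand-in for reading layer k from disk

    def get(self, k):
        if k not in self.resident:
            if len(self.resident) >= self.window:
                self.resident.popitem(last=False)  # evict the oldest layer
            self.resident[k] = self._load(k)
        return self.resident[k]

# Walk an 8-layer model with a 2-layer window and track peak residency.
sched = SlidingWindowScheduler(num_layers=8, window=2)
peak = 0
for k in range(8):
    sched.get(k)
    peak = max(peak, len(sched.resident))
print(peak)  # peak residency never exceeds the window size
```

Peak memory is bounded by the window size rather than the model size, which is why a 70B model can fit devices with a few GB each.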
  • 2018 - 2023
    GeoMX - Accepted and adopted by ZTE Co., Ltd.
    GeoMX is a fast, unified distributed system for training ML models across geographically distributed data centers, delivering a 20x speedup under identical network conditions.
  • 2022 - 2024
    NetStorm - Accepted by IEEE/ACM TON (CCF A)
    NetStorm is a topology-adaptive, communication-efficient system for geo-distributed machine learning training, achieving a 7.5-9.2x speedup over the standard GeoMX system.
  • 2023 - 2024
    KlonetAI - An intelligent agent adopted in work accepted at NSDI 24 (CCF A)
    Klonet supports the deployment and testing of new network protocols and applications, such as distributed machine learning and federated learning, in a realistic environment; KlonetAI provides an AI agent for intelligent interaction with the Klonet platform.
  • 2022 - 2023
    AGOD - AI-generated optimization decision accepted by IEEE TMC (CCF A)
    This project implements the system design and the deep diffusion soft actor-critic (D2SAC) algorithm.
  • 2022 - 2023
    PerSF-SemCom - Personalized saliency-based semantic communication accepted by IEEE JSAC (CCF A)
    This project implements an energy-efficient, task-oriented semantic communication framework that represents image information at the semantic level with a triple-based scene graph, and designs a personalized semantic encoder based on user interests to meet personalized saliency requirements.
  • 2019 - 2021
    NBSync - An asynchronous pipelining scheduler accepted by IEEE TSC (CCF A)
    NBSync is a novel training algorithm for distributed ML over WANs that greatly speeds up model training by parallelizing local computing and global synchronization. NBSync employs a carefully designed pipelining scheme that relaxes the sequential dependency between local computing and global synchronization and processes them in parallel, overlapping their overhead in the time dimension. NBSync also enables flexible, differentiated, and dynamic local computing on workers to maximize the overlap ratio in dynamically heterogeneous training environments.
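The pipelining scheme above, where a round's WAN synchronization overlaps the next round's local computation, can be sketched with a background thread. This is a schematic toy under invented names (local_compute, global_sync), not the NBSync implementation.

```python
import threading

def local_compute(model, data):
    return model + data  # stand-in for a round of local gradient computation

def global_sync(update):
    return update        # stand-in for slow parameter exchange over the WAN

def nbsync_train(rounds):
    """Pipeline: while round t's update synchronizes over the WAN,
    round t+1's local computation already proceeds on a stale model."""
    model, sync_thread, result = 0, None, {}

    def do_sync(u):
        result["synced"] = global_sync(u)

    for _ in range(rounds):
        update = local_compute(model, 1)  # may use a not-yet-synced model
        if sync_thread:
            sync_thread.join()            # collect the previous round's sync
            model = result["synced"]
        sync_thread = threading.Thread(target=do_sync, args=(update,))
        sync_thread.start()               # sync overlaps the next round's compute
    sync_thread.join()
    return result["synced"]
```

Relaxing the compute-then-sync dependency this way hides WAN latency behind computation, at the cost of computing on slightly stale models, which is the trade-off NBSync manages.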
  • 2018 - 2019
    ESync - An efficient DML synchronization algorithm accepted by IEEE TSC (CCF A)
    ESync is an efficient synchronization algorithm designed for distributed ML tasks in heterogeneous clusters, i.e., clusters composed of computing devices with different computing capabilities.
  • 2018 - 2025
    Other Programs
    These programs are closed-source due to IP and confidentiality protocols.
    • 2018-2020: Advanced Distributed Machine Learning Techniques. Provincial and Ministerial Key Program. Approved.
    • 2018-2019: Advanced Data Center Network Architectures. Huawei Technologies Co., Ltd. Approved.
    • 2019-2020: Communication Optimizations for Distributed Machine Learning over WANs. Peng Cheng Laboratory. Approved.
    • 2021-2025: Computing Power Network and New Communication Primitives. ZTE Communication Co., Ltd. In progress.
    • 2022-2023: Accelerating Data Transmission for Geographically Distributed Machine Learning. Zhejiang Lab. Approved.
    • 2022-2023: Advanced Network Technologies for Giant Connections, Large Traffic, and Low Latency in the Rapid Evolution of 5G/B5G. National Key Research and Development Program. Approved.