LLM Query Scheduling with Prefix Reuse and Latency Constraints

G. Dexter, S. Tang, A. Fatahi Baarzi, Q. Song, T. Dharamsi, A. Gupta (LinkedIn; Nubank)

NeurIPS 2025 (Poster); also arXiv 2502.04677 · 2025 · ★★★★☆4/5

My reading notes

Why it matters

Directly relevant to Arun's ML-systems work on LLM inference serving and scheduling: it formalizes when prefix-cache (RadixAttention) scheduling helps or hurts tail latency, and gives a tunable, drop-in scheduler (k-LPM) for SGLang/vLLM-style serving stacks like the ones that would serve his MIRROR platform.

Summary

This paper studies how to order incoming LLM queries when the serving engine reuses shared KV-cache prefixes (RadixAttention, the radix-tree prefix-reuse mechanism behind SGLang). The authors build a simple roofline-inspired cost model for prefill-dominated inference in which a query's processing time scales with its length minus the prefix it shares with the previously processed query, plus a c_attn term that captures the balance between linear FFN cost and quadratic attention cost. Within this model they prove that deciding whether a stream of timestamped queries can be scheduled to meet a per-query time-to-first-token (TTFT) constraint is NP-hard, via reduction from 3-PARTITION. This is a sharp contrast to the easy cases: with no prefix reuse, FCFS is optimal, and with uniform arrival times, longest-prefix-match (LPM) is optimal. So prefix reuse combined with non-uniform online arrivals is what makes scheduling genuinely hard.

To get a usable result despite the hardness, they introduce a structured data-generative model (a regular arrival shuffled queue) that mirrors real prefix-sharing workloads: a shared base/user prefix plus a per-query unique document, i.e. a height-two prefix tree. They propose k-LPM, which interleaves FCFS-style fairness with LPM-style reuse: process the oldest queued query, then take k-1 greedy longest-prefix-match steps. It reduces to FCFS at k=1 and LPM at k=infinity. They prove that under the generative model k-LPM achieves lower worst-case TTFT than both FCFS and LPM simultaneously for a range of inter-arrival gaps and prefix lengths, and that an approximation algorithm exists for the (1-p)-percentile TTFT constraint running in O(n*exp(1/p log 1/p)) time.

Empirically they serve Llama-3.1-8B-Instruct on SGLang v0.4.1 across eight A100 GPUs, using an industrial 360Brew-style prompt set (shared instruction/profile/history prefix, varying question). A small k (k=2) gives the best P99 TTFT across a wide range of Poisson request rates, beating both FCFS and LPM, and the empirical behavior tracks the theory even where the theoretical assumptions are relaxed. The takeaways match intuition: FCFS wins at low load, LPM wins at high load, and k-LPM keeps the advantage across the spectrum with a single tunable knob.

Key ideas

Takeaways for my work

LLM inference servingschedulingprefix caching / RadixAttentionlatency / TTFTML systems