LLM Query Scheduling with Prefix Reuse and Latency Constraints

G. Dexter, S. Tang, A. Fatahi Baarzi, Q. Song, T. Dharamsi, A. Gupta (LinkedIn; Nubank)

NeurIPS 2025 (Poster); also arXiv 2502.04677 · 2025 · ★★★★☆ 4/5

My reading notes

Why it matters

Directly relevant to Arun's ML-systems work on LLM inference serving and scheduling: it formalizes when prefix-cache (RadixAttention) scheduling helps or hurts tail latency, and gives a tunable, drop-in scheduler (k-LPM) for SGLang/vLLM-style serving stacks like the ones that would serve his MIRROR platform.

Summary

This paper studies how to order incoming LLM queries when the serving engine reuses shared KV-cache prefixes (RadixAttention, the radix-tree prefix-reuse mechanism behind SGLang). The authors build a simple roofline-inspired cost model for prefill-dominated inference in which a query's processing time scales with its length minus the prefix it shares with the previously processed query, plus a c_attn term that captures the balance between linear FFN cost and quadratic attention cost. Within this model they prove that deciding whether a stream of timestamped queries can be scheduled to meet a per-query time-to-first-token (TTFT) constraint is NP-hard, via reduction from 3-PARTITION. This is a sharp contrast to the easy cases: with no prefix reuse, FCFS is optimal, and with uniform arrival times, longest-prefix-match (LPM) is optimal. So prefix reuse combined with non-uniform online arrivals is what makes scheduling genuinely hard.
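To make the cost model concrete for myself, here is a minimal sketch. The linear term (length minus shared prefix with the previously processed query) follows the description above. The exact form of the c_attn term is not spelled out in these notes, so the quadratic expression below is my own assumption, and the names shared_prefix_len and prefill_cost are hypothetical.

```python
from typing import Optional, Sequence


def shared_prefix_len(a: Sequence, b: Sequence) -> int:
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefill_cost(query: Sequence,
                 previous: Optional[Sequence] = None,
                 c_attn: float = 0.0) -> float:
    """Illustrative prefill cost of `query` right after `previous` was processed."""
    overlap = shared_prefix_len(query, previous) if previous is not None else 0
    # Linear (FFN-like) part: only the tokens not covered by the cached prefix.
    linear = len(query) - overlap
    # ASSUMPTION: attention work saved by the cache is quadratic in the overlap.
    # The notes only say c_attn balances linear FFN cost against quadratic attention cost.
    quadratic = len(query) ** 2 - overlap ** 2
    return linear + c_attn * quadratic
```

With c_attn = 0 this is just the count of uncached tokens, so prefill_cost([1, 2, 3, 4], [1, 2, 9]) is 2, and the same query with no predecessor costs 4.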

To get a usable result despite the hardness, they introduce a structured data-generative model (a regular arrival shuffled queue) that mirrors real prefix-sharing workloads: a shared base/user prefix plus a per-query unique document, i.e. a height-two prefix tree. They propose k-LPM, which interleaves FCFS-style fairness with LPM-style reuse: process the oldest queued query, then take k-1 greedy longest-prefix-match steps. It reduces to FCFS at k=1 and LPM at k=infinity. They prove that under the generative model k-LPM achieves lower worst-case TTFT than both FCFS and LPM simultaneously for a range of inter-arrival gaps and prefix lengths. They also show that an approximation algorithm for the (1-p)-percentile TTFT constraint exists, running in O(n*exp(1/p log 1/p)) time.
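The oldest-then-greedy rule is easy to sketch on a static queue where everything has already arrived. This is my own illustration, not the paper's Algorithm 1 or the SGLang patch. Assumptions: the cache is approximated by the most recently processed query (as in the cost model), ties in prefix length go to the oldest query, and k_lpm_order is a hypothetical name.

```python
import math
from typing import List, Sequence


def shared_prefix_len(a: Sequence, b: Sequence) -> int:
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def k_lpm_order(queue: List[Sequence], k: float) -> List[int]:
    """Return the processing order (indices into `queue`, which is in arrival order)."""
    pending = list(range(len(queue)))  # oldest first
    order: List[int] = []
    while pending:
        # Fairness step: always serve the oldest waiting query first.
        current = pending.pop(0)
        order.append(current)
        # Reuse steps: up to k-1 greedy longest-prefix-match picks.
        steps = 0
        while pending and steps < k - 1:
            # max() returns the first maximal element, so ties go to the oldest.
            best = max(pending,
                       key=lambda i: shared_prefix_len(queue[i], queue[current]))
            pending.remove(best)
            order.append(best)
            current = best
            steps += 1
    return order


# Toy queue: shared "sys" prefix, then a user prefix, then a unique document.
queue = [
    ("sys", "A", "d1"), ("sys", "B", "d1"), ("sys", "C", "d1"),
    ("sys", "A", "d2"), ("sys", "B", "d2"), ("sys", "C", "d2"),
    ("sys", "A", "d3"),
]
fcfs = k_lpm_order(queue, 1)          # [0, 1, 2, 3, 4, 5, 6]
k2 = k_lpm_order(queue, 2)            # [0, 3, 1, 4, 2, 5, 6]
lpm = k_lpm_order(queue, math.inf)    # [0, 3, 6, 1, 4, 2, 5]
```

On this toy queue, k=2 pairs each oldest query with one same-user follow-up, while k=infinity drains user A completely before the older B and C queries get a turn, which is the fairness cost the k knob is meant to control.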

Empirically they serve Llama-3.1-8B-Instruct on SGLang v0.4.1 across eight A100 GPUs, using an industrial 360Brew-style prompt set (shared instruction/profile/history prefix, varying question). A small k (k=2) gives the best P99 TTFT across a wide range of Poisson request rates, beating both FCFS and LPM, and the empirical behavior tracks the theory even where the theoretical assumptions are relaxed. The takeaways match intuition: FCFS wins at low load, LPM wins at high load, and k-LPM keeps the advantage across the spectrum with a single tunable knob.

Key ideas

Prefix-aware online scheduling is NP-hard: deciding feasibility of per-query TTFT constraints under RadixAttention is shown NP-hard by reduction from 3-PARTITION, even though FCFS (no reuse) and LPM (uniform arrivals) are each optimal in their easy special cases.
A tractable prefill-dominated cost model: each query's cost is proportional to its length minus the maximal prefix overlap with the previously processed query, with a c_attn constant capturing the FFN-vs-quadratic-attention balance for a fixed architecture.
k-LPM generalizes FCFS and LPM: per Algorithm 1, process the oldest query, then do k-1 greedy longest-prefix-match steps; k=1 is FCFS, k=infinity is LPM, and k trades fairness/waiting-time against cache reuse.
Provable improvement under a realistic tree-structured workload model (a regular arrival shuffled queue: shared user prefix + unique doc, a height-two tree): k-LPM beats both FCFS and LPM on worst-case TTFT for a range of arrival gaps and prefix lengths.
Existence of a percentile-TTFT approximation algorithm running in O(n*exp(1/p log 1/p)) time that either certifies infeasibility or returns a schedule meeting the (1-p)-percentile constraint.
Real serving validation: Llama-3.1-8B-Instruct on SGLang v0.4.1 over 8xA100; k=2 gives the best P99 TTFT across Poisson request rates; FCFS wins at low load, LPM wins at high load, and k-LPM wins broadly.

Takeaways for my work

For MIRROR's K8s serving plane: prefix-cache scheduling order is a real tail-latency lever, and k-LPM is a low-effort SGLang patch (implemented as a minor extension of the existing LPM scheduler) worth trying when prompts share a base/system prefix.
The 'oldest-then-k-greedy' pattern is a clean, transferable recipe for balancing fairness against cache reuse; the same FCFS-vs-greedy tension shows up in any batch system with shared-work reuse, not just LLM prefill.
Choose k empirically (k=2 or 3 worked well even when it did not match the true prefix replica count of 4); tune via back-testing since the optimal k depends on load and prefix structure.
Useful framing for systems interviews/RS roles: prefix reuse turns an easy FCFS scheduling problem NP-hard, a concrete example of how caching/memoization couples queue items and breaks naive optimality.
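A rough way to back-test k before touching a real serving stack, as suggested in the takeaways above, is a single-server simulation over a synthetic workload. Everything here is an assumption for illustration: the c_attn = 0 cost (uncached tokens only), the cache being just the last processed query, the prefix lengths, the arrival rate, and the names simulate, make_workload and sweep. It says nothing about the numbers in the paper.

```python
import math
import random


def overlap(a, b):
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def simulate(queries, k, time_per_token=1.0):
    """queries: list of (arrival_time, tokens), sorted by arrival. Returns per-query TTFT."""
    pending = []                      # indices of arrived, unserved queries, oldest first
    ttft = [None] * len(queries)
    now, nxt, served = 0.0, 0, 0
    last, greedy_left = (), 0         # last processed tokens; greedy picks remaining
    while served < len(queries):
        while nxt < len(queries) and queries[nxt][0] <= now:
            pending.append(nxt)       # admit everything that has arrived by `now`
            nxt += 1
        if not pending:
            now = queries[nxt][0]     # idle until the next arrival
            continue
        if greedy_left > 0:
            # Greedy longest-prefix-match step; ties go to the oldest query.
            i = max(pending, key=lambda j: overlap(queries[j][1], last))
            greedy_left -= 1
        else:
            i = pending[0]            # fairness step: oldest waiting query
            greedy_left = k - 1
        pending.remove(i)
        tokens = queries[i][1]
        # ASSUMPTION: cost is the number of uncached tokens (c_attn = 0).
        now += time_per_token * (len(tokens) - overlap(tokens, last))
        ttft[i] = now - queries[i][0]
        last = tokens
        served += 1
    return ttft


def make_workload(n_users=20, docs_per_user=4, rate=0.01, seed=0):
    """Shuffled queries with a base prefix, a per-user prefix, and a unique document."""
    rng = random.Random(seed)
    items = [("base",) * 50 + (f"user{u}",) * 100 + (f"u{u}d{d}",) * 20
             for u in range(n_users) for d in range(docs_per_user)]
    rng.shuffle(items)
    t, out = 0.0, []
    for tokens in items:
        t += rng.expovariate(rate)    # Poisson arrivals: exponential inter-arrival gaps
        out.append((t, tokens))
    return out


def sweep(ks, **workload_kwargs):
    """Worst-case TTFT for each k on the same synthetic workload."""
    workload = make_workload(**workload_kwargs)
    return {k: max(simulate(workload, k)) for k in ks}
```

Calling sweep([1, 2, 3, 4, math.inf]) at a few values of rate gives a quick feel for how the best k moves with load; replaying logged production prompts through simulate instead of make_workload would be the more honest back-test.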

LLM inference serving · scheduling · prefix caching / RadixAttention · latency / TTFT · ML systems