Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems

K. Behdin, A. Fatahibaarzi, Q. Song, Y. Dai, A. Gupta, Z. Wang, et al. (LinkedIn / MIT)

EMNLP 2025 (Industry Track) · 2025 · ★★★★☆ 4/5

My reading notes

Why it matters

This is a concrete, production-validated recipe for the exact systems problem Arun's MIRROR platform faces: compressing a large foundation model and serving it under tight latency on H100s. The distillation + structured pruning + quantization + prefix-caching stack maps directly to LLM inference serving, model scaling, and the train-on-Slurm/serve-on-K8s split.

Summary

LinkedIn engineers describe how they take an internal 100B+ Mixture-of-Experts foundation model (FM) for recommendation ranking, built on a Mixtral-style architecture with Llama-3.1-8B experts and text-based featurization, and compress it into small language models (SLMs) that meet online latency budgets. The core pipeline has three stages: distill the full model, apply one-shot structured pruning to cut size, then re-distill the pruned model to recover quality. They report more than 20x size reduction for predictive ranking tasks and over 5x for a reasoning task, with only modest accuracy loss measured in AUC (predictive) and validation loss / task metrics (generative).

On the training side, the paper is a careful ablation of distillation recipes. Knowledge distillation (KD) with teacher logit supervision consistently beats plain SFT at preserving FM quality (e.g., an 8B student's AUC changes by only -0.06% under KD vs -0.62% under SFT). For reasoning, a two-stage recipe (word-level forward-KL distillation followed by on-policy distillation, oFKL) outperforms single-stage training, and they report several non-obvious results: a smaller teacher can beat a larger one, students can surpass their teachers, and the on-policy sampling fraction, generation length (~300 tokens), and temperature (0.8-0.9) all need tuning. Structured pruning uses the OSSCAR algorithm on MLP neurons and attention heads; gradual multi-step pruning with KD between steps recovers AUC better than one-shot pruning, achieving near-lossless 3B-to-2.4B compression.
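To make the KD vs SFT distinction concrete for myself, here is a minimal sketch of a logit-supervised distillation loss for a single token position. It is not the paper's exact objective: the mixing weight `alpha`, the function names, and the softmax `temperature` are my own assumptions (and this temperature is the distillation softmax temperature, not the 0.8-0.9 sampling temperature mentioned above).

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student) over one token's vocabulary."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

def kd_loss(teacher_logits, student_logits, label, alpha=0.5, temperature=1.0):
    """Blend of teacher-matching KL and ordinary cross-entropy on the label.

    alpha=0 reduces to plain SFT; alpha=1 is pure logit distillation.
    The blend itself is an illustrative assumption, not taken from the paper.
    """
    ce = -math.log(softmax(student_logits)[label])
    kl = forward_kl(teacher_logits, student_logits, temperature)
    return alpha * kl + (1 - alpha) * ce
```

The KL term is zero only when the student reproduces the teacher's full output distribution, which is a much richer signal than the single hard label SFT sees.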

The deployment half is the systems payoff. They serve on 8x H100 nodes using SGLang with tensor parallelism, FP8 weight+activation quantization, FlashInfer attention, and RadixAttention prefix caching. Because ranking generates a single output token, the workload is prefill-dominated, so TTFT is the key metric; sharing a long per-user prefix across k candidate items ("hot prefill" via KV-cache reuse) means ranking more items barely raises latency. Pruning attention heads cuts attention latency ~40% and prefill ~28%. For the generative reasoning use case they compare quantization schemes across H100 and older A100 GPUs (FP8 best on H100; INT4 W4A16 best for decode on A100, INT8 W8A8 better for prefill-heavy work), and report a 20%+ online quality lift from KD plus data changes in a live A/B test.
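The "hot prefill" argument is easy to check with a toy token-count model. The sketch below is not SGLang's RadixAttention (which uses a radix tree over real KV blocks); it is a hypothetical linear-scan cache that only counts how many prompt tokens would need fresh prefill compute when k candidate prompts share one long per-user prefix. The class name and the token lengths are made up for illustration.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by sequences a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class ToyPrefixCache:
    """Counts prefill work, pretending KV entries for seen prompts are kept."""

    def __init__(self):
        self.cached = []    # prompts whose KV entries we pretend to hold
        self.computed = 0   # total tokens that needed fresh prefill

    def prefill(self, tokens):
        # Reuse the longest prefix shared with any previously seen prompt.
        reuse = max((common_prefix_len(tokens, c) for c in self.cached),
                    default=0)
        new = len(tokens) - reuse
        self.computed += new
        self.cached.append(list(tokens))
        return new

# One 1000-token user prefix, five 20-token candidate items.
user_prefix = list(range(1000))
candidates = [[2000 + i] * 20 for i in range(5)]

cache = ToyPrefixCache()
per_request = [cache.prefill(user_prefix + c) for c in candidates]
# per_request == [1020, 20, 20, 20, 20]

no_cache_total = sum(len(user_prefix) + len(c) for c in candidates)
# cache.computed == 1100, no_cache_total == 5100
```

Only the first request pays for the user prefix; each additional candidate costs just its own item tokens, which is why ranking more items barely moves TTFT in a prefill-dominated workload.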

Key ideas

Takeaways for my work

LLM inference serving · model compression · knowledge distillation · structured pruning · recommendation systems