arXiv 2026 (v1), tech report · 2026
My reading notes
A single foundation-model-initialized encoder that produces a unified retrieval space across modalities is directly useful for Arun's MIRROR RAG/serving stack and for cross-modal anomaly retrieval; the multi-stage contrastive recipe and synthetic-data ablations are a clean reference for training general-purpose embedders. Native audio beating ASR cascades is a concrete argument against pipelined preprocessing that generalizes to his spatiotemporal pipelines.
This is a Google DeepMind technical report introducing Gemini Embedding 2, an embedding model that natively encodes text, image, video, and audio (and arbitrary interleaved combinations) into a single high-dimensional vector space (up to 3072 dims, with Matryoshka support optimized for 768 and 1536). Rather than CLIP-style late fusion with separate per-modality encoders, the model is initialized from a multimodal Gemini backbone, given bidirectional attention, and adapted into an encoder: tokens go through the transformer, are mean-pooled, then linearly projected to the target dimension. Treating the Gemini init as the "pre-training" stage is the core architectural bet, since deep cross-modal fusion is inherited rather than learned from paired data alone.
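The pooling head described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the report's implementation: the names `embed` and `truncate`, the toy hidden size of 64, the random projection matrix, and the renormalization after truncation are all assumptions made for the example. Only the mean-pool, linear projection, and the 3072/768 dimensions come from the notes.

```python
import numpy as np

def embed(token_states, mask, proj):
    """Mean-pool token states over non-padding tokens, then project linearly."""
    m = mask[:, :, None].astype(token_states.dtype)        # (B, T, 1)
    pooled = (token_states * m).sum(axis=1) / np.maximum(m.sum(axis=1), 1.0)
    return pooled @ proj                                    # (B, out_dim)

def truncate(emb, dim):
    """Matryoshka-style use: keep the leading `dim` coordinates, then L2-normalize.
    Renormalizing after truncation is an assumption of this sketch."""
    cut = emb[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

rng = np.random.default_rng(0)
hidden, out_dim = 64, 3072          # hidden=64 is a toy value; 3072 is from the notes
states = rng.normal(size=(2, 5, hidden))    # stand-in for transformer outputs
mask = np.array([[1, 1, 1, 0, 0],           # first sequence has two padding tokens
                 [1, 1, 1, 1, 1]])
proj = rng.normal(size=(hidden, out_dim)) / np.sqrt(hidden)

full = embed(states, mask, proj)    # shape (2, 3072)
small = truncate(full, 768)         # shape (2, 768), unit-norm rows
```

Because the mask zeroes out padding before averaging, padded positions have no influence on the embedding, and the truncated vectors can be compared with a plain dot product.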
Training is multi-task and multi-stage, using an NCE contrastive loss with in-batch negatives (cosine similarity scaled by a temperature, optional hard negatives, plus a masking term for classification-style tasks with few labels). The recipe has three phases: Pre-Fine-Tuning on large, noisy query-target pairs over image/text/code with large batches; Fine-Tuning over text, code, document, image, audio, and video tasks (many with hard-negative triplets), with per-task batch sizes and empirically tuned sampling rates; and a final Model Soup stage that averages checkpoints across runs for extra generalization. Randomly dropping the task-string prefix during training improves robustness when no task instruction is given.
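To make the loss and the soup step concrete, here is a minimal NumPy sketch of a contrastive loss with in-batch negatives and one optional hard negative per query, followed by uniform checkpoint averaging. The function names, the temperature of 0.05, the single-hard-negative layout, and the uniform (unweighted) average are assumptions for illustration. The report's masking term and its actual soup weighting are not reproduced here.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def in_batch_nce(q, pos, hard_neg=None, temperature=0.05):
    """Cross-entropy over cosine similarities; row i's correct column is i."""
    q, pos = l2norm(q), l2norm(pos)
    logits = q @ pos.T / temperature                       # (B, B)
    if hard_neg is not None:
        # One hard negative per query, appended as an extra column.
        hn = np.sum(q * l2norm(hard_neg), axis=1, keepdims=True) / temperature
        logits = np.concatenate([logits, hn], axis=1)      # (B, B + 1)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(q.shape[0])
    return float(-np.mean(log_probs[idx, idx]))

def uniform_soup(checkpoints):
    """Average a list of parameter dicts key by key (uniform weights assumed)."""
    return {k: np.mean([c[k] for c in checkpoints], axis=0)
            for k in checkpoints[0]}

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
pos = q + 0.01 * rng.normal(size=(4, 8))    # positives sit close to their queries
hard = rng.normal(size=(4, 8))              # stand-in hard negatives

loss_plain = in_batch_nce(q, pos)
loss_hard = in_batch_nce(q, pos, hard_neg=hard)
souped = uniform_soup([{"w": np.array([1.0, 3.0])},
                       {"w": np.array([3.0, 5.0])}])
```

Adding a hard negative only enlarges the softmax denominator, so `loss_hard` is never smaller than `loss_plain` for the same batch.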
The report claims state-of-the-art or near-SOTA results across a broad benchmark suite: 69.9 mean on MTEB Multilingual and 84.0 on MTEB Code (vs 76.0 for the prior text-only Gemini Embedding, and ahead of domain-specific voyage-code-3), plus strong cross-modal retrieval (62.9 R@1 on MSCOCO text-to-image, a 91.2 text-to-image mean and a 63.1 image-to-text mean across the four caption benchmarks, and 68.8 NDCG@10 text-to-video on Vatex). Document retrieval on ViDoRe V2 (64.9) is competitive with Voyage and ahead of Amazon Nova MME. Two findings stand out for systems work. First, native audio retrieval beats an ASR-to-text cascade (73.99 vs 70.40 average MRR@10, with the gap widening to 72.56 vs 67.55 in the cross-lingual setting, because the embedding preserves acoustic ambiguity instead of committing to a hard transcription). Second, Gemini-synthesized training data drives a large jump on hard MTEB Code tasks (the model trained with synthetic data reaches 86.3, a +15.8-point average gain over the prior text-only Gemini Embedding on the three selected tasks). Zero-shot performance in specialized domains (microscopy, astronomy, fine art, culinary) is notably more stable than that of the CLIP/SigLIP2/TIPS baselines.
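For reference when reading the retrieval numbers, the two simplest metrics quoted above (R@1 and MRR@10) can be computed from a query-by-candidate similarity matrix as sketched below. This assumes exactly one relevant item per query and breaks ties in the query's favor. The benchmarks' official evaluation code may differ, and the toy similarity matrix is made up for the example.

```python
import numpy as np

def ranks_of_gold(sim, gold):
    """1-based rank of each query's gold item under descending similarity.
    Ties are resolved optimistically (only strictly higher scores count)."""
    gold_scores = sim[np.arange(len(gold)), gold][:, None]
    return 1 + (sim > gold_scores).sum(axis=1)

def mrr_at_k(sim, gold, k=10):
    """Mean reciprocal rank, counting 0 when the gold item falls outside the top k."""
    r = ranks_of_gold(sim, gold)
    return float(np.mean(np.where(r <= k, 1.0 / r, 0.0)))

def recall_at_k(sim, gold, k=1):
    """Fraction of queries whose gold item is ranked within the top k."""
    return float(np.mean(ranks_of_gold(sim, gold) <= k))

# Toy example: 3 queries, 3 candidates; the gold item for query i is candidate i.
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],
                [0.5, 0.6, 0.7]])
gold = np.array([0, 1, 2])

mrr = mrr_at_k(sim, gold)        # gold ranks are 1, 2, 1 -> (1 + 0.5 + 1) / 3
r_at_1 = recall_at_k(sim, gold)  # 2 of 3 queries have the gold item at rank 1
```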