Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Gemini Embedding Team, Google DeepMind (M. Shanbhogue, Z. Li, S. Zhang, G. Hernández Ábrego, et al.)

arXiv 2026 (v1), tech report · 2026 · ★★★½☆ 3.5/5

My reading notes

Why it matters

A single foundation-model-initialized encoder that produces a unified retrieval space across modalities is directly useful for Arun's MIRROR RAG/serving stack and for cross-modal anomaly retrieval; the multi-stage contrastive recipe and synthetic-data ablations are a clean reference for training general-purpose embedders. Native audio beating ASR cascades is a concrete argument against pipelined preprocessing that generalizes to his spatiotemporal pipelines.

Summary

This is a Google DeepMind technical report introducing Gemini Embedding 2, an embedding model that natively encodes text, image, video, and audio (and arbitrary interleaved combinations) into a single high-dimensional vector space (up to 3072 dims, with Matryoshka support optimized for 768 and 1536). Rather than CLIP-style late fusion with separate per-modality encoders, the model is initialized from a multimodal Gemini backbone, given bidirectional attention, and adapted into an encoder: tokens go through the transformer, are mean-pooled, then linearly projected to the target dimension. Treating the Gemini init as the "pre-training" stage is the core architectural bet, since deep cross-modal fusion is inherited rather than learned from paired data alone.
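The pooling and projection path described above can be sketched in a few lines. This is a minimal illustration, not the report's implementation: the function names (`mean_pool`, `project`, `matryoshka_truncate`), the toy dimensions, the identity projection matrix, and the choice to L2-renormalize after truncation are all assumptions made for the example.

```python
import math

def mean_pool(token_embeddings):
    """Average a list of per-token vectors into a single vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[i] for tok in token_embeddings) / n for i in range(dim)]

def project(vector, weight):
    """Apply a linear projection; `weight` has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def matryoshka_truncate(embedding, dim):
    """Keep the first `dim` coordinates and renormalize (assumed convention)."""
    return l2_normalize(embedding[:dim])

# Toy input: 3 tokens with hidden size 4 (placeholder values).
tokens = [
    [1.0, 0.0, 2.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [2.0, 3.0, 1.0, 0.0],
]
# Identity matrix standing in for the learned projection (assumption).
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

pooled = mean_pool(tokens)                      # [2.0, 1.0, 1.0, 0.0]
full = l2_normalize(project(pooled, identity))  # full-size embedding
small = matryoshka_truncate(full, 2)            # smaller budget, unit length
```

In a real deployment the truncation step is what lets one stored model serve several dimension budgets: the index keeps only the leading coordinates, trading some accuracy for memory and latency.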

Training is multi-task and multi-stage with an NCE / in-batch-negatives contrastive loss (cosine similarity, temperature, optional hard negatives, plus a masking term for classification-style tasks with few labels). The recipe has three phases: Pre-Fine-Tuning on large noisy query-target pairs over image/text/code with big batches; Fine-Tuning over text, code, document, image, audio, and video tasks (many with hard-negative triplets), with per-task batch sizes and empirically tuned sampling rates; and a final Model Soup stage that averages checkpoints across runs for extra generalization. A randomly dropped task-string prefix improves robustness when no task instruction is given.
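A rough sketch of a contrastive loss with in-batch negatives, cosine similarity, a temperature, and a label-based mask may help make the objective concrete. The exact formulation in the report may differ: the function name `in_batch_nce_loss`, the temperature value of 0.05, and the rule "drop in-batch targets that share the query's label" are assumptions for illustration, and explicit hard negatives are omitted for brevity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def in_batch_nce_loss(queries, targets, labels=None, temperature=0.05):
    """Mean loss where targets[j], j != i, serve as negatives for queries[i].

    If `labels` is given, in-batch targets that share the query's label are
    masked out of the denominator so they are not treated as negatives.
    """
    losses = []
    for i, query in enumerate(queries):
        positive_logit = cosine(query, targets[i]) / temperature
        logits = [positive_logit]
        for j, target in enumerate(targets):
            if j == i:
                continue
            if labels is not None and labels[j] == labels[i]:
                continue  # masked: same label, so not a true negative
            logits.append(cosine(query, target) / temperature)
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        losses.append(log_denominator - positive_logit)
    return sum(losses) / len(losses)

# Toy batch of two query/target pairs (placeholder vectors).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_nce_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_nce_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
masked = in_batch_nce_loss(
    queries, [[0.0, 1.0], [1.0, 0.0]], labels=["cat", "cat"]
)
```

When each query matches its own target the loss is near zero, and when the targets are swapped it is large. With both rows sharing a label, the mask removes the only negative, so the loss collapses to zero; this is the behavior that keeps few-label classification batches from penalizing same-class pairs.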

The model reports state-of-the-art or near-SOTA results across a broad benchmark suite: 69.9 mean on MTEB Multilingual and 84.0 on MTEB Code (vs 76.0 for the prior text-only Gemini Embedding, and ahead of the domain-specific voyage-code-3), plus strong cross-modal retrieval (62.9 R@1 on MSCOCO text-to-image, a 91.2 text-to-image mean and a 63.1 image-to-text mean across the four caption benchmarks, and 68.8 NDCG@10 text-to-video on Vatex). Document retrieval on ViDoRe V2 (64.9) is competitive with Voyage and ahead of Amazon Nova MME. Two findings stand out for systems work. First, native audio retrieval beats an ASR-to-text cascade: 73.99 vs 70.40 MRR@10 on average, with the gap widening to 72.56 vs 67.55 in the cross-lingual setting, because the embedding preserves acoustic ambiguity instead of committing to a hard transcription. Second, Gemini-synthesized training data drives a large jump on hard MTEB Code tasks: the model trained with synthetic data reaches 86.3, a +15.8-point average gain over the prior text-only Gemini Embedding on the three selected tasks. Zero-shot performance in specialized domains (microscopy, astronomy, fine art, culinary) is notably more stable than that of the CLIP/SigLIP2/TIPS baselines.
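For reference, the audio numbers above are mean reciprocal rank at cutoff 10. The sketch below shows how that metric is typically computed and works out the two reported gaps. The helper name `mrr_at_k` and the toy rankings are illustrative assumptions; only the four scores in `reported` come from the notes above.

```python
def mrr_at_k(ranked_ids_per_query, relevant_id_per_query, k=10):
    """Mean reciprocal rank of the single relevant item within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break  # queries with no hit in the top k contribute 0
    return total / len(ranked_ids_per_query)

# Toy rankings (placeholders): hit at rank 1, hit at rank 2, and a miss.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
relevant = ["a", "y", "q"]
toy_score = mrr_at_k(rankings, relevant)  # (1 + 1/2 + 0) / 3 = 0.5

# Gaps implied by the reported scores (native minus cascade).
reported = {
    "average": (73.99, 70.40),
    "cross_lingual": (72.56, 67.55),
}
gaps = {name: round(native - cascade, 2)
        for name, (native, cascade) in reported.items()}
```

The computed gaps are about 3.59 points on average and about 5.01 points in the cross-lingual setting.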

Key ideas

Native (early-fusion) multimodal embedding: one Gemini-initialized bidirectional transformer encodes text/image/video/audio and interleaved combinations, vs CLIP-style modality-specific late fusion
Encoder built by mean-pooling token embeddings, followed by a linear projection; a Matryoshka (MRL) multi-loss lets one model serve 768/1536/3072-dim embeddings
Multi-stage recipe: Pre-Fine-Tuning (noisy image/text/code pairs, large batches) then Fine-Tuning (hard-negative triplets, per-task batch/sampling tuning) then Model Soup checkpoint averaging
Contrastive NCE loss with in-batch negatives, optional hard negatives, a same-query/label masking term, and randomly dropped task-string prefixes for robustness
Native audio retrieval beats an ASR-then-encode cascade (73.99 vs 70.40 MRR@10 average), with a larger cross-lingual gain (72.56 vs 67.55) because the embedding avoids hard transcription errors
Strong cross-modal retrieval (62.9 MSCOCO text-to-image R@1, 91.2 text-to-image mean, 63.1 image-to-text mean) and a +15.8-pt synthetic-data gain on selected MTEB Code tasks; zero-shot domain transfer (astronomy, microscopy, art, recipes) far more stable than CLIP/SigLIP2/TIPS
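The Model Soup step listed above amounts to averaging parameters across checkpoints. A minimal sketch follows, assuming uniform weights and checkpoints stored as name-to-list-of-floats dictionaries. Both are simplifications: the report may weight or select checkpoints differently, and the parameter names here are hypothetical.

```python
def model_soup(checkpoints):
    """Uniformly average same-named parameters across checkpoints."""
    soup = {}
    for name in checkpoints[0]:
        tensors = [ckpt[name] for ckpt in checkpoints]
        # zip(*tensors) walks the parameter element by element across runs.
        soup[name] = [sum(values) / len(values) for values in zip(*tensors)]
    return soup

# Two toy checkpoints with hypothetical parameter names.
run_a = {"proj.weight": [1.0, 2.0], "proj.bias": [0.0]}
run_b = {"proj.weight": [3.0, 4.0], "proj.bias": [1.0]}

averaged = model_soup([run_a, run_b])
# averaged == {"proj.weight": [2.0, 3.0], "proj.bias": [0.5]}
```

Averaging only makes sense when the runs share an architecture and initialization, which holds here because every run starts from the same backbone.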

Takeaways for my work

For MIRROR's RAG/serving layer, a unified multimodal embedder removes per-modality encoder plumbing; the mean-pool + linear-projection + MRL pattern is a low-cost way to ship one model at multiple dim budgets for latency/cost tradeoffs
The native-audio-beats-cascade result is a transferable argument: avoid lossy preprocessing bottlenecks (e.g., discretizing or transcribing a signal) when a single encoder can preserve the raw representation; relevant to spatiotemporal/sensor pipelines
The Pre-Fine-Tuning to Fine-Tuning to Model Soup recipe plus the synthetic-data ablation is a reusable blueprint for training a general-purpose embedder where labeled data is scarce, which mirrors Arun's rare-anomaly setting
Cross-domain stability (no sharp peaks/valleys vs baselines) is the metric to watch for a trustworthy embedding space; a useful evaluation framing for anomaly/retrieval reliability claims
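One way to turn the cross-domain stability point into a number is to report the spread and worst case of per-domain scores alongside the mean. The sketch below is an assumption about how such a summary could look, not a metric defined in the report. The helper name `stability_summary` is hypothetical, and the per-domain scores are made-up placeholders, not results from the paper.

```python
from statistics import mean, pstdev

def stability_summary(per_domain_scores):
    """Summarize how evenly a model performs across domains."""
    values = list(per_domain_scores.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),           # population standard deviation
        "worst": min(values),
        "range": max(values) - min(values),
    }

# Placeholder scores (NOT from the report): same mean, different spread.
steady = {"microscopy": 0.5, "astronomy": 0.5, "fine_art": 0.5, "culinary": 0.5}
peaky = {"microscopy": 0.9, "astronomy": 0.1, "fine_art": 0.9, "culinary": 0.1}

steady_stats = stability_summary(steady)
peaky_stats = stability_summary(peaky)
```

Both toy models have the same mean, but the second has a much larger standard deviation and a far lower worst-case score, which is the kind of peak-and-valley profile that a mean alone would hide.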

multimodal embeddings · contrastive learning · retrieval / RAG · foundation models · ML systems