WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

G. Winata, F. Hudi, P. A. Irawan, D. Anugraha, R. A. Putri et al. (60+ authors; senior authors incl. D. I. Adelani, A. Oh, A. F. Aji, T. Watanabe, C.-W. Ngo)

arXiv 2024 (v5, May 2025); NAACL 2025 · 2025 · ★★½☆☆ 2.5/5

My reading notes

Why it matters

Tangential to Arun's core spatial/physics-ML work, but its adversarial-context protocol (feeding a wrong location to flip a model's prediction) and its native-speaker-verified data-curation pipeline are useful reference points for trustworthy multimodal evaluation and for building benchmarks with reliable ground truth.

Summary

WorldCuisines introduces a large multilingual, multicultural visual question answering (VQA) benchmark built around food as a proxy for culture. The authors assemble a curated knowledge base of 2,414 dishes with 6,045 Wikimedia-licensed images and rich metadata (coarse and fine categories, cuisine, location, regional cuisine), then construct a parallel VQA corpus of roughly one million question-image pairs spanning 30 language varieties (23 languages, 7 of them with two varieties each) across nine language families and 189 countries. Two tasks are posed: predicting a dish's name, and predicting the location where it is commonly eaten and where it originated. Each task comes in multiple-choice and open-ended answer formats. Distractors for the multiple-choice items are mined by embedding each dish's name plus description with a multilingual text-embedding model and comparing dishes by cosine similarity.
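To make the distractor-mining step concrete for myself, here is a minimal sketch. It assumes the embeddings are already computed and that the most similar dishes are the ones picked as distractors; the paper's exact selection rule may differ. The function names and the toy three-dimensional vectors are hypothetical stand-ins, not the authors' code or real model output.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def mine_distractors(embeddings, target, k=3):
    """Return the k dishes most similar to `target`, excluding `target` itself.

    `embeddings` maps a dish name to the embedding of its name plus description.
    Assumption: the nearest neighbours are used as distractors.
    """
    target_vec = embeddings[target]
    scored = [
        (cosine_similarity(target_vec, vec), name)
        for name, vec in embeddings.items()
        if name != target
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy vectors standing in for real text embeddings (made up for illustration).
toy_embeddings = {
    "rendang": [0.9, 0.1, 0.0],
    "gulai": [0.8, 0.2, 0.1],
    "beef curry": [0.7, 0.3, 0.0],
    "tiramisu": [0.0, 0.1, 0.9],
    "panna cotta": [0.1, 0.0, 0.8],
}

distractors = mine_distractors(toy_embeddings, "rendang", k=2)
print(distractors)
```

With these toy vectors the two savory dishes rank above the desserts, which is the point of the approach: distractors that are plausible rather than trivially wrong.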

A notable design element is the three-way query design for dish-name prediction: no-context, contextualized (helpful), and adversarial. In the adversarial setting the prompt injects a misleading location or cuisine, testing whether a model anchors on the image or gets steered by the misleading text. Translation is done by native speakers, who handle morphological inflection with care (for example case marking on place names) and prioritize naturalness.
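A small sketch of how the three query variants could be built for one image. The template wording, the function name `build_queries`, and the example locations are my own assumptions for illustration; the benchmark's actual prompts are written and translated by native speakers and are not reproduced here.

```python
BASE_QUESTION = "What is the name of this dish?"


def build_queries(true_location, misleading_location):
    """Build the no-context, contextualized, and adversarial prompts.

    Hypothetical English templates: the contextualized prompt states the
    correct location, the adversarial prompt states a wrong one.
    """
    return {
        "no_context": BASE_QUESTION,
        "contextualized": (
            f"This dish is commonly eaten in {true_location}. {BASE_QUESTION}"
        ),
        "adversarial": (
            f"This dish is commonly eaten in {misleading_location}. {BASE_QUESTION}"
        ),
    }


queries = build_queries(true_location="Indonesia", misleading_location="Italy")
for condition, prompt in queries.items():
    print(f"{condition}: {prompt}")
```

The same image is paired with all three prompts, so any change in the model's answer can be attributed to the text alone.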

The authors evaluate 18 VLMs (15 open-source, 3 proprietary). Multiple-choice accuracy varies widely across models, while the open-ended format is much harder, especially for dish names without context. GPT-4o leads overall and Llama 3.2 Instruct is the strongest open model. Helpful context reliably boosts accuracy, whereas adversarial context reliably degrades it, indicating that models lean on textual cues over visual evidence. Performance is weaker for underrepresented and non-Latin-script languages. Larger models generally do better, a scaling trend that is most visible within the open-source families. The training split (1M), two evaluation splits (12k and 60k), code, leaderboard, and knowledge base are released publicly on HuggingFace and GitHub.
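To compare the context conditions on my own runs, a sketch like the following would do. The record fields (`condition`, `prediction`, `answer`) are a hypothetical schema, exact-match scoring is an assumption that fits multiple-choice but not necessarily the paper's open-ended scoring, and the toy records are made up, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Exact-match accuracy per query condition.

    Each record is a dict with keys `condition`, `prediction`, and `answer`
    (a hypothetical schema chosen for this sketch).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        condition = record["condition"]
        total[condition] += 1
        correct[condition] += int(record["prediction"] == record["answer"])
    return {condition: correct[condition] / total[condition] for condition in total}


# Made-up predictions for illustration only.
toy_records = [
    {"condition": "no_context", "prediction": "rendang", "answer": "rendang"},
    {"condition": "no_context", "prediction": "gulai", "answer": "rendang"},
    {"condition": "contextualized", "prediction": "rendang", "answer": "rendang"},
    {"condition": "contextualized", "prediction": "rendang", "answer": "rendang"},
    {"condition": "adversarial", "prediction": "osso buco", "answer": "rendang"},
    {"condition": "adversarial", "prediction": "osso buco", "answer": "rendang"},
]

scores = accuracy_by_condition(toy_records)
print(scores)
```

The gap between the contextualized and adversarial scores is the quantity of interest: a large gap suggests the model follows the text rather than the image.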

Key ideas

Takeaways for my work

vision-language models · multilingual benchmark · VQA · adversarial robustness · cultural NLP