WMT 2024
Assessing the Role of Imagery in Multimodal Machine Translation
+7% image-grounding score

We design imagery-sensitive contrastive metrics for multimodal machine translation and apply them to state-of-the-art architectures used at WMT 2024. The study shows that translations degrade when the paired image contradicts the caption, indicating that the models depend on visual evidence rather than treating images as mere regularizers. We release the evaluation harness and curated counterfactual splits to help teams audit multimodal MT deployments.
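The contrastive setup can be pictured with a short sketch: translate each caption once with its paired image and once with a contradicting one, and treat the quality drop as the grounding signal. This is a minimal illustration, not the released harness API; the `translate` and `score` callables and the `ProbeItem` fields are hypothetical placeholders, assuming a sentence-level quality metric.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class ProbeItem:
    source: str         # source-language caption
    reference: str      # reference translation
    image: str          # path to the paired (congruent) image
    counter_image: str  # path to a contradicting (counterfactual) image

def grounding_gap(
    translate: Callable[[str, str], str],  # (source, image_path) -> hypothesis
    score: Callable[[str, str], float],    # (hypothesis, reference) -> quality
    items: Sequence[ProbeItem],
) -> float:
    """Mean quality drop when the paired image is swapped for a contradicting one.

    A gap near zero suggests the system ignores the image; a clearly positive
    gap suggests its translations depend on the visual evidence.
    """
    gaps = []
    for item in items:
        congruent = score(translate(item.source, item.image), item.reference)
        contradicted = score(translate(item.source, item.counter_image), item.reference)
        gaps.append(congruent - contradicted)
    return sum(gaps) / max(len(gaps), 1)
```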
Highlights
- Introduced imagery-aware contrastive probes that isolate whether translations truly reference the paired visual context.
- Benchmarked nine multimodal MT systems, showing that genuine visual grounding persists across high-variance evaluation splits.
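As a loose illustration of that benchmarking step, the sketch below runs a probe over every system and split and renders a Markdown comparison table. The `Probe` signature and the placeholder handle types are assumptions for the sketch, not the released harness interface.

```python
from typing import Callable, Dict, Sequence

# Hypothetical probe signature: (system handle, probe items) -> mean grounding gap.
Probe = Callable[[object, Sequence], float]

def render_comparison(
    systems: Dict[str, object],   # system name -> model handle (placeholder type)
    splits: Dict[str, Sequence],  # split name -> probe items for that split
    probe: Probe,
) -> str:
    """Render a Markdown table of grounding gaps, one row per system."""
    names = list(splits)
    lines = [
        "| system | " + " | ".join(names) + " |",
        "|" + "---|" * (len(names) + 1),
    ]
    for sys_name, system in systems.items():
        cells = [f"{probe(system, splits[n]):.3f}" for n in names]
        lines.append(f"| {sys_name} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Wrapping `grounding_gap` from the earlier sketch as the probe would let a comparison like the one in the structured reports be produced with a single `render_comparison(systems, splits, probe)` call.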
Artifacts & reproduction
Evaluation harness for multimodal MT, selective LLM routing, and visual-text calibration experiments.
- Run imagery-aware contrastive probes against WMT-style checkpoints.
- Benchmark LLM reject-option heads on held-out OOD prompts (a selective-metrics sketch follows this list).
- Log structured reports (HTML/Markdown) for rapid model comparisons.
- Used to benchmark LLM reject-option heads and multimodal MT models for AFRL and academic deployments.
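For the reject-option benchmarking step above, a minimal sketch of selective-prediction metrics on held-out OOD prompts follows; the `predict` callable and the exact-match check are hypothetical stand-ins for whatever answer-matching the harness actually uses.

```python
from typing import Callable, Sequence, Tuple

def selective_metrics(
    predict: Callable[[str], Tuple[str, float]],  # prompt -> (answer, confidence)
    prompts: Sequence[str],
    references: Sequence[str],
    threshold: float,
) -> Tuple[float, float]:
    """Coverage and selective accuracy for a reject-option head.

    Prompts whose confidence falls below `threshold` are rejected (the model
    abstains); accuracy is computed only over the accepted subset.
    """
    accepted = correct = 0
    for prompt, ref in zip(prompts, references):
        answer, confidence = predict(prompt)
        if confidence < threshold:
            continue  # abstain: not counted toward accepted predictions
        accepted += 1
        correct += int(answer.strip() == ref.strip())
    coverage = accepted / max(len(prompts), 1)
    selective_accuracy = correct / max(accepted, 1)
    return coverage, selective_accuracy
```

Sweeping `threshold` over a grid traces the risk-coverage curve commonly used to compare reject-option heads.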