A caption can be grammatically correct and still miss the mark entirely. A cultural reference that doesn't land, an idiomatic expression used in the wrong register, a description of visual content phrased in a way no native speaker would use: these are the failures that damage product credibility in a market, and they are invisible to anyone who doesn't live in that language.
Launching without validated caption-quality data meant risking failures that would surface only after the product was already in front of international users.
There was a second risk: governance. Evaluation data collected under inconsistent rubrics across 45 languages isn't comparable, and data that isn't comparable can't drive model improvement. The value of the evaluation depended entirely on every reviewer applying the same criteria.

