Evaluating Machine Translation for Municipal Use Cases
Written by Meng Hui (Kiara) Liu
The Challenge
New York City is home to speakers of over 800 languages. According to data from the American Community Survey, as of 2023, about 30% of New Yorkers speak a language other than English at home, and around 1.8 million have limited English proficiency (LEP). To ensure that New Yorkers of varying English abilities have access to crucial government information, NYC’s Local Law 30 of 2017 requires that covered city agencies “appoint a language access coordinator, develop language access implementation plans, provide telephonic interpretation in at least 100 languages, translate their most commonly distributed documents into the 10 designated citywide languages, and post signage about the availability of free interpretation services, among other requirements.”
City agencies often rely on external vendors for translation, as managing translation in-house requires substantial staff time and linguistic expertise. However, there is no standard for evaluating translation vendors, which leads to inconsistent quality. Moreover, given recent advances in large language models, translation vendors increasingly offer services that incorporate machine translation into their pipelines.
As a Siegel PiTech PhD Impact Fellow this summer, I worked with the Research and Collaboration team at the New York City Office of Technology and Innovation (OTI) to develop a practical framework for assessing machine translation. Although a wealth of academic research exists on machine translation quality evaluation (MTQE), it does not necessarily translate well (pun intended) into practical settings such as the City’s use cases: overly technical metrics may be too difficult for language access coordinators and in-house linguists to implement. With this in mind, I evaluated both qualitative and quantitative methods and provided actionable recommendations to guide agencies’ evaluation of machine translation vendors.
My Project
As a translator myself, I was excited to work on this project! Human translators tend to evaluate translations in a much more nuanced way than computer scientists do. Since any framework I recommended would be used by human linguists and translators, I knew it should resonate with how they naturally conduct evaluations.
I started my project with a literature review of qualitative MTQE methods. These range from the classic ideas of dynamic and formal equivalence, to the Chinese 信达雅 (“fidelity, expressiveness, elegance”) triad, to evaluation checklists used in translator training. I also reviewed translation evaluation guidelines used in other cities.
To make sure that my framework was scalable and not overly subjective, I also referenced quantitative methods of translation evaluation. These can be further broken down into manual quantitative methods (such as human-scored axes and rubrics) and automatic quantitative methods (such as word error rate, BLEU, or LLM-based metrics).
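To make the automatic metrics concrete, here is a minimal sketch of scoring machine output against human reference translations. It assumes the sacrebleu and jiwer Python packages; the example sentences are placeholders rather than actual City documents.

```python
# Minimal sketch: scoring machine output against human reference translations
# with two common automatic metrics. Assumes: pip install sacrebleu jiwer
# The sentences below are placeholders, not actual City documents.
import sacrebleu
from jiwer import wer

references = [
    "Free interpretation services are available at this office.",
    "Please bring proof of address to your appointment.",
]
machine_output = [
    "Free interpretation services are offered at this office.",
    "Please bring a proof of address to your appointment.",
]

# Corpus-level BLEU: higher is better (0-100). sacrebleu expects a list of
# hypothesis strings and a list of reference lists.
bleu = sacrebleu.corpus_bleu(machine_output, [references])
print(f"BLEU: {bleu.score:.1f}")

# Word error rate: lower is better (0.0 means the word sequences match exactly).
print(f"WER: {wer(references, machine_output):.2f}")
```

Scores like these are cheap to compute at scale, but they only measure surface overlap with a reference, which is exactly why I paired them with the qualitative rubric below.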
In the end, I developed an evaluation rubric with five categories:
Content accuracy: whether all factual information from the original text is preserved in the translation
Tone & formality: whether the translation matches the original text’s tone and level of formality
Readability: whether the translation flows naturally and is easy to read
Presentation: whether formatting, document structure, and markdown elements are preserved
Respect: whether respectful language is used consistently throughout the translation
In my rubric, each category is scored on a 4-point scale:
Good: No errors
Needs improvement: There are a few errors, but they can be easily identified and corrected
Poor: Errors are too numerous or too difficult to correct
Catastrophic: One or more errors would actively create harm, either for LEP users or for the City’s reputation
Moving forward, the framework will continue to evolve depending on specific use cases. I recommend that agencies work with their language staff to determine which categories should be prioritized for their needs and what the cutoff score should be. For instance, a tool used to translate social media posts and news headlines may be held to different standards than one used for brochures containing immigration resources and guidance. OTI will also be adapting this framework to evaluate machine interpretation technology, which presents its own set of challenges and priorities.
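To illustrate how an agency might operationalize those choices, below is a minimal sketch of recording per-category ratings and applying category weights and a cutoff score. The scale values, weights, and cutoff are purely illustrative placeholders, not OTI recommendations.

```python
# Minimal sketch: converting per-category rubric ratings into a weighted score
# and applying an agency-chosen cutoff. All numbers are illustrative only.
SCALE = {"good": 3, "needs improvement": 2, "poor": 1, "catastrophic": 0}

# Hypothetical weights: an agency might weight content accuracy more heavily
# for high-stakes documents such as immigration guidance.
WEIGHTS = {
    "content accuracy": 0.35,
    "tone & formality": 0.15,
    "readability": 0.20,
    "presentation": 0.10,
    "respect": 0.20,
}

def weighted_score(ratings: dict) -> float:
    """Combine per-category ratings into a weighted score between 0 and 3."""
    return sum(WEIGHTS[cat] * SCALE[label] for cat, label in ratings.items())

def passes(ratings: dict, cutoff: float = 2.5) -> bool:
    """A single 'catastrophic' rating fails the translation outright."""
    if "catastrophic" in ratings.values():
        return False
    return weighted_score(ratings) >= cutoff

sample = {
    "content accuracy": "good",
    "tone & formality": "needs improvement",
    "readability": "good",
    "presentation": "good",
    "respect": "good",
}
print(weighted_score(sample), passes(sample))  # 2.85 True
```

One design choice worth noting in this sketch: a single "Catastrophic" rating fails the translation regardless of how well the other categories score, reflecting the rubric's emphasis on avoiding harm to LEP users and the City's reputation.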
Impact and Path Forward
The final deliverables for my project included:
An MTQE rubric to recommend to agencies, along with a supplementary document explaining the rubric and outlining best practices
A report that summarized research on automatic MTQE methods
A presentation on evaluating machine translation tools, delivered to an AI community of practice for city employees
A resource library for model translations, to support evaluation work
My work this summer has given NYC agencies and language access staff an overview of different MTQE methods and practical considerations for how to conduct MTQE for their specific needs. My deliverables will be used to inform future policy recommendations and to guide evaluation best practices.

Meng Hui Liu (Kiara)
Ph.D. Student, Information Science, Cornell University