Evaluating Machine Translation for Municipal Use Cases

Written by Meng Hui (Kiara) Liu

The Challenge

New York City is home to speakers of over 800 languages. According to data from the American Community Survey, as of 2023, about 30% of New Yorkers speak a language other than English at home, and around 1.8 million have limited English proficiency (LEP). To ensure that New Yorkers of varying English abilities have access to crucial government information, NYC’s Local Law 30 of 2017 requires that covered city agencies “appoint a language access coordinator, develop language access implementation plans, provide telephonic interpretation in at least 100 languages, translate their most commonly distributed documents into the 10 designated citywide languages, and post signage about the availability of free interpretation services, among other requirements.”

City agencies often rely on external vendors for translation, as managing translation in-house requires substantial staff time and linguistic expertise. However, no standard exists for evaluating translation vendors, which leads to inconsistent quality. Moreover, given recent advancements in large language models, translation vendors frequently offer services that incorporate machine translation into their pipelines.

As a Siegel PiTech PhD Impact Fellow this summer, I worked with the Research and Collaboration team at the New York City Office of Technology and Innovation (OTI) to develop a practical framework for assessing machine translation. Even though a wealth of academic research exists on machine translation quality evaluation (MTQE), it does not necessarily translate well (pun intended) into practical settings such as the City’s use cases. Overly technical metrics may be too difficult for language access coordinators and in-house linguists to implement. With this in mind, I evaluated both qualitative and quantitative methods and provided actionable recommendations to guide agencies’ evaluation of machine translation vendors.

My Project

As a translator myself, I was excited to work on this project! Human translators tend to evaluate translations in a much more nuanced way compared to computer scientists. Since any framework that I recommended would be used by human linguists and translators, I knew that it should resonate with how they naturally tend to conduct the evaluations. 

I started my project with a literature review of qualitative MTQE methods. These range from the classic ideas of dynamic and formal equivalence, to the Chinese 信达雅 (“fidelity, expressiveness, elegance”) triad, to evaluation checklists used in translator training. I also reviewed translation evaluation guidelines used in other cities.  

To make sure that my framework was scalable and not overly subjective, I also referenced quantitative methods of translation evaluation. These can be further broken down into manual quantitative methods (such as human-scored axes and rubrics) and automatic quantitative methods (such as word error rate, BLEU, or LLM-based methods).
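To give a sense of what the automatic methods look like in practice, here is a minimal Python sketch, assuming the sacrebleu and jiwer packages and made-up example sentences, of how BLEU and word error rate compare a machine translation against a trusted human reference. It is an illustration only, not part of the recommended framework.

import sacrebleu
import jiwer

# Illustrative machine output and human reference translation (invented examples).
machine_output = ["Free interpretation services are available at this office."]
human_reference = ["Interpretation services are available at this office free of charge."]

# BLEU measures n-gram overlap with the reference (0-100, higher is better).
bleu = sacrebleu.corpus_bleu(machine_output, [human_reference])
print(f"BLEU: {bleu.score:.1f}")

# Word error rate counts the word-level edits needed to match the reference (lower is better).
wer = jiwer.wer(human_reference[0], machine_output[0])
print(f"WER: {wer:.2f}")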

In the end, I developed an evaluation rubric with five categories: 

  • Content accuracy: whether all factual information from the original text is preserved in the translation 

  • Tone & formality: whether the translation matches the original text’s tone and level of formality 

  • Readability: whether the translation flows naturally and is easy to read 

  • Presentation: whether formatting, document structure, and markdown elements are preserved 

  • Respect: whether respectful language is used consistently throughout the translation

In my rubric, categories could be scored on a 4-point scale: 

  1. Good: No errors 

  2. Needs improvement: A few errors are present, but they can be easily identified and corrected 

  3. Poor: Errors are too numerous or challenging to correct 

  4. Catastrophic: One or more errors would actively cause harm, either to LEP users or to the City’s reputation

Moving forward, the framework will continue to evolve depending on specific use cases. I recommend that agencies work with their language staff to determine which categories should be prioritized for their needs and what the cutoff score should be. For instance, a tool used to translate social media posts and news headlines may be held to different standards than one used for brochures containing immigration resources and guidance. OTI will also be adapting this framework to evaluate machine interpretation technology, which presents its own set of challenges and priorities.
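As a purely hypothetical illustration of how an agency might operationalize the rubric, the Python sketch below re-encodes the five categories and the 4-point scale as a small data structure. The category weights and cutoff value are placeholders that language staff would set for their own use case, not recommended numbers.

# Hypothetical sketch only: the weights and cutoff below are illustrative placeholders.
# The rubric labels ratings 1 (Good) through 4 (Catastrophic); here they are
# re-encoded so that higher numbers are better, which makes weighting simpler.
SCALE = {"good": 3, "needs_improvement": 2, "poor": 1, "catastrophic": 0}

# Example prioritization an agency might choose for immigration brochures.
CATEGORY_WEIGHTS = {
    "content_accuracy": 0.35,
    "tone_and_formality": 0.15,
    "readability": 0.20,
    "presentation": 0.10,
    "respect": 0.20,
}

CUTOFF = 2.5  # hypothetical minimum weighted score to accept a translation


def weighted_score(ratings):
    """Combine per-category ratings into a single weighted score on the 0-3 scale."""
    return sum(SCALE[ratings[cat]] * weight for cat, weight in CATEGORY_WEIGHTS.items())


def passes(ratings):
    """A single catastrophic rating fails the translation regardless of the average."""
    if "catastrophic" in ratings.values():
        return False
    return weighted_score(ratings) >= CUTOFF


sample = {
    "content_accuracy": "good",
    "tone_and_formality": "needs_improvement",
    "readability": "good",
    "presentation": "needs_improvement",
    "respect": "good",
}
print(weighted_score(sample), passes(sample))  # weighted score ≈ 2.75, passes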

Impact and Path Forward

The final deliverables for my project included: 

  • An MTQE rubric to recommend to agencies, along with a supplementary document explaining the rubric and outlining best practices 

  • A report that summarized research on automatic MTQE methods 

  • A presentation to an AI community of practice for city employees, covering recommendations for evaluating machine translation tools 

  • A resource library of model translations to support evaluation work 

Meng Hui Liu (Kiara), Ph.D. Student, Information Science, Cornell University

My work this summer has given NYC agencies and language access staff an overview of different MTQE methods and practical considerations for how to conduct MTQE for their specific needs. My deliverables will be used to inform future policy recommendations and guide evaluation best practices going forward.
