Presentation overview
This is a transcribed version of my presentation introducing a new metric system for evaluating and enhancing literary machine translation (MT). This system addresses the current limitations of MT metrics, which often miss the nuances essential to literary works. Inspired by the ScandEval project, our approach uses automated benchmarking and visualization tools to provide consistent and objective assessments.
Key proposed metrics
Grammatical Accuracy (GA)
Fluency Score (FS)
Semantic Accuracy (SA)
Stylistic Coherence (Voice Fidelity) (SCVF)
Plot Coherence (PC)
Cultural Sensitivity (CS)
By combining automation with human expertise, we aim to improve translation quality and advocate for fair compensation for translators, regardless of translation method.
The presentation was given at the symposium "Lost in Transl:AI:tion: Implications of machine translation for communication and comprehension", 30 May 2024 at University of Copenhagen, Department of English, Germanic and Romance Studies.
Transcribed presentation
The following text is transcribed audio (using OpenAI) from my rehearsal presentation, cleaned into readable notes with subheadings added for overview.
Title:
Rethinking Metrics for Evaluating and Enhancing Literary Machine Translation
Introduction
Hi everyone. My name is Tascha, and I’m delighted to be here today to discuss my work in the field of literary translation.
Purpose of the Presentation
The goal of today’s presentation is to introduce a new proof-of-concept metric system designed to evaluate and improve literary machine translation (MT). We’ll discuss the current challenges, introduce the proposed metrics, go through the methodology, review case studies, and conclude with some thought-provoking insights.
Background
To give you a bit of background about myself: I'm a certified translator, a lexicographer specializing in natural language processing (NLP), and I've worked extensively with data and metadata in the book publishing industry.
Four years ago, I moved into trade book publishing and was both amazed and surprised by the industry's lack of innovation—or rather, the reluctance to openly discuss technological advancements and how they affect the book industry.
Economic Initiatives vs. Quality
Economic initiatives alone are not enough to justify reducing "real translations." While cost-saving measures and efficiency gains from machine translation are appealing, they must not come at the expense of quality, cultural integrity, and the nuanced artistry that human translators bring to literary works.
However, there are areas where machine translation can be the right solution, such as providing sample translations in the rights acquisition phase. This can streamline the process, allowing publishers to quickly assess the potential of a work before committing to a full human-translated version.
Aim of the New Metric System
My aim is not to make the translator redundant but to provide opportunities that help translators survive and thrive in a world where technology is often applied without considering the consequences of removing the human element from literary translation.
Additionally, I want to give publishing houses a realistic and well-crafted tool to help them make the right decisions when bringing translated books to the Danish market.
Framework for Collaboration
The framework also facilitates transparent discussions with translators, setting realistic expectations and budgets. Ultimately, it helps achieve better translations while fostering a collaborative environment between publishers and translators.
Current Challenges in Literary Machine Translation
Balancing time, money, and quality is a significant challenge in literary translation. The process is inherently time-consuming, often requiring months of meticulous work to ensure accuracy and fidelity to the original text.
High-quality translations can also be expensive, as skilled human translators need to be appropriately compensated for their expertise and effort.
Limitations of Existing MT Evaluation Metrics
Let’s begin by addressing the limitations of existing MT evaluation metrics.
Metrics like BLEU, WER, and TER primarily focus on surface-level accuracy.
BLEU measures the overlap of n-grams between the translation and a reference text.
WER calculates the number of word-level edits needed to match the reference.
TER assesses the number of edits required to change the translation into the reference text.
While they are useful in other contexts, they fall short when applied to literary translation because they neglect stylistic nuances, tone, and cultural subtleties, which are essential to preserving the essence of literary works.
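To make concrete what these metrics actually measure, here is a minimal Python sketch using the sacrebleu and jiwer libraries; the sentence pair is invented purely for illustration, and note that none of these scores says anything about style, tone, or cultural nuance.

```python
# Minimal sketch of the surface-level metrics above, using the sacrebleu and
# jiwer libraries (pip install sacrebleu jiwer). The sentence pair is invented
# purely for illustration.
import sacrebleu
import jiwer

reference = ["The old house stood silent at the edge of the forest."]
hypothesis = ["The old house was silent at the forest's edge."]

bleu = sacrebleu.corpus_bleu(hypothesis, [reference])  # n-gram overlap
ter = sacrebleu.corpus_ter(hypothesis, [reference])    # edits needed to reach the reference
wer = jiwer.wer(reference[0], hypothesis[0])           # word-level edit rate

print(f"BLEU: {bleu.score:.1f}")  # 0-100, higher is better
print(f"TER:  {ter.score:.1f}")   # edit rate as a percentage, lower is better
print(f"WER:  {wer:.2f}")         # edit rate, lower is better
```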
Need for a Specialized Approach
Literary translation requires a specialized approach. It's not just about translating words; it’s about capturing the author’s unique voice and maintaining cultural context. It’s about recreating the literary work in a new cultural context.
Inspiration from ScandEval
I’d like to highlight how the ScandEval project has inspired our approach.
ScandEval is a framework that evaluates language models by benchmarking them against a variety of linguistic tasks such as syntactic parsing, named entity recognition, and sentiment analysis, providing objective and repeatable assessments.
Inspired by ScandEval's method of using automated benchmarking to provide objective and repeatable evaluations, we developed similar automated metrics for literary translation. This approach minimizes subjectivity and ensures consistent assessment standards.
Additionally, ScandEval's effective visualization techniques, such as radar charts (also known as spider plots), inspired us to incorporate visual tools to represent our evaluation metrics. This not only makes the data more accessible but also helps in identifying strengths and weaknesses at a glance.
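As an illustration, here is a minimal matplotlib sketch of the kind of radar chart used to show the six metric scores side by side; the scores are placeholder values, not results from an actual evaluation.

```python
# Minimal radar-chart sketch for visualizing the six proposed metrics.
# The scores below are placeholder values for illustration, not real results.
import numpy as np
import matplotlib.pyplot as plt

metrics = ["GA", "FS", "SA", "SCVF", "PC", "CS"]
scores = [0.82, 0.78, 0.80, 0.69, 0.88, 0.73]  # hypothetical 0-1 scores

# Close the polygon by repeating the first value at the end.
angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(angles, values, linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)
ax.set_title("Literary MT evaluation profile (placeholder scores)")
plt.show()
```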
Case Studies
Since much of my work is done under NDAs, I cannot share specific books or cases. However, I can share the general framework and findings from my experience.
Since April last year, I have worked with seven publishing houses on 44 books, 37 of which have been published to date, with the last ones scheduled for publication next month.
I have directly translated 7 of these books and served as a consultant for the rest, helping publishers establish workflows and guide their translators in enhancing MT output through the metrics approach.
The results have shown significant improvements in scores across all metrics, demonstrating enhanced quality and fidelity of the translated texts.
Proposed Metrics for Literary MT
Many of these techniques are already used in NLP-based editing tasks. I am now transferring them to the context of translated books to enhance the quality and consistency of literary translations.
Not all aspects have been automated or have reached peak performance yet, owing to a lack of data, especially when working with copyrighted material such as books. Purely textual elements such as grammar are easier to automate because they don't require copyrighted content, and we have access to excellent open-access libraries for performing these tasks with NLP.
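As one example of such an open library, the sketch below uses the LanguageTool wrapper for Python to approximate a grammatical check; the scoring formula is a simplified assumption for illustration, not the exact Grammatical Accuracy metric.

```python
# Simplified sketch of an automated grammatical check using the open-source
# LanguageTool wrapper (pip install language-tool-python). The scoring formula
# is an illustrative assumption, not the exact Grammatical Accuracy metric,
# and rule coverage varies by language.
import language_tool_python

def rough_grammatical_accuracy(text: str, lang: str = "da-DK") -> float:
    """Return a rough 0-1 score; 1.0 means no grammar issues were flagged."""
    tool = language_tool_python.LanguageTool(lang)
    issues = tool.check(text)
    words = max(len(text.split()), 1)
    return max(0.0, 1.0 - len(issues) / words)  # one penalty point per flagged issue

sample = "Huset lå stille i udkanten af skoven."  # invented example sentence
print(f"GA (rough): {rough_grammatical_accuracy(sample):.2f}")
```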
New Set of Metrics
To address these gaps, I propose a new set of metrics specifically designed for literary MT (a brief code sketch of how they can be represented follows the list):
Grammatical Accuracy (GA): Measures the correctness of grammatical structures in the translation.
Fluency Score (FS): Assesses the naturalness and readability of the text, ensuring it flows smoothly.
Semantic Accuracy (SA): Evaluates how accurately the translation conveys the meaning and intent of the source text.
Stylistic Coherence (Voice Fidelity) (SCVF): Ensures that the author’s style and voice are consistent throughout the translation.
Plot Coherence (PC): Checks the logical flow and consistency of the narrative.
Cultural Sensitivity (CS): Examines how well the translation handles cultural references and context.
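Below is a minimal sketch of how these six metrics could be represented programmatically; the field names mirror the abbreviations above, and the equal weighting in the overall score is an assumption for illustration.

```python
# Minimal sketch of a container for the six proposed metrics. Equal weighting
# in the overall score is an illustrative assumption, not a fixed choice.
from dataclasses import dataclass, asdict

@dataclass
class LiteraryMTScores:
    grammatical_accuracy: float   # GA
    fluency_score: float          # FS
    semantic_accuracy: float      # SA
    stylistic_coherence: float    # SCVF (voice fidelity)
    plot_coherence: float         # PC
    cultural_sensitivity: float   # CS

    def overall(self) -> float:
        """Unweighted average of the six scores (assumed weighting)."""
        values = list(asdict(self).values())
        return sum(values) / len(values)

scores = LiteraryMTScores(0.88, 0.81, 0.84, 0.66, 0.90, 0.71)
print(f"Overall: {scores.overall():.2f}")
```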
Methodology
Now, let's talk about how we apply these metrics in practice.
The methodology is most effective as a translator tool when applied at different levels, distinguishing between book level and chapter level, which ensures consistently high-quality translation throughout the entire work.
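As a minimal sketch (with invented chapter scores), chapter-level results can be rolled up to book level while keeping the per-metric minimum visible, so a weak chapter is not hidden by a strong average.

```python
# Sketch of rolling invented chapter-level scores up to book level. Keeping the
# per-metric minimum alongside the mean ensures a weak chapter is not hidden by
# a strong book-level average.
from statistics import mean

chapter_scores = {  # hypothetical per-chapter results on a 0-1 scale
    "ch01": {"GA": 0.91, "FS": 0.84, "SA": 0.88, "SCVF": 0.72, "PC": 0.90, "CS": 0.81},
    "ch02": {"GA": 0.89, "FS": 0.79, "SA": 0.83, "SCVF": 0.68, "PC": 0.86, "CS": 0.77},
}

book_level = {
    metric: {
        "mean": round(mean(ch[metric] for ch in chapter_scores.values()), 3),
        "min": min(ch[metric] for ch in chapter_scores.values()),
    }
    for metric in ["GA", "FS", "SA", "SCVF", "PC", "CS"]
}
print(book_level)
```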
Benchmarking and Initial Assessment
We start by establishing a minimum acceptable score of 75% for each metric. This score serves as our benchmark for quality. The benchmark can be adjusted to fit each publisher's own quality requirements.
The evaluation method is both technology-neutral and method-neutral, meaning it can be applied regardless of the tools or techniques used in the translation process. It is designed to measure the quality of the text at any given stage, whether it's a draft or a near-final version. Use it initially to identify areas that need focus and improvement and apply it again at the end to ensure the text meets quality criteria.
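As a minimal sketch with invented scores, an initial benchmark pass simply flags the metrics that fall below the (adjustable) 75% threshold, and those become the focus areas for the next stage.

```python
# Sketch of an initial benchmark pass with invented scores. The threshold
# defaults to the 75% benchmark mentioned above but is adjustable per publisher.
def flag_weak_metrics(scores: dict[str, float], threshold: float = 0.75) -> list[str]:
    """Return the metrics whose scores fall below the benchmark."""
    return [metric for metric, score in scores.items() if score < threshold]

draft_scores = {"GA": 0.88, "FS": 0.81, "SA": 0.84, "SCVF": 0.66, "PC": 0.90, "CS": 0.71}
print(flag_weak_metrics(draft_scores))  # ['SCVF', 'CS'] -> focus areas for post-editing
```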
While I use this method to talk about literary machine translation, it is equally applicable to human translations, providing a consistent framework for assessing translation quality across different approaches.
Example: Machine Translation Post-Editing (MTPE)
Human translators refine the MT text based on these scores, focusing on areas identified as weak in the initial assessment. After post-editing, we re-evaluate the text using the same six metrics to ensure it meets or exceeds our 75% benchmark. Finally, we confirm the quality of the translation and provide detailed feedback to translators, which helps in continuous improvement.
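A small sketch of that re-evaluation step, again with invented numbers: comparing scores before and after post-editing and confirming that every metric now meets the benchmark.

```python
# Sketch of the re-evaluation step after post-editing, with invented scores.
benchmark = 0.75
before = {"GA": 0.88, "FS": 0.81, "SA": 0.84, "SCVF": 0.66, "PC": 0.90, "CS": 0.71}
after  = {"GA": 0.93, "FS": 0.87, "SA": 0.89, "SCVF": 0.79, "PC": 0.92, "CS": 0.82}

for metric in before:
    delta = after[metric] - before[metric]
    status = "OK" if after[metric] >= benchmark else "BELOW BENCHMARK"
    print(f"{metric}: {before[metric]:.2f} -> {after[metric]:.2f} ({delta:+.2f}) {status}")

print("Benchmark met for all metrics:", all(s >= benchmark for s in after.values()))
```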
Conclusion
To summarize, specialized metrics for literary MT are crucial for preserving the integrity of literary works. Our methodology combines automated evaluation with human expertise, leading to significant improvements in translation quality.
Automating the evaluation process, inspired by ScandEval, is what makes this tool efficient and easy to apply. Automation allows for consistent, objective assessments at any stage of the translation process, ensuring high-quality results without significantly increasing the workload for translators and editors.
Advocating for Fair Compensation
On the translator side, there's a growing demand to deliver high-quality work at low pay, which is an unsustainable expectation. My hope is that by providing a clear method and framework for discussing and evaluating literary quality, it will become easier for translators to navigate this challenging landscape.
This framework can help translators set clear boundaries and demonstrate to publishers that price and quality are inherently connected and cannot be addressed solely by applying AI. By making the evaluation process transparent and quantifiable, translators can better advocate for fair compensation that reflects the true value of their work.
Balancing Quality and Budget
From the perspective of publishing houses, balancing high-quality translations with budget constraints is a significant challenge. My proposed evaluation framework provides a structured, objective method for assessing translation quality at various stages. This tool helps publishers make informed decisions about investing resources in human translators or refining machine-generated drafts.
The framework shows that relying solely on AI may not achieve the desired quality. Instead, a balanced approach that integrates both machine translation and human expertise is necessary. As noted earlier, the framework also facilitates transparent discussions with translators, setting realistic expectations and budgets, and ultimately helps achieve better translations while fostering a collaborative environment between publishers and translators.
Looking Ahead
Looking ahead, there is great potential for refining and expanding this metric system. We encourage further research and collaboration to continue advancing the field.
Thank you.