google-research/metricxPublic

NotificationsYou must be signed in to change notification settings
Fork25
Star133

License

Apache-2.0 license

133 stars 25 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
metricx23		metricx23
metricx24		metricx24
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Repository files navigation

MetricX

This is not an officially supported Google product.

This repository contains the code for running inference on MetricX-23 andMetricX-24 models, a family of models for automatic evaluation of translationsthat were proposed in the WMT Metrics Shared Task submissionsMetricX-23: The Google Submission to the WMT 2023 Metrics Shared TaskandMetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task.The models were trained inT5X andthen converted for use in PyTorch.

Available Models

There are 3 MetricX-24 models and 6 MetricX-23 models available on Hugging Facethat vary in the number of parameters. The MetricX-23 models further vary inwhether or not the model is reference-based or reference-free (also known asquality estimation, or QE), whereas the MetricX-24 models are all hybrid modelsthat can do both reference-based and QE inference. Finally, we also offerlower-precision (bfloat16) MetricX-24 model variants with a 50% lower memoryfootprint.

MetricX-24:

MetricX-24 (bfloat16):

MetricX-23:

We recommend using the XXL model versions for the best agreement with humanjudgments of translation quality, the Large versions for best speed, and theXL for an intermediate use case.

Changes to the WMT'24 Submission

The MetricX-24 models available here are most similar to the primary submissionto the WMT'24 Metrics Shared Task. They are initialized withmT5then fine-tuned on a combination of direct assessment and MQM data fromWMT'15-'22. However, we made a couple of small changes that make these modelsdifferent from the WMT'24 submissions.

First, the metric scores get automatically clipped at 0 and 25, to ensure theyare strictly in the [0, 25] range, as due to the nature of regression models,the scores could otherwise sometimes fall outside the range. Note that thisclipping is also present in the MetricX-23 models in this repository, but itwasn't in any of the official submissions to the WMT Metrics shared tasks.

Second, we included one additional type of synthetic training examples thatweren't ready in time for the official submission. These are examples of perfecttranslations of multi-sentence segments, generated from the MQM data fromWMT'20-'22. The purpose of this category of synthetic data is to reduce themodel's bias against longer translations when the source segment and/orreference are also long.

Changes to the WMT'23 Submission

Similarly to the MetricX-24 models in this repository, we made some changes tothe MetricX-23 models as well, which make these models different from the WMT'23submissions.

First, the models are trained to regress the actual MQM score rather than anormalized score between 0 and 1. That means the output from the MetricX-23models is a score in the range [0, 25] where lower is better (i.e., it predictsan error score).

Second, these models were trained with a larger variety of synthetic data thatmakes them more robust to translation edge cases like over- andundertranslation, described in more detail in the following section. Note thatan updated version of the synthetic data was later used for training MetricX-24and is described in detail in theMetricX-24 paper.

Synthetic Data

In order for our MetricX models to learn to identify certain types of badtranslations that are not sufficiently (or at all) represented in the regulartraining data, we created synthetic examples and mixed them in during training.The synthetic training data was generated from the DA datasets ranging fromWMT15 to WMT21 (~ 43 language pairs). In most cases, the synthetic examples havethe candidate translation manipulated so as to turn it into a bad translationwith a specific issue commonly unrecognized by learned metrics.

The table below provides an overview of the various failure modes that weconsidered, including brief descriptions of how we prepared the synthetic datato address them.

Failure mode	Synthetic example description
Undertranslation	Candidate translation with an arbitrary sentence removed (if multi-sentence); alternatively, candidate with a certain proportion of words removed from the end.
Overtranslation	Candidate translation duplicated (with space in between).
Fluent but unrelated translation	Arbitrary reference of a similar length from the dataset.
Gibberish	Text of a similar length as the reference, generated by sampling words from the reference translation vocabulary (built from all references in the data).
Missing punctuation	Reference translation with the end punctuation removed (11 punctuation symbols considered).
Latin instead of Chinese/Japanese or Hindi/Bengali punctuation	Candidate translation with the language-specific punctuation symbol at the end replaced with the Latin equivalent (e.g., "." instead of "。" or "।"); alternatively, the punctuation symbol is replaced with the Latin equivalent in the reference, keeping the correct one in the candidate.
Reference-matching translation	Reference translation copied as the candidate translation (unlike the rest of the synthetic data, these examples are meant to train the metric to predict a perfect score for candidates matching the reference).

Examples from the first 4 categories were assigned a label corresponding to theworst score on the given rating scale (e.g., 25 when mixed with MQM trainingdata), whereas the reference-matching translation examples are assigned the bestscore (e.g., 0 when used with MQM data). The missing/incorrect punctuationexamples were labeled with a score slightly worse than perfect.

Note that some of the synthetic datasets are only meaningful in thereference-based scenario, and we thus excluded them when training a QE variantof MetricX. These are the Latin-vs-special punctuation and thereference-matching translation examples.

Most of the synthetic training sets were created using stratified samplingacross target languages, taking 500 examples per target language. One exceptionis the missing punctuation set, which used a stratified sample across differentpunctuation symbols instead.

When training MetricX, a small proportion of the synthetic examples was mixedwith the regular training examples. During the first-stage fine-tuning on DAdata, each synthetic training set constituted between 0.1% and 1% of alltraining examples, whereas in the second-stage fine-tuning on MQM data we usedan even smaller proportion, around 0.05%.

As for evaluating the effect of the synthetic training data on the model'sperformance, the DEMETR challenge set - which we originally used to evaluate themodels submitted to the WMT23 Metrics Shared Task - was not adequate anymore. Wetherefore created a new DEMETR-style test set based on the WMT22 DA data, withexamples constructed analogically to the synthetic training examples, asdescribed above. This test set helped us determine the right proportions ofsynthetic data for fine-tuning in order to make MetricX robust for the failuremodes in consideration, without sacrificing the system- and segment-levelcorrelations with human ratings.

Usage

Themetricx24/predict.py andmetricx23/predict.py scripts contain examplesfor how to run inference on the models.

Reference-Based

Example usage for a reference-based MetricX-24 model:

python -m metricx24.predict \  --tokenizer google/mt5-xl \  --model_name_or_path google/metricx-24-hybrid-xl-v2p6 \  --max_input_length 1536 \  --batch_size 1 \  --input_file input.jsonl \  --output_file output.jsonl

input.jsonl is expected to have 1 serialized JSON object per line with"source","hypothesis" and"reference" fields. If source segments are notavailable, the empty string can be passed as the value of"source". The outputjsonl will be parallel toinput.jsonl but additionally contain a"prediction" field with the predicted score.

Example usage for a reference-based MetricX-23 model:

python -m metricx23.predict \  --tokenizer google/mt5-xl \  --model_name_or_path google/metricx-23-xl-v2p0 \  --max_input_length 1024 \  --batch_size 1 \  --input_file input.jsonl \  --output_file output.jsonl

For use with MetricX-23 models,input.jsonl is expected to have 1 serializedJSON object per line with only the"hypothesis" and"reference" fields (no"source").

WARNING: The models were trained with a maximum input length of 1536 and 1024tokens, respectively, so significantly increasing that value may lead tounpredictable behavior.

Reference-Free

Example usage for a reference-free MetricX-24 model:

python -m metricx24.predict \  --tokenizer google/mt5-xl \  --model_name_or_path google/metricx-24-hybrid-xl-v2p6 \  --max_input_length 1536 \  --batch_size 1 \  --input_file input.jsonl \  --output_file output.jsonl \  --qe

Since MetricX-24 models are hybrid, for reference-free evaluation they expectthe same input features as for reference-based evaluation, with the differencethat the"reference" field should be passed the empty string.input.jsonlis thus expected to have 1 serialized JSON object per line with"source","hypothesis" and"reference" fields. The output jsonl will be parallel toinput.jsonl but additionally contain a"prediction" field with the predictedscore.

Example usage for a reference-free MetricX-23 model:

python -m metricx23.predict \  --tokenizer google/mt5-xl \  --model_name_or_path google/metricx-23-qe-xl-v2p0 \  --max_input_length 1024 \  --batch_size 1 \  --input_file input.jsonl \  --output_file output.jsonl \  --qe

For use with MetricX-23 models,input.jsonl is expected to have 1 serializedJSON object per line with"source" and"hypothesis" fields (no"reference").

Meta-Evaluation

Themetricx23/evaluate.py andmetricx24/evaluate.py scripts contain code tocalculate various correlations between the MetricX scores and MQM ratings oftranslation quality using theMT Metrics Eval library.

Example usage for a MetricX-24 model:

python -m metricx24.evaluate \  --dataset wmt24 \  --lp en-es \  --input_file input.jsonl \  --output_file output.json

Example usage for a MetricX-23 model:

python -m metricx23.evaluate \  --dataset wmt22 \  --lp en-de \  --input_file input.jsonl \  --output_file output.json

input.jsonl is expected to have one JSON object serialized per line.Each JSON object is expected to contain 4 fields:

"system_id": The name of the system that generated the translation.
"segment_id": The 0-based index of the corresponding segment in the MTMetrics Eval data.
"label": The ground-truth translation quality score (with higher is better).
"prediction": The model predicted translation quality score (with lower isbetter; the script negates the scores so higher is better).

The script will calculate the 4 agreement/correlations that were used in theWMT'23 Metrics Shared Task, as well as the system-level soft pairwise accuracy(SPA), newly added in the WMT'24 Metrics Shared Task. Below are the results forthe MetricX-24 models on the WMT'24 Metrics Shared Task data and for theMetricX-23 models on the WMT'22 Metrics Shared Task data:

MetricX-24 (WMT'24):

Model	Sys-Level SPA (en-de)	Seg-Level Acc (en-de)	Sys-Level SPA (en-es)	Seg-Level Acc (en-es)	Sys-Level SPA (ja-zh)	Seg-Level Acc (ja-zh)
MetricX-24-Hybrid-XXL	0.865	0.543	0.785	0.685	0.878	0.541
MetricX-24-Hybrid-XL	0.884	0.522	0.806	0.683	0.859	0.528
MetricX-24-Hybrid-Large	0.879	0.511	0.795	0.686	0.845	0.514
MetricX-24-Hybrid-QE-XXL	0.884	0.525	0.789	0.685	0.863	0.527
MetricX-24-Hybrid-QE-XL	0.879	0.502	0.774	0.683	0.849	0.509
MetricX-24-Hybrid-QE-Large	0.809	0.490	0.762	0.684	0.847	0.508

NOTE: Since MetricX-24 models are hybrid models, MetricX-24-<size> andMetricX-24-QE-<size> correspond to the same model, evaluatedwith andwithout the references, respectively.

NOTE: The performance of the bfloat16 model counterparts deviates by less than0.001 from the above results across both metrics and all language pairs.

MetricX-23 (WMT'22 en-de):

Model	Sys-Level Acc	Sys-Level Pearson	Seg-Level Pearson	Seg-Level Acc
MetricX-23-XXL	0.795	0.835	0.546	0.619
MetricX-23-XL	0.756	0.813	0.540	0.605
MetricX-23-Large	0.769	0.759	0.507	0.595
MetricX-23-QE-XXL	0.769	0.830	0.490	0.606
MetricX-23-QE-XL	0.718	0.684	0.421	0.594
MetricX-23-QE-Large	0.744	0.671	0.387	0.579

MetricX-23 (WMT'22 en-ru):

Model	Sys-Level Acc	Sys-Level Pearson	Seg-Level Pearson	Seg-Level Acc
MetricX-23-XXL	0.905	0.943	0.477	0.609
MetricX-23-XL	0.876	0.906	0.498	0.589
MetricX-23-Large	0.876	0.841	0.474	0.569
MetricX-23-QE-XXL	0.895	0.940	0.470	0.602
MetricX-23-QE-XL	0.848	0.861	0.415	0.570
MetricX-23-QE-Large	0.819	0.778	0.411	0.551

MetricX-23 (WMT'22 zh-en):

Model	Sys-Level Acc	Sys-Level Pearson	Seg-Level Pearson	Seg-Level Acc
MetricX-23-XXL	0.868	0.919	0.605	0.551
MetricX-23-XL	0.868	0.924	0.584	0.543
MetricX-23-Large	0.857	0.919	0.555	0.539
MetricX-23-QE-XXL	0.857	0.928	0.573	0.544
MetricX-23-QE-XL	0.802	0.879	0.546	0.529
MetricX-23-QE-Large	0.758	0.904	0.522	0.529

Themetricx24/evaluate_wmt24.py script re-calculates the average correlationscore for MetricX-24 that was used to rank submissions from theWMT'24 Metrics Shared Task.Similarly, themetricx23/evaluate_wmt23.py script re-calculates the averagecorrelation score for MetricX-23 that was used to rank submissions from theWMT'23 Metrics Shared Task.

Example usage for a MetricX-24 model:

python -m metricx24.evaluate_wmt24 \  --en_de predictions_ende.jsonl \  --en_es predictions_enes.jsonl \  --ja_zh predictions_jazh.jsonl \  --output_file output.json

Example usage for a MetricX-23 model:

python -m metricx23.evaluate_wmt23 \  --en_de predictions_ende.jsonl \  --he_en predictions_heen.jsonl \  --zh_en predictions_zhen.jsonl \  --output_file output.json

Each of the 3 input files is expected to be in the same format as describedabove. Each file should correspond to running inference on each of the languagepairs from the WMT'24 / WMT'23 dataset. Below are the results for the MetricX-24models on the WMT'24 Metrics Shared Task data and for the MetricX-23 models onthe WMT'22 Metrics Shared Task data:

MetricX-24 (WMT'24):

Model	Average Correlation
MetricX-24-Hybrid-XXL	0.716
MetricX-24-Hybrid-XL	0.714
MetricX-24-Hybrid-Large	0.705
MetricX-24-Hybrid-QE-XXL	0.712
MetricX-24-Hybrid-QE-XL	0.699
MetricX-24-Hybrid-QE-Large	0.683

NOTE: Since MetricX-24 models are hybrid models, MetricX-24-<size> andMetricX-24-QE-<size> correspond to the same model, evaluatedwith andwithout the references, respectively.

MetricX-23 (WMT'23):

Model	Average Correlation
MetricX-23-XXL	0.812
MetricX-23-XL	0.813
MetricX-23-Large	0.794
MetricX-23-QE-XXL	0.797
MetricX-23-QE-XL	0.767
MetricX-23-QE-Large	0.762

Citation

If you use MetricX-24 in your research, please cite the following publication:

@inproceedings{juraska-etal-2024-metricx,title ="{M}etric{X}-24: The {G}oogle Submission to the {WMT} 2024 Metrics Shared Task",author ="Juraska, Juraj  and      Deutsch, Daniel  and      Finkelstein, Mara  and      Freitag, Markus",editor ="Haddow, Barry  and      Kocmi, Tom  and      Koehn, Philipp  and      Monz, Christof",booktitle ="Proceedings of the Ninth Conference on Machine Translation",month = nov,year ="2024",address ="Miami, Florida, USA",publisher ="Association for Computational Linguistics",url ="https://aclanthology.org/2024.wmt-1.35",pages ="492--504",}

If you use MetricX-23 in your research, please cite the following publication:

@inproceedings{juraska-etal-2023-metricx,title ={{MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task}},author ="Juraska, Juraj  and      Finkelstein, Mara  and      Deutsch, Daniel  and      Siddhant, Aditya  and      Mirzazadeh, Mehdi  and      Freitag, Markus",editor ="Koehn, Philipp  and      Haddow, Barry  and      Kocmi, Tom  and      Monz, Christof",booktitle ="Proceedings of the Eighth Conference on Machine Translation",month = dec,year ="2023",address ="Singapore",publisher ="Association for Computational Linguistics",url ="https://aclanthology.org/2023.wmt-1.63",doi ="10.18653/v1/2023.wmt-1.63",pages ="756--767",}

About

No description, website, or topics provided.

Languages

Python100.0%

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Folders and files

Latest commit

History

Repository files navigation

MetricX

Available Models

Changes to the WMT'24 Submission

Changes to the WMT'23 Submission

Synthetic Data

Usage

Reference-Based

Reference-Free

Meta-Evaluation

Citation

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Uh oh!

Contributors2

Uh oh!

Languages

Movatterモバイル変換

License

google-research/metricx

Folders and files

Latest commit

History

Repository files navigation

MetricX

Available Models

Changes to the WMT'24 Submission

Changes to the WMT'23 Submission

Synthetic Data

Usage

Reference-Based

Reference-Free

Meta-Evaluation

Citation

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Uh oh!

Contributors2

Uh oh!

Languages

Packages