1. BLEURT is a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples (a usage sketch follows this list).
2. A novel pre-training scheme on millions of synthetic examples helps the model generalize.
3. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset.
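As a concrete illustration of how a learned metric like BLEURT is applied, the sketch below scores candidate sentences against references with the open-source google-research/bleurt package. The checkpoint path and example sentences are illustrative assumptions, not details taken from the article.

```python
# Minimal sketch, assuming the google-research/bleurt package is installed
# and a BLEURT checkpoint has been downloaded and unpacked locally.
from bleurt import score

CHECKPOINT = "BLEURT-20"  # hypothetical path to a local checkpoint directory

scorer = score.BleurtScorer(CHECKPOINT)

references = ["The cat sat on the mat."]
candidates = ["A cat was sitting on the mat."]

# score() returns one float per (reference, candidate) pair; the value is a
# learned estimate of human judgment, so higher means the candidate is rated
# closer to the reference.
scores = scorer.score(references=references, candidates=candidates)
print(scores)
```

Note that the argument order matters: BLEURT is fine-tuned on (reference, candidate) pairs, so swapping the two lists will generally change the scores.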
The article presents BLEURT, a new evaluation metric for text generation that is based on BERT and can be trained with a few thousand possibly biased training examples. It claims that the metric achieves state-of-the-art results on two benchmarks (the WMT Metrics shared task and the WebNLG Competition dataset), but it offers no evidence for this claim and does not discuss potential biases in the training data. It also explores no counterarguments and does not present both sides of the argument equally, nor does it note the risks or limitations of relying on the metric. Without further evidence and a discussion of these biases and risks, it is difficult to assess the article's trustworthiness and reliability.