What is the difference between mteval-v13a.pl and NLTK BLEU?

Question

Python NLTK has an implementation of the BLEU score, nltk.translate.bleu_score.corpus_bleu

But I'm not sure whether it is the same as the mteval-v13a.pl script.

What are the differences between them?

Answer 1

TL;DR

Use https://github.com/mjpost/sacrebleu when evaluating machine translation systems.

In Short

No, the BLEU in NLTK is not exactly the same as mteval-v13a.pl.

nltk.translate.bleu_score.corpus_bleu corresponds to mteval-v13a.pl up to 4th-order n-grams, with some floating point discrepancies.

To download the evaluation data:

import nltk
nltk.download('wmt15_eval')

In Long

The major differences between mteval-v13a.pl and nltk.translate.bleu_score.corpus_bleu are:

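- The first difference is that mteval-v13a.pl comes with its own NIST tokenizer, while the NLTK version of BLEU is an implementation of the metric only and assumes that the input is pre-tokenized.
  - By the way, an ongoing PR is meant to bridge the gap between the NLTK and NIST tokenizers.
- The other major difference is that mteval-v13a.pl expects its input in .sgm format, while NLTK BLEU takes a Python list of lists of strings. See the README.txt in the wmt15_eval zipball for more information on how to convert a text file to SGM.
- mteval-v13a.pl expects n-gram orders of at least 1-4. If the highest n-gram order available in the sentence/corpus is less than 4, it returns a probability of 0, whose logarithm is negative infinity (float('-inf')); math.log(0) itself raises an error in Python. To emulate this behavior, NLTK has an _emulate_multibleu flag:
  - See https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L477
- mteval-v13a.pl is able to generate NIST scores, while NLTK does not have a NIST score implementation (at least not yet).
  - A NIST score in NLTK is upcoming in a PR.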
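To make the zero-probability case concrete, here is a minimal, standard-library-only sketch of unsmoothed BLEU over pre-tokenized input. It is not the code of mteval-v13a.pl or of NLTK. The helper names (ngrams, modified_precision, unsmoothed_bleu) are made up for illustration, and tokenization is assumed to be plain whitespace splitting rather than the NIST tokenizer.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, hypothesis, n):
    """Clipped n-gram counts for one hypothesis: (matches, total)."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matches = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matches, sum(hyp_counts.values())


def unsmoothed_bleu(references, hypothesis, max_order=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    log_sum = 0.0
    for n in range(1, max_order + 1):
        matches, _total = modified_precision(references, hypothesis, n)
        if matches == 0:
            # A zero precision has log = -inf, so the whole score is 0.
            return 0.0
        log_sum += math.log(matches / _total) / max_order
    hyp_len = len(hypothesis)
    # Use the reference length closest to the hypothesis length.
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - hyp_len), length))
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_sum)


# Input is already tokenized: each sentence is a list of strings.
references = ["the cat is on the mat".split()]
short_score = unsmoothed_bleu(references, "the cat".split())
full_score = unsmoothed_bleu(references, "the cat is on the mat".split())
print(short_score)
print(full_score)
```

The two-token hypothesis has no 3-grams or 4-grams, so its score collapses to 0 even though every word matches the reference.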

Beyond these differences, the NLTK BLEU score packs in more features:

- It handles fringe cases that the original BLEU (Papineni, 2002) overlooked.
  - See https://github.com/nltk/nltk/pull/1383
- Also, to handle the fringe case where the largest n-gram order is < 4, the uniform weights of the individual n-gram precisions are reweighted so that the weights sum to 1.0.
  - See https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L175
- While NIST has a smoothing method, geometric sequence smoothing, NLTK has an equivalent object (SmoothingFunction) with the same smoothing method, plus several more smoothing methods for sentence-level BLEU from Chen and Cherry (2014).
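The last two points can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not NLTK's or NIST's actual code: reweigh_uniform and geometric_sequence_smoothing are hypothetical names, the reweighting is applied unconditionally here, and the smoothing replaces each zero-match precision with 1 / (2^k * total), where k counts the zero-match orders seen so far.

```python
from fractions import Fraction


def reweigh_uniform(hyp_len, max_order=4):
    """Uniform weights over the n-gram orders the hypothesis can contain."""
    if hyp_len < 1:
        raise ValueError("hypothesis must contain at least one token")
    order = min(hyp_len, max_order)
    return tuple(1.0 / order for _ in range(order))


def geometric_sequence_smoothing(counts):
    """Smooth zero-match precisions with a geometric sequence.

    counts is a list of (matches, total) pairs, one per n-gram order,
    each with total > 0.
    """
    smoothed = []
    k = 1
    for matches, total in counts:
        if matches == 0:
            # Each successive zero halves the pseudo-count: 1/2, 1/4, ...
            smoothed.append(Fraction(1, 2 ** k * total))
            k += 1
        else:
            smoothed.append(Fraction(matches, total))
    return smoothed


weights_short = reweigh_uniform(2)   # only unigrams and bigrams exist
weights_long = reweigh_uniform(7)    # capped at the default 4 orders
smoothed = geometric_sequence_smoothing([(5, 6), (3, 5), (0, 4), (0, 3)])
print(weights_short)
print(weights_long)
print(smoothed)
```

With smoothing, a sentence without any matching 3-grams or 4-grams gets small non-zero precisions instead of a score of 0, which is why it matters mostly for sentence-level BLEU.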

Lastly, to validate the features added in NLTK's version of BLEU, regression tests were added for them; see https://github.com/nltk/nltk/blob/develop/nltk/test/unit/translate/test_bleu.py
