How to use BLEU score to compare your model to existing models?

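I am using the BLEU score metric to compare the performance of my NMT model with existing models. However, I am wondering how many settings I have to match with the other models.

I think matching settings such as dev sets, test sets, and hyperparameters is doable. However, the preprocessing steps I use differ from those of the existing models, so I wonder whether my model's BLEU score can be compared with theirs. It is also possible that the existing models have hidden parameters that were not reported.

https://arxiv.org/pdf/1804.08771.pdf addresses the problem of reporting BLEU and calls for switching to SacreBLEU. But many existing models report BLEU, so I think I cannot use the SacreBLEU score metric for my model.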

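TL;DR

SacreBLEU is not a different metric; it is an implementation of BLEU. The BLEU scores you see reported in papers should therefore be comparable with what you get from SacreBLEU. Use SacreBLEU whenever possible.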

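A brief history of BLEU score

BLEU score is very sensitive to tokenization, so it is important that everyone uses the same tokenization. Originally, there was a Perl implementation from 2001 which was considered the canonical implementation of BLEU for a long time. Using that script has many hassles: it is written in Perl and requires the data to be in a rather obscure SGM format. Because of that (and because BLEU score is fairly simple), many independent implementations appeared, e.g., in MultEval and NLTK. They are easier to use but do not produce the same results because of minor differences in data preprocessing. SacreBLEU does the same tokenization and gets the same score as the original Perl script, but it reads the data as plain text and is written in Python, which is currently the most used programming language in machine translation.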
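To make the tokenization point concrete, here is a small self-contained sketch. It is a simplified single-sentence, single-reference BLEU without smoothing, written only for illustration; it is not the Perl script's or SacreBLEU's implementation, and the two tokenizers and the example sentences are assumptions of mine. The same hypothesis/reference pair gets a different score depending only on how punctuation is split off.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hyp_tokens, ref_tokens, max_n=4):
    """Simplified BLEU (0-100) for one hypothesis and one reference."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp_tokens, n)
        ref_counts = ngrams(ref_tokens, n)
        # Clipped matches: an n-gram counts at most as often as it occurs
        # in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hyp_tokens) > len(ref_tokens):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tokens) / len(hyp_tokens))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays glued to words."""
    return text.split()


def punctuation_tokenize(text):
    """Split punctuation marks into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


hyp = "The cat sat on the mat, then it slept."
ref = "The cat sat on the mat. Then it slept."

score_ws = bleu(whitespace_tokenize(hyp), whitespace_tokenize(ref))
score_punct = bleu(punctuation_tokenize(hyp), punctuation_tokenize(ref))

print(f"whitespace tokenization:  {score_ws:.1f}")
print(f"punctuation tokenization: {score_punct:.1f}")
```

Nothing about the translation changed between the two calls, yet the scores differ, which is why scores computed with different preprocessing pipelines should not be compared directly.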

machine-translation

neural-mt

seq2seq