How to interpret the P numbers that fairseq generate produces?

Question

Using fairseq-generate.py, with the transformer architecture, each translation produces a section like this:

Why is it rare to discover new marine mammal species?
S-0     Why is it rare to discover new marine mam@@ mal species ?
H-0     -0.0643349438905716     Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
P-0     -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015

With this explanation:

H is the hypothesis along with an average log-likelihood; and P is the positional score per token position, including the end-of-sentence marker

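I'm wondering if it is reasonable to say that a low (absolute) number in the P row means higher confidence in that particular word? E.g. does -0.07 for "Pourquoi" mean it was happier about that than it was (-0.1849) for "est-il"? And does the low -0.0015 at the end mean it was really confident the sentence should end there?

Background: What I'm trying to work out is whether I can use either the H number, or somehow use the individual P numbers, to get a confidence measure for a translation. I've been analyzing a handful of translations against the H number and didn't notice much correspondence between it and my subjective opinion of translation quality. But there are a couple where I thought the translation was particularly poor - it had missed a bit of key information - and there the final P number was relatively high, at -0.6099 and -0.3091. (The final P number is -0.11 or so on most of them.)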

Answer 1

Q: I'm wondering if it is reasonable to say a low (absolute) number in the P row means higher confidence in that particular word?

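Yes. As the docs say, "P is the positional score per token position". The score is actually the log-probability, so the higher it is (i.e. the lower the absolute number), the more "confident" the model is. The source code may not be that easy to follow, but the scores are generated by SequenceScorer, and there you can see that the scores are normalized (which includes taking a log, whether you're using a single model or an ensemble). Moreover, when printing the scores, they convert them from base e to base 2: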
```
print('P-{}\t{}'.format(
    sample_id,
    ' '.join(map(
        lambda x: '{:.4f}'.format(x),
        # convert from base e to base 2
        hypo['positional_scores'].div_(math.log(2)).tolist(),
    ))
))
```
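To make the "log-probability" reading concrete, here is a minimal sketch that turns the P values from the question back into plain probabilities. It assumes the printed numbers really are base-2 logs, as the snippet above suggests; the helper names `log2_to_prob` and `log2_to_ln` are made up for this illustration and are not part of fairseq.

```python
import math

# P values copied from the question's example.
# Assumption: they are base-2 log-probabilities, one per target token plus EOS.
p_scores = [-0.0763, -0.1849, -0.0956, -0.0946, -0.0735, -0.1150, -0.1301,
            -0.0042, -0.0321, -0.0171, -0.0052, -0.0062, -0.0015]


def log2_to_prob(score):
    """Convert a base-2 log-probability into a probability in (0, 1]."""
    return 2.0 ** score


def log2_to_ln(score):
    """Convert a base-2 log-probability back to a natural-log one."""
    return score * math.log(2)


token_probs = [log2_to_prob(s) for s in p_scores]

# Position with the lowest score, i.e. where the model was least confident.
least_confident = min(range(len(p_scores)), key=lambda i: p_scores[i])

for position, (score, prob) in enumerate(zip(p_scores, token_probs)):
    print(f"position {position}: log2 score {score:.4f}, probability {prob:.4f}")
print("least confident position:", least_confident)
```

A score closer to zero maps to a probability closer to 1, which is why the last value (-0.0015) reads as near certainty that the sentence ends there.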

Q: What I'm trying to work out is if I can use either the H number, or somehow to use the individual P numbers, to get a confidence measure in its translation.

It turns out that the H value is simply the average of the P values, as you can see here:
```
score_i = avg_probs_i.sum() / tgt_len
```
which is also converted to base 2. You can check that in your example:
```
import numpy as np
# P values from the question; their mean should match the H score
print(np.mean([-0.0763, -0.1849, -0.0956, -0.0946, -0.0735, -0.1150, -0.1301, -0.0042, -0.0321, -0.0171, -0.0052, -0.0062, -0.0015]))
# >>> -0.06433076923076922
```
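Since H is only the mean, it can hide a single weak position. As a rough sketch of how the individual P values could be inspected instead, the snippet below reports the mean, the worst token score and the final (end-of-sentence) score. The function names and the `EOS_THRESHOLD` cut-off are hypothetical; the -0.3 value is only a guess based on the numbers mentioned in the question (-0.11 being typical, -0.3091 and -0.6099 showing up on poor translations), not something fairseq defines or a validated quality metric.

```python
from statistics import mean

# Hypothetical cut-off chosen from the question's observations, not from fairseq.
EOS_THRESHOLD = -0.3


def summarize_scores(p_scores):
    """Return the mean (the H value), the worst token score and the EOS score."""
    return {
        "mean": mean(p_scores),
        "min": min(p_scores),
        "eos": p_scores[-1],  # last position is the end-of-sentence marker
    }


def looks_suspicious(p_scores, eos_threshold=EOS_THRESHOLD):
    """Flag a hypothesis whose end-of-sentence score falls below the cut-off."""
    return p_scores[-1] < eos_threshold


example = [-0.0763, -0.1849, -0.0956, -0.0946, -0.0735, -0.1150, -0.1301,
           -0.0042, -0.0321, -0.0171, -0.0052, -0.0062, -0.0015]

print(summarize_scores(example))
print("suspicious:", looks_suspicious(example))
```

Whether a low end-of-sentence score really tracks missing information would have to be checked against a larger set of judged translations.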
Another measure that is often used to assess the performance of a language model is Perplexity. A good thing is that perplexity can be easily computed from the P values, as shown in the Language Model example of the fairseq repository:
```
# Compute perplexity for a sequence
en_lm.score('Barack Obama is coming to Sydney and New Zealand')['positional_scores'].mean().neg().exp()
# tensor(15.1474)
```
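The same idea can be applied to the P row printed by fairseq-generate, with one caveat: the `.exp()` in the snippet above suggests those scores are natural logs, whereas the printed P values appear to be base 2. A small sketch under that assumption, where `perplexity` is a helper written for this answer and not a fairseq function:

```python
import math


def perplexity(positional_scores, base=2.0):
    """Perplexity from per-token log-probabilities given in the stated log base."""
    avg_log_prob = sum(positional_scores) / len(positional_scores)
    return base ** (-avg_log_prob)


# P values from the question (assumed to be base-2 log-probabilities).
p_scores = [-0.0763, -0.1849, -0.0956, -0.0946, -0.0735, -0.1150, -0.1301,
            -0.0042, -0.0321, -0.0171, -0.0052, -0.0062, -0.0015]

print(perplexity(p_scores))               # base-2 scores, as printed in the P row
print(perplexity([-1.0, -1.0], math.e))   # natural-log scores, as in the LM example
```

A perplexity close to 1 means the model assigned high probability to every token of its own hypothesis; it says nothing by itself about whether the translation is faithful to the source.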
I'm not an expert on NLP, so I can't really tell you which one you should use in your case.


python

transformer

pytorch