How to compute perplexity using KenLM?
Let's say we build a model on this data:
$ wget https://gist.githubusercontent.com/alvations/1c1b388456dc3760ffb487ce950712ac/raw/86cdf7de279a2b9bceeb3adb481e42691d12fbba/something.txt
$ lmplz -o 5 < something.txt > something.arpa
Following the perplexity formula (https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf), i.e. summing the inverse log probabilities and taking the n-th root of the sum, the perplexity value is unusually small:
>>> import math
>>> import kenlm
>>> m = kenlm.Model('something.arpa')
# Sentence seen in data.
>>> s = 'The development of a forward-looking and comprehensive European migration policy,'
>>> list(m.full_scores(s))
[(-0.8502398729324341, 2, False), (-3.0185394287109375, 3, False), (-0.3004383146762848, 4, False), (-1.0249041318893433, 5, False), (-0.6545327305793762, 5, False), (-0.29304179549217224, 5, False), (-0.4497605562210083, 5, False), (-0.49850910902023315, 5, False), (-0.3856896460056305, 5, False), (-0.3572353720664978, 5, False), (-1.7523181438446045, 1, False)]
>>> n = len(s.split())
>>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
>>> math.pow(sum_inv_logs, 1.0/n)
1.2536033936438895
Retrying with a sentence that does not appear in the data:
# Sentence not seen in data.
>>> s = 'The European developement of a forward-looking and comphrensive society is doh.'
>>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
>>> sum_inv_logs
35.59524390101433
>>> n = len(s.split())
>>> math.pow(sum_inv_logs, 1.0/n)
1.383679905428275
And trying again with completely out-of-domain data:
>>> s = """On the evening of 5 May 2017, just before the French Presidential Election on 7 May, it was reported that nine gigabytes of Macron's campaign emails had been anonymously posted to Pastebin, a document-sharing site. In a statement on the same evening, Macron's political movement, En Marche!, said: "The En Marche! Movement has been the victim of a massive and co-ordinated hack this evening which has given rise to the diffusion on social media of various internal information"""
>>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
>>> sum_inv_logs
282.61719834804535
>>> n = len(list(m.full_scores(s)))
>>> n
79
>>> math.pow(sum_inv_logs, 1.0/n)
1.0740582373271952
Although the longer sentences are expected to have lower perplexity, it is strange that the differences are less than 1.0 and in the range of decimals.
Is the above the right way to compute perplexity with KenLM? If not, does anyone know how to compute perplexity with KenLM through the Python API?
The perplexity formula is:

PP(W) = P(w_1 w_2 ... w_N)^(-1/N)

But that is in terms of raw probabilities, so in code:
import math
import numpy as np
import kenlm

m = kenlm.Model('something.arpa')
s = 'The development of a forward-looking and comprehensive European migration policy,'

# The scores are log base 10 probabilities, so 10**(-score) is the inverse probability:
product_inv_prob = np.prod([math.pow(10.0, -score) for score, _, _ in m.full_scores(s)])
n = len(list(m.full_scores(s)))
perplexity = math.pow(product_inv_prob, 1.0/n)
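One practical caveat with the product form: multiplying many per-token probabilities can underflow to zero for long sentences, which is one reason the log-space version below is usually preferred. A minimal illustration using only the standard library:

import math

# 100 tokens with probability 1e-5 each: the raw product underflows,
# but the sum of log10 probabilities stays perfectly representable.
probs = [1e-5] * 100
print(math.prod(probs))                   # 0.0 due to floating-point underflow
print(sum(math.log10(p) for p in probs))  # approximately -500.0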
Or work directly with the log (base 10) probabilities:
sum_inv_logprob = -1 * sum(score for score, _, _ in m.full_scores(s))
n = len(list(m.full_scores(s)))
perplexity = math.pow(10.0, sum_inv_logprob / n)
Source: https://www.mail-archive.com/moses-support@mit.edu/msg15341.html
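As a sanity check, the log-space computation can be wrapped in a small helper and compared against the built-in model.perplexity. A rough sketch (assuming, as in KenLM's Python API, that full_scores also scores the end-of-sentence token, so n is taken from the number of scored tokens rather than from len(s.split())):

import math
import kenlm

m = kenlm.Model('something.arpa')

def perplexity(model, sentence):
    # full_scores yields (log10 prob, ngram length, oov flag) for each scored
    # token, including the final </s> token.
    scores = [score for score, _, _ in model.full_scores(sentence)]
    return math.pow(10.0, -sum(scores) / len(scores))

s = 'The development of a forward-looking and comprehensive European migration policy,'
print(perplexity(m, s))   # should be close to the built-in value below
print(m.perplexity(s))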
See https://github.com/kpu/kenlm/blob/master/python/kenlm.pyx#L182
import kenlm

model = kenlm.Model("something.arpa")
per = model.perplexity("your text sentence")
print(per)
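If a single perplexity over several sentences is needed rather than a per-sentence value, one rough sketch is to accumulate the sentence-level log scores and token counts yourself (assuming model.score returns the total log10 probability of a sentence, and counting one extra token per sentence for </s> to mirror what model.perplexity does):

import math
import kenlm

m = kenlm.Model('something.arpa')

sentences = [
    'The development of a forward-looking and comprehensive European migration policy,',
    'The European developement of a forward-looking and comphrensive society is doh.',
]

total_log10 = 0.0   # sum of log10 P(sentence) over the corpus
total_tokens = 0    # tokens scored per sentence: the words plus </s>

for sent in sentences:
    total_log10 += m.score(sent)            # log10 probability of the whole sentence
    total_tokens += len(sent.split()) + 1   # +1 for </s>

corpus_ppl = math.pow(10.0, -total_log10 / total_tokens)
print(corpus_ppl)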
Just a small comment on alvas's answer: the line
sum_inv_logprob = sum(score for score, _, _ in m.full_scores(s))
should actually be:
sum_inv_logprob = -1.0 * sum(score for score, _, _ in m.full_scores(s))
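The sign matters because log probabilities are at most 0, so without the -1.0 the exponent is negative and the "perplexity" always lands in (0, 1]. A quick way to see the difference, reusing m and s from the earlier answer:

import math

scores = [score for score, _, _ in m.full_scores(s)]
n = len(scores)

wrong = math.pow(10.0, sum(scores) / n)    # missing the -1.0: value <= 1
right = math.pow(10.0, -sum(scores) / n)   # actual perplexity: value >= 1
print(wrong, right)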
You can simply use:
import kenlm

m = kenlm.Model('something.arpa')
ppl = m.perplexity('something')