Bigram probability
I have a Moby Dick corpus and I need to calculate the probability of the bigram "ivory leg".
I know this command gives me the list of all bigrams:
bigrams = [w1+" "+w2 for w1,w2 in zip(words[:-1], words[1:])]
But how can I get the probability of just these two words?
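You can count all the bigrams and count the specific bigram you are looking for. The probability of the bigram occurring, P(bigram), is the quotient of those two counts. The conditional probability of word[1] given word[0], P(w[1] | w[0]), is the number of occurrences of the bigram divided by the number of occurrences of w[0]. For example, looking at the bigram ('some', 'text'):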
s = 'this is some text about some text but not some other stuff'.split()
bigrams = [(s1, s2) for s1, s2 in zip(s, s[1:])]
# [('this', 'is'),
# ('is', 'some'),
# ('some', 'text'),
# ('text', 'about'),
# ...
number_of_bigrams = len(bigrams)
# 11
# how many times 'some' occurs
some_count = s.count('some')
# 3
# how many times bigram occurs
bg_count = bigrams.count(('some', 'text'))
# 2
# probability of 'text' given 'some', P(text | some)
# i.e. you found `some`, what's the probability that it starts the bigram:
bg_count/some_count
# 0.666
# probability of bigram in text P(some text)
# i.e. pick a bigram at random, what's the probability it's your bigram:
bg_count/number_of_bigrams
# 0.181818
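For a corpus the size of Moby Dick, calling `list.count` for every lookup rescans the whole list each time. A minimal sketch of the same two quotients using `collections.Counter` is below. The helper name `bigram_probabilities` is hypothetical (it is not from the question), and it assumes `words` is already a list of tokens, as in the question:

```python
from collections import Counter

def bigram_probabilities(words, w1, w2):
    """Return (P(w1 w2), P(w2 | w1)) estimated from a list of tokens.

    Hypothetical helper for illustration; assumes `words` is already tokenized.
    """
    # Count every adjacent pair and every single token in one pass each.
    bigram_counts = Counter(zip(words, words[1:]))
    unigram_counts = Counter(words)

    total_bigrams = sum(bigram_counts.values())
    bg_count = bigram_counts[(w1, w2)]  # Counter returns 0 for unseen keys

    # Guard against division by zero for empty input or an unseen first word.
    joint = bg_count / total_bigrams if total_bigrams else 0.0
    conditional = bg_count / unigram_counts[w1] if unigram_counts[w1] else 0.0
    return joint, conditional

words = 'this is some text about some text but not some other stuff'.split()
joint, conditional = bigram_probabilities(words, 'some', 'text')
print(joint, conditional)
```

With the corpus from the question, the call would presumably look like `bigram_probabilities(words, 'ivory', 'leg')`. Whether the tokens need lowercasing or punctuation stripping first depends on how `words` was built, which the question does not show.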