Bigram probability

Question

I have a Moby Dick corpus and I need to calculate the probability of the bigram "ivory leg". I know this command gives me a list of all the bigrams:

bigrams = [w1+" "+w2 for w1,w2 in zip(words[:-1], words[1:])]

But how can I get the probability of just these two words?

Answer 1

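You can count all the bigrams and count the specific bigram you are looking for. The probability of the bigram occurring, P(bigram), is just the quotient of those two counts. The conditional probability of w[1] given w[0], P(w[1] | w[0]), is the number of occurrences of the bigram divided by the number of occurrences of w[0]. For example, looking at the bigram ('some', 'text'):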

s = 'this is some text about some text but not some other stuff'.split()

bigrams = [(s1, s2) for s1, s2 in zip(s, s[1:])]

# [('this', 'is'),
#  ('is', 'some'),
# ('some', 'text'),
# ('text', 'about'),
# ...

number_of_bigrams = len(bigrams)
# 11

# how many times 'some' occurs 
some_count = s.count('some')
# 3

# how many times bigram occurs
bg_count = bigrams.count(('some', 'text'))
# 2

# probability of 'text' given 'some': P(text | some)
# i.e. you found `some`; what's the probability that it forms the bigram:
bg_count/some_count
# 0.666

# probability of the bigram in the text: P(some text)
# i.e. pick a bigram at random, what's the probability it's your bigram:
bg_count/number_of_bigrams
# 0.181818
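
For a corpus the size of Moby Dick, calling list.count repeatedly gets slow, so it may be worth counting everything once. Below is a minimal sketch of the same two quotients using collections.Counter. The helper name bigram_probabilities and the short sample sentence are made up for illustration (the sentence is not a quotation from the novel), and it assumes `words` is already a list of tokens, lowercased if you want case-insensitive matching.

```python
from collections import Counter

def bigram_probabilities(words, w1, w2):
    """Return (P(w1 w2), P(w2 | w1)) estimated from a list of tokens.

    Hypothetical helper: it mirrors the counts used above, just computed
    in one pass with Counter instead of list.count.
    """
    bigram_counts = Counter(zip(words, words[1:]))  # count of each adjacent pair
    word_counts = Counter(words)                    # count of each single word
    total_bigrams = sum(bigram_counts.values())     # equals len(words) - 1
    count = bigram_counts[(w1, w2)]                 # 0 if the pair never occurs

    # Guard against empty input or an unseen first word
    joint = count / total_bigrams if total_bigrams else 0.0
    conditional = count / word_counts[w1] if word_counts[w1] else 0.0
    return joint, conditional

# Stand-in for the tokenized corpus (assumption: replace with your own `words`)
words = "his ivory leg struck the deck and the ivory leg held".split()
joint, conditional = bigram_probabilities(words, "ivory", "leg")
print(joint, conditional)
```

A bigram that never appears gets probability 0 here. If you need non-zero estimates for unseen bigrams, you would have to add some form of smoothing, which this sketch does not attempt.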

python

n-gram

pycharm