How do I program a bigram as a table in Python?

I am working on this assignment and I am stuck at this point. I can't work out how to program bigram frequency in the English language, i.e. 'conditional probability', in Python.

That is, the probability of a token given the preceding token is equal to the probability of their bigram (the co-occurrence of the two tokens) divided by the probability of the preceding token.
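As a small worked illustration of that definition, here is a minimal sketch on a made-up sample string. The text `"abaab"` and the helper name `conditional` are assumptions for the example only. It estimates the ratio from counts: when both probabilities are estimated from the same list of adjacent pairs, the common normalising factor cancels, so dividing counts gives the same result as dividing probabilities.

```python
from collections import Counter

text = "abaab"  # hypothetical sample text, not from the assignment

# Adjacent letter pairs: ('a','b'), ('b','a'), ('a','a'), ('a','b')
pairs = list(zip(text, text[1:]))
pair_counts = Counter(pairs)                  # how often each bigram occurs
first_counts = Counter(a for a, _ in pairs)   # how often each letter is the preceding one

def conditional(b, a):
    """Estimate P(b | a) = count(a followed by b) / count(a followed by anything)."""
    return pair_counts[(a, b)] / first_counts[a]

p_b_given_a = conditional("b", "a")  # 'a' is followed by 'b' in 2 of its 3 occurrences
p_a_given_a = conditional("a", "a")
```

Here 'a' precedes another letter three times and is followed by 'b' twice, so the estimate of P(b | a) is 2/3.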

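I have a text with a lot of letters, and I have calculated the probability of each letter in this text, so that, for example, the letter 'a' accounts for 0.015% of the letters in the text.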

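The letters come from ^a-zA-Z, and what I want is:
How do I make a table whose size is the length of the alphabet in each direction ((alphabet) x (alphabet)), and how do I calculate the conditional probability for each cell?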

Like this:

[[(a|a),(b|a),(c|a),...,(z|a),...(Z|a)]
 [(a|b),(b|b),(c|b),...,(z|b),...(Z|b)]
                    ...       ...
 [(a|Z),(b|Z),(c|Z),...,(z|Z),...(Z|Z)]]

For this I need to calculate probabilities such as: if the current letter is 'a', what is the chance that the next letter is 'a', and so on.

I can't get started. I hope you can get me going, and I hope it is clear what I need to solve.

Assuming your file contains no other punctuation (which is easy enough to strip out):

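import itertools
import string

ALPHABET = string.ascii_letters  # 'a'..'z' followed by 'A'..'Z', 52 letters
INDEX = {letter: i for i, letter in enumerate(ALPHABET)}  # letter -> row/column index

def pairwise(s):
    # yield consecutive pairs: (s0, s1), (s1, s2), ...
    a, b = itertools.tee(s)
    next(b, None)  # advance the second iterator by one; tolerate empty input
    return zip(a, b)

counts = [[0 for _ in range(52)] for _ in range(52)]  # nothing has occurred yet
with open('path/to/input') as infile:
    # get pairwise characters from the text
    for a, b in pairwise(char for line in infile for word in line.split() for char in word):
        # ord(x) - ord('a') is negative for uppercase letters, so look the index up instead
        given = INDEX[a]  # index (in `counts`) of the "given" character
        char = INDEX[b]   # index of the character that follows the "given" character
        counts[given][char] += 1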

# now that we have the number of occurrences, let's divide by the totals to get conditional probabilities

totals = [sum(row) for row in counts]  # how often each "given" character was followed by anything
for given in range(52):
    if not totals[given]:
        continue
    for i in range(len(counts[given])):
        counts[given][i] /= totals[given]

I haven't tested this, but it should be a good start.
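One way to try the approach without an input file is to wrap the same steps in a function that takes a string. This is only a sketch: the function name `bigram_table`, the `ALPHABET`/`INDEX` helpers, and the sample text are assumptions made for the example. Unlike the code above, it silently drops any character outside a-zA-Z, which means letters on either side of a space or punctuation mark are treated as adjacent.

```python
import string

ALPHABET = string.ascii_letters                    # 52 letters: a-z then A-Z
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}   # letter -> row/column index

def bigram_table(text):
    """Return a 52x52 table where table[i][j] estimates P(ALPHABET[j] | ALPHABET[i])."""
    letters = [ch for ch in text if ch in INDEX]   # keep only a-zA-Z
    n = len(ALPHABET)
    table = [[0.0] * n for _ in range(n)]
    # Count how often each letter is followed by each other letter.
    for given, following in zip(letters, letters[1:]):
        table[INDEX[given]][INDEX[following]] += 1
    # Turn each row of counts into conditional probabilities.
    for row in table:
        total = sum(row)
        if total:                                  # rows for unseen letters stay all zeros
            for j in range(n):
                row[j] /= total
    return table

table = bigram_table("abaab")                      # hypothetical sample text
p_b_given_a = table[INDEX["a"]][INDEX["b"]]        # estimated P('b' | 'a')
```

Each row that was seen at least once sums to 1, which is a quick sanity check on the result.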

Here is a dictionary version, which should be easier to read and debug:

counts = {}
with open('path/to/input') as infile:
    # `pairwise` is the helper defined in the first version
    for given, char in pairwise(c for line in infile for word in line.split() for c in word):
        # key by the characters themselves so the result can be looked up by letter
        if given not in counts:
            counts[given] = {}
        if char not in counts[given]:
            counts[given][char] = 0
        counts[given][char] += 1

answer = {}
for given, chardict in counts.items():  # iterate over the counts; `answer` is still empty here
    total = sum(chardict.values())
    answer[given] = {}
    for char, count in chardict.items():
        answer[given][char] = count/total

Now `answer` contains the probabilities you want to compute. If you want the probability of 'a' given 'b', look at answer['b']['a'].
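For comparison, the same idea can be written more compactly with `collections.defaultdict` and `collections.Counter`. This is a swapped-in variant rather than the code above, and the function name `bigram_probabilities` and the sample text are made up for the sketch. It takes a string instead of a file and keeps only ASCII letters.

```python
from collections import Counter, defaultdict

def bigram_probabilities(text):
    """Map each letter to {next_letter: estimated conditional probability}."""
    letters = [ch for ch in text if ch.isascii() and ch.isalpha()]  # keep a-zA-Z only
    counts = defaultdict(Counter)          # counts[given][following] -> occurrences
    for given, following in zip(letters, letters[1:]):
        counts[given][following] += 1
    return {
        given: {ch: c / sum(followers.values()) for ch, c in followers.items()}
        for given, followers in counts.items()
    }

answer = bigram_probabilities("abaab")     # hypothetical sample text
p_a_given_b = answer["b"]["a"]             # probability of 'a' given 'b'
```

Pairs that never occur are simply absent from the nested dictionaries, so something like `answer.get('b', {}).get('z', 0.0)` is a safer lookup than direct indexing.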