Creating a dictionary of words in a string whose values are the words following that word

Question

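I want to build a dictionary from a text file, using each unique word as a key. The value for each key should itself be a dictionary of the words that follow that key, with the count of each following word as its value. For example, something that looks like this: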

>>> string = 'This is a string'
>>> word_counts(string)
{'this': {'is': 1}, 'is': {'a': 1}, 'a': {'string': 1}}

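Creating the dictionary of unique words is no problem; it is building the dictionary of following-word values that I am stuck on. I can't use list.index(), because words may be repeated and it only returns the first match. Beyond that, I'm at a bit of a loss.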

Answer 1

You can use Counter to get what you want:

from collections import Counter, defaultdict

def get_tokens(string):
    return string.split()  # put whatever token-parsing algorithm you want here

def word_counts(string):
    tokens = get_tokens(string)
    following_words = defaultdict(list)
    for i, token in enumerate(tokens):
        if i:
            following_words[tokens[i - 1]].append(token)
    # .items() works on both Python 2 and 3 (iteritems() is Python 2 only)
    return {token: Counter(words) for token, words in following_words.items()}

string = 'this is a string'
print(word_counts(string))  # {'this': Counter({'is': 1}), 'is': Counter({'a': 1}), 'a': Counter({'string': 1})} (key order may differ on older Pythons)

Answer 2

Actually, the collections.Counter class isn't always the best choice for counting things. You can use collections.defaultdict instead:

from collections import defaultdict

def bigrams(text):
    words = text.strip().lower().split()
    counter = defaultdict(lambda: defaultdict(int))
    for prev, current in zip(words[:-1], words[1:]):
        counter[prev][current] += 1
    return counter

Note that if your text also contains punctuation, the line words = text.strip().lower().split() should be replaced with words = re.findall(r'\w+', text.lower()), which also requires import re.
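As a minimal sketch of that substitution, assuming the only goal is to ignore punctuation: the function name bigrams_no_punct is made up for illustration. Be aware that \w+ also splits contractions, so "don't" becomes "don" and "t".

```python
import re
from collections import defaultdict

def bigrams_no_punct(text):
    # Hypothetical variant of bigrams(): tokenize on runs of word
    # characters so that trailing '.', ',' or '!' are dropped.
    words = re.findall(r'\w+', text.lower())
    counter = defaultdict(lambda: defaultdict(int))
    for prev, current in zip(words[:-1], words[1:]):
        counter[prev][current] += 1
    return counter

result = bigrams_no_punct('This is a string. This is, too!')
print(dict(result['this']))  # followers of 'this'
```

With plain split(), 'string.' and 'is,' would have been counted as separate words from 'string' and 'is'.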

If your text is large enough that performance matters, you may want to consider the pairwise recipe from the itertools docs, or, if you are using Python 2, itertools.izip instead of zip.
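On Python 3.10 and later, itertools.pairwise is available directly, so the recipe does not need to be copied. The following is a sketch that assumes you also want it to run on older Python 3 interpreters; the fallback definition and the name bigrams_lazy are illustrative and not part of the answer above.

```python
from collections import defaultdict
from itertools import tee

try:
    from itertools import pairwise  # available in Python 3.10+
except ImportError:
    def pairwise(iterable):
        # Fallback based on the classic itertools recipe.
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

def bigrams_lazy(text):
    # Hypothetical variant of bigrams() that avoids building the two
    # list slices words[:-1] and words[1:].
    words = text.strip().lower().split()
    counter = defaultdict(lambda: defaultdict(int))
    for prev, current in pairwise(words):
        counter[prev][current] += 1
    return counter

counts = bigrams_lazy('a b a b c')
```

The saving is only the two temporary slices; the word list itself is still held in memory.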

Answer 3

Just to offer an alternative (I suspect the other answers fit your needs better), you can use the pairwise recipe from the itertools docs:

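from itertools import tee

try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one element
    return izip(a, b)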

The function can then be written as:

from collections import defaultdict

def word_counts(string):
    words = string.split()
    result = defaultdict(lambda: defaultdict(int))
    for word1, word2 in pairwise(words):
        result[word1][word2] += 1
    return result

Test:

string = 'This is a string is not an int is a string'
print(word_counts(string))

Which produces (the defaultdict wrappers are omitted here for readability):

{'a': {'string': 2}, 'string': {'is': 1}, 'This': {'is': 1}, 'is': {'a': 2, 'not': 1}, 'an': {'int': 1}, 'int': {'is': 1}, 'not': {'an': 1}}
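Printing the nested defaultdict directly shows defaultdict(...) wrappers rather than plain dictionaries. If the plain form is wanted, one possible approach is to convert after counting. In this sketch to_plain_dict is a hypothetical helper name, and word_counts is restated with zip over a slice instead of pairwise so that the snippet runs on its own.

```python
from collections import defaultdict

def word_counts(string):
    words = string.split()
    result = defaultdict(lambda: defaultdict(int))
    # Pair each word with its successor; zip stops at the shorter input.
    for word1, word2 in zip(words, words[1:]):
        result[word1][word2] += 1
    return result

def to_plain_dict(nested):
    # Hypothetical helper: copy both levels into ordinary dicts.
    return {word: dict(followers) for word, followers in nested.items()}

plain = to_plain_dict(word_counts('This is a string is not an int is a string'))
print(plain)
```

A side benefit of the conversion is that looking up a missing key raises KeyError again instead of silently inserting an empty entry.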

python

dictionary

word-count