Find the most common multi-word phrases in an input file in Python

Question

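Suppose I have a text file. I can easily find the most common words using Counter. However, I would also like to find multi-word phrases such as "tax year", "fly fishing", "u.s. capitol", etc., that is, the words that occur together most often.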

import re
from collections import Counter

with open('full.txt') as f:
    passage = f.read()

words = re.findall(r'\w+', passage)

cap_words = [word for word in words]

word_counts = Counter(cap_words)

for k, v in word_counts.most_common():
    print(k, v)

This is what I have so far, but it only finds single words. How can I find multi-word phrases?

Answer 1

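What you are looking for is a way to count bigrams (strings containing two words).

The nltk library is great for a wide range of language-related tasks, and you can use Counter from collections for all the counting!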

import nltk
from nltk import bigrams
from collections import Counter

# passage is the file text read in the question
tokens = nltk.word_tokenize(passage)
print(Counter(bigrams(tokens)))

Answer 2

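What you call multiwords (there is no such thing) are actually called bigrams. You can get a list of bigrams from a list of words by zipping the list with a copy of itself shifted by one position: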

bigrams = [f"{x} {y}" for x, y in zip(words, words[1:])]

P.S. NLTK is indeed a better tool for getting bigrams.
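For completeness, the zip approach can be combined with Counter to get actual counts, and it extends to phrases longer than two words. The following is a minimal sketch, not part of the original answer: the helper name count_ngrams, the sample sentence, and the choice to lowercase the text are all assumptions made for illustration.

```python
import re
from collections import Counter


def count_ngrams(text, n=2):
    """Count runs of n consecutive words in text (n=2 gives bigrams).

    Hypothetical helper for illustration; lowercasing is an assumption.
    """
    words = re.findall(r"\w+", text.lower())
    # Build n copies of the word list, each shifted one position further,
    # then zip them so every tuple holds n consecutive words.
    shifted = [words[i:] for i in range(n)]
    return Counter(" ".join(gram) for gram in zip(*shifted))


# Made-up sample text standing in for the contents of full.txt
sample = "The tax year ends soon. Every tax year brings fly fishing and a new tax year."

bigram_counts = count_ngrams(sample, 2)
for phrase, count in bigram_counts.most_common(3):
    print(phrase, count)
```

Note that the \w+ pattern from the question splits on punctuation, so a phrase like "u.s. capitol" would come out as the separate words "u", "s" and "capitol"; a tokenizer such as the one in NLTK may handle such cases better.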


python

nltk

python-3.x