Clean a .txt file and count the most frequent words

Question

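I need to:

1) Clean a .txt file using a list of stopwords, which I have in a separate .txt file.

2) After that, count the 25 most frequent words.

This is what I came up with for the first part: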

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-

import re
from collections import Counter

f=open("text_to_be_cleaned.txt")
txt=f.read()
with open("stopwords.txt") as f:
    stopwords = f.readlines()
stopwords = [x.strip() for x in stopwords]

querywords = txt.split()
resultwords  = [word for word in querywords if word.lower() not in stopwords]
cleantxt = ' '.join(resultwords)

For the second part, I am using this code:

words = re.findall(r'\w+', cleantxt)
lower_words = [word.lower() for word in words]
word_counts = Counter(lower_words).most_common(25)
top25 = word_counts[:25]

print top25

The source file to be cleaned looks like this:

(b)

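in the first sentence of the second paragraph, "to the High Representative" shall be inserted at the end; in the second sentence, "shall hold an annual debate" shall be replaced by "shall hold a debate twice a year" and the words "including the common security and defence policy" shall be inserted at the end.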

The stopword list looks like this: this this they you the then therefore nest there they

When I run all of this, the output still contains words from the stopword list:
[('article', 911), ('european', 586), ('the', 586), ('council', 569), ('union', 530), ('member', 377), ('states', 282), ('parliament', 244), ('commission', 230), ('accordance', 217), ('treaty', 187), ('in', 174), ('procedure', 161), ('policy', 137), ('cooperation', 136), ('legislative', 136), ('acting', 130), ('act', 125), ('amended', 125), ('state', 123), ('provisions', 115), ('security', 113), ('measures', 111), ('adopt', 109), ('common', 108)]

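As you can probably tell, I have only just started learning Python, so simple explanations would be much appreciated! :)

The files used can be found here: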

Stopwordlist

File to be cleaned

Edit: added examples of the source file, the stopword file, and the output. Provided the source files.

Answer 1

This is a wild guess, but I think the problem is here:

querywords = txt.split()

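You only split the text on whitespace, which means that some stopwords may still be stuck to punctuation and therefore are not filtered out in the next step.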

>>> text = "Text containing stop words like a, the, and similar"
>>> stopwords = ["a", "the", "and"]
>>> querywords = text.split()
>>> cleantxt = ' '.join(w for w in querywords if w not in stopwords)
>>> cleantxt
'Text containing stop words like a, the, similar'

Instead, you could use re.findall, as you do later in your code:

>>> querywords = re.findall(r"\w+", text)
>>> cleantxt = ' '.join(w for w in querywords if w not in stopwords)
>>> cleantxt
'Text containing stop words like similar'

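Note, however, that this splits compound words such as "re-arranged" into "re" and "arranged". If that is not what you want, you can instead split on whitespace only and then strip (some) punctuation characters from each token (though the text may contain others):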

querywords = [w.strip(" ,.-!?") for w in txt.split()]

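Changing just this one line seems to fix the problem for the input file you provided.

The rest looks fine, though there are a few minor points:

You should convert stopwords to a set so that each lookup is O(1) instead of O(n).
Make sure the stopwords are lowercased, if they are not already.
There is no need to ' '.join the cleaned text if you intend to split it again afterwards.
top25 = word_counts[:25] is redundant; the list already has at most 25 elements.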
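To make the difference between the two tokenization approaches concrete, here is a small sketch in Python 3. The sample sentence and the variable names by_regex and by_strip are made up for illustration; they are not taken from your files.

```python
import re

# Made-up sample sentence containing a hyphenated compound word.
text = "The items were re-arranged, then counted."

# \w+ does not match the hyphen, so the compound word is split in two.
by_regex = re.findall(r"\w+", text)

# Splitting on whitespace and stripping punctuation from the ends of each
# token keeps the hyphen that sits inside the word.
by_strip = [w.strip(" ,.-!?") for w in text.split()]

print(by_regex)
print(by_strip)
```

Note that strip only removes characters from the ends of each token, so punctuation that is not in the strip string (parentheses or quotes, for instance) would still stay attached.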
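Putting these points together, a minimal sketch might look like the following. It assumes Python 3; the helper name top_words and the sample data are hypothetical, and it strips string.punctuation instead of the shorter " ,.-!?" used above, which is my substitution and may or may not suit your text.

```python
import string
from collections import Counter


def top_words(text, stopword_lines, n=25):
    """Return the n most common non-stopwords in text (hypothetical helper)."""
    # A set gives O(1) membership tests; lowercase so matching is case-insensitive.
    stopwords = {line.strip().lower() for line in stopword_lines if line.strip()}
    # Split on whitespace, then strip punctuation from both ends of each token.
    tokens = (w.strip(string.punctuation).lower() for w in text.split())
    # Count the tokens directly; there is no need to join and split again.
    counts = Counter(t for t in tokens if t and t not in stopwords)
    # most_common(n) already returns at most n entries, so no extra slice is needed.
    return counts.most_common(n)


# Made-up sample data standing in for the two files.
sample = "The Council, the Commission and the Council shall act."
sample_stopwords = ["the\n", "and\n", "shall\n"]
print(top_words(sample, sample_stopwords))
```

With real files, you would pass f.read() for the text and f.readlines() for the stopword lines.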

Answer 2

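Your code is almost there. The main mistake is that you run the regex \w+ to extract words only after you have already "cleaned" the tokens produced by str.split. This does not work, because punctuation is still attached to the str.split results, so those tokens never match a stopword. Try the following code instead.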

import re
from collections import Counter

with open('treaty_of_lisbon.txt', encoding='utf8') as f:
    target_text = f.read()

with open('terrier-stopwords.txt', encoding='utf8') as f:
    stop_word_lines = f.readlines()

target_words = re.findall(r'[\w-]+', target_text.lower())
stop_words = set(map(str.strip, stop_word_lines))

interesting_words = [w for w in target_words if w not in stop_words]
interesting_word_counts = Counter(interesting_words)

print(interesting_word_counts.most_common(25))
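To see why the order of operations matters, here is a small sketch that reproduces the problem in miniature and compares it with tokenizing first. The sample sentence, the two-word stopword set, and the names buggy and fixed are made up for illustration.

```python
import re
from collections import Counter

# Made-up stand-ins for the stopword file and the text to be cleaned.
stopwords = {"the", "in"}
txt = "In the Treaty, the Council (the Union) acts."

# Filtering the str.split tokens misses "(the", because "(the" != "the".
kept = [w for w in txt.split() if w.lower() not in stopwords]
# The later \w+ pass drops the parenthesis, so "the" reappears in the counts.
buggy = Counter(w.lower() for w in re.findall(r"\w+", " ".join(kept)))

# Tokenizing first and filtering afterwards removes every occurrence.
fixed = Counter(w for w in re.findall(r"[\w-]+", txt.lower()) if w not in stopwords)

print(buggy["the"], fixed["the"])
```

A Counter returns 0 for a missing key, so fixed["the"] can be printed safely even though "the" was filtered out.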


python

string

word-count

python-2.7