Why won't my program filter out stop words and punctuation as I programmed it to do? (Python & NLTK)

For a lab in my data science course, I have to create a natural language processing program in Python using NLTK. We have to use a for loop to iterate over every word in Macbeth and filter out all of the English stop words and punctuation by appending the non-stopword, non-punctuation words to another list. Then we have to print the most common words and their frequencies from that filtered list. I thought I had the logic right, but the results still include punctuation and stop words (see below). What am I doing wrong here? (P.S. This is my first time using NLTK.)

Program:

# import required libraries and modules
import nltk
from nltk.corpus import gutenberg, stopwords
from nltk.probability import FreqDist

macbeth_allwords = gutenberg.words('shakespeare-macbeth.txt') # read in words from macbeth
macbeth_noStop = [] # empty list to hold words from macbeth excluding stopwords
punctuations = [".", "!", "?", ",", ";", ":", "-", "[", "]", "{", "}", "(", ")", "/", "*", "~",
"<", ">", "`", "^", "_", "|", "#", "$", "%", "+", "=", "&", "@", " "] # list of common punctuation characters

# iterate through each word in macbeth, making a new list excluding all the stopwords and punctuation characters
for word in macbeth_allwords:
    if (word not in stopwords.words('english')) or (word not in punctuations):
        macbeth_noStop.append(word)

macbeth_freq = FreqDist(macbeth_noStop) # get word frequencies from the filtered list of words from macbeth

# print the 50 most common words from the filtered list of words from macbeth
print("50 Most Common Words in Macbeth (no stopwords or punctuation):")
print("-----------------------------------------------")
print(macbeth_freq.most_common(50))

Output:

50 Most Common Words in Macbeth (no stopwords or punctuation):
-----------------------------------------------
[(',', 1962), ('.', 1235), ("'", 637), ('the', 531), (':', 477), ('and', 376), ('I', 333), ('of', 315), ('to', 311), ('?', 241), ('d', 224), ('a', 214), ('you', 184), ('in', 173), ('my', 170), ('And', 170), ('is', 166), ('that', 158), ('not', 155), ('it', 138), ('Macb', 137), ('with', 134), ('s', 131), ('his', 129), ('be', 124), ('The', 118), ('haue', 117), ('me', 111), ('your', 110), ('our', 103), ('-', 100), ('him', 90), ('for', 82), ('Enter', 80), ('That', 80), ('this', 79), ('he', 76), ('What', 74), ('To', 73), ('so', 70), ('all', 67), ('thou', 63), ('are', 63), ('will', 62), ('Macbeth', 61), ('thee', 61), ('but', 60), ('But', 60), ('on', 59), ('they', 58)]

Everything works except the logical condition: you meant to use and rather than or. Since no token is both a stop word and a punctuation character, at least one of the two not in tests is always true, so the or condition lets every word through:

if word not in stopwords.words('english') and word not in punctuations:

Pedantic note: you could use a set instead of a list for the punctuation characters, which makes the lookups faster :)
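
A minimal sketch of the corrected loop, reusing macbeth_allwords and the punctuations list from the question's program, with the optional set conversion suggested above:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # build the stop word set once instead of on every iteration
punct_set = set(punctuations)                 # the question's punctuation list, converted to a set for fast lookups

macbeth_noStop = []
for word in macbeth_allwords:
    # keep the word only if it is neither a stop word nor a punctuation character
    if word not in stop_words and word not in punct_set:
        macbeth_noStop.append(word)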

As mentioned in the previous answer, the operator used is incorrect. Additionally, you can import string and use string.punctuation instead of maintaining your own punctuation list:

macbeth_noStop = [token for token in macbeth_allwords if token not in string.punctuation and token not in stopwords.words('english')]
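
For reference, string.punctuation is simply a string of the ASCII punctuation characters, so the in test above checks whether a token occurs in that string:

import string

print(string.punctuation)  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~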

I think this would be slightly more efficient (and still readable):

[word for word in tokenized if not (word in nltk.corpus.stopwords.words("english") or word in string.punctuation)]
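
Putting the suggestions together (the and fix, string.punctuation, and a stop word set built once rather than rebuilt for every token), a complete sketch under the same assumptions as the question's program might look like this:

import string
from nltk.corpus import gutenberg, stopwords
from nltk.probability import FreqDist

stop_words = set(stopwords.words('english'))   # load the stop word list once and store it as a set
punctuation = set(string.punctuation)          # set of ASCII punctuation characters

macbeth_allwords = gutenberg.words('shakespeare-macbeth.txt')
# keep only tokens that are neither stop words nor punctuation
macbeth_noStop = [word for word in macbeth_allwords
                  if word not in stop_words and word not in punctuation]

macbeth_freq = FreqDist(macbeth_noStop)
print(macbeth_freq.most_common(50))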