Removing bigrams that contain common stopwords

Question

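I have the function below. It returns all the bigrams and trigrams in a sentence. I want to keep only the bigrams and trigrams that do not contain any stopwords. How can I do that using from nltk.corpus import stopwords?

I know how to remove stopwords before creating the bigrams and trigrams, but I want to remove them after the bigrams and trigrams have been created.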

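import re

from nltk import everygrams
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer

# The post does not show how these two objects were created; the
# English Snowball stemmer is assumed here.
wordnet_lemmatizer = WordNetLemmatizer()
snowball_stemmer = SnowballStemmer("english")


def clean_text_function2(text):
    t = text  # contractions.fix(text)
    t = t.lower().split()  # lowercase, then split on whitespace
    t = [re.sub(r'[^a-z ]', '', ch) for ch in t]  # remove everything other than a-z
    # t = [word for word in t if word not in stopword]  # removing stop words
    t = [wordnet_lemmatizer.lemmatize(word) for word in t]
    t = [snowball_stemmer.stem(word) for word in t]
    t = ' '.join(t)
    t = list(everygrams(t.split(), 2, 3))  # all bigrams and trigrams
    return t


print(clean_text_function2("i love when it rains a lot and brings temperature down"))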

[('i', 'love'), ('love', 'when'), ('when', 'it'), ('it', 'rain'), ('rain', 'a'), ('a', 'lot'), ('lot', 'and'), ('and', 'bring'), ('bring', 'temperatur'), ('temperatur', 'down'), ('i', 'love', 'when'), ('love', 'when', 'it'), ('when', 'it', 'rain'), ('it', 'rain', 'a'), ('rain', 'a', 'lot'), ('a', 'lot', 'and'), ('lot', 'and', 'bring'), ('and', 'bring', 'temperatur'), ('bring', 'temperatur', 'down')]

Answer 1

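Build a filter that keeps only the tuples containing no stopwords. I'll be overly verbose to make sure the technique is readable.

For each gram, use any to check whether it contains any of the given stopwords.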

grams = [('i', 'love'), ('love', 'when'), ('when', 'it'), ('it', 'rain'), ('rain', 'a'), ('a', 'lot'), ('lot', 'and'), ('and', 'bring'), ('bring', 'temperatur'), ('temperatur', 'down'), ('i', 'love', 'when'), ('love', 'when', 'it'), ('when', 'it', 'rain'), ('it', 'rain', 'a'), ('rain', 'a', 'lot'), ('a', 'lot', 'and'), ('lot', 'and', 'bring'), ('and', 'bring', 'temperatur'), ('bring', 'temperatur', 'down')]
stops = ["a", "and", "it", "the"]

clean = [gram for gram in grams if not any(stop in gram for stop in stops)]
print(clean)

Output:

[('i', 'love'), ('love', 'when'), ('bring', 'temperatur'), ('temperatur', 'down'), ('i', 'love', 'when'), ('bring', 'temperatur', 'down')]
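
The same filter can also be written against a set, which may help once the stop list is longer than four words. This is only a sketch: drop_stopword_grams is a hypothetical helper name, not something from NLTK, and the grams below are a shortened sample of the list above.

```python
def drop_stopword_grams(grams, stops):
    """Keep only the n-grams that share no token with the stop list."""
    stop_set = set(stops)  # build once; membership checks are then constant time
    # isdisjoint is True when the gram and the stop set have no token in common
    return [gram for gram in grams if stop_set.isdisjoint(gram)]


grams = [('i', 'love'), ('love', 'when'), ('when', 'it'), ('it', 'rain'),
         ('bring', 'temperatur'), ('when', 'it', 'rain'),
         ('bring', 'temperatur', 'down')]
stops = ["a", "and", "it", "the"]

clean = drop_stopword_grams(grams, stops)
print(clean)
# [('i', 'love'), ('love', 'when'), ('bring', 'temperatur'), ('bring', 'temperatur', 'down')]
```

To use NLTK's list as the question asks, stops could be stopwords.words('english'), which needs the stopwords corpus downloaded once with nltk.download('stopwords'). That list is much longer than the four words here, so it will probably drop more grams. Since the grams are built from stemmed tokens, a stopword whose stem differs from its listed form would not be matched.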

python

nltk

stop-words