Removing bigrams that contain common stopwords
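I have the function below. It returns all the bigrams and trigrams in a sentence. I want to keep only the bigrams and trigrams that do not contain any stopwords. How can I use from nltk.corpus import stopwords to do that?
I know how to remove stopwords before creating the bigrams and trigrams, but I want to remove them after the bigrams and trigrams have been created.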
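import re
from nltk import everygrams
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer

# Assumed setup: the original post does not show how these two objects are created.
wordnet_lemmatizer = WordNetLemmatizer()
snowball_stemmer = SnowballStemmer("english")

def clean_text_function2(text):
    t = text  # contractions.fix(text)
    t = t.lower().split()  # lowercase and split on whitespace
    t = [re.sub(r'[^a-z ]', '', ch) for ch in t]  # remove everything other than a-z
    # t = [word for word in t if word not in stopword]  # removing stop words
    t = [wordnet_lemmatizer.lemmatize(word) for word in t]
    t = [snowball_stemmer.stem(word) for word in t]
    t = ' '.join(t)
    t = list(everygrams(t.split(), 2, 3))  # all bigrams and trigrams
    return t

print(clean_text_function2("i love when it rains a lot and brings temperature down"))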
[('i', 'love'), ('love', 'when'), ('when', 'it'), ('it', 'rain'), ('rain', 'a'), ('a', 'lot'), ('lot', 'and'), ('and', 'bring'), ('bring', 'temperatur'), ('temperatur', 'down'), ('i', 'love', 'when'), ('love', 'when', 'it'), ('when', 'it', 'rain'), ('it', 'rain', 'a'), ('rain', 'a', 'lot'), ('a', 'lot', 'and'), ('lot', 'and', 'bring'), ('and', 'bring', 'temperatur'), ('bring', 'temperatur', 'down')]
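Build a filter that keeps only the tuples containing no stopwords. I'll be deliberately verbose so the technique stays readable.
For each gram, use any to check whether it contains any of the given stopwords.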
grams = [('i', 'love'), ('love', 'when'), ('when', 'it'), ('it', 'rain'), ('rain', 'a'), ('a', 'lot'), ('lot', 'and'), ('and', 'bring'), ('bring', 'temperatur'), ('temperatur', 'down'), ('i', 'love', 'when'), ('love', 'when', 'it'), ('when', 'it', 'rain'), ('it', 'rain', 'a'), ('rain', 'a', 'lot'), ('a', 'lot', 'and'), ('lot', 'and', 'bring'), ('and', 'bring', 'temperatur'), ('bring', 'temperatur', 'down')]
stops = ["a", "and", "it", "the"]
clean = [gram for gram in grams if not any(stop in gram for stop in stops)]
print(clean)
Output:
[('i', 'love'), ('love', 'when'), ('bring', 'temperatur'), ('temperatur', 'down'), ('i', 'love', 'when'), ('bring', 'temperatur', 'down')]
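The same filter can be wrapped in a small helper that takes any stopword collection, which is one way to plug in the NLTK list the question asks about. This is only a sketch: the name drop_stopword_grams is hypothetical, and the four-word list is hand-written so the example runs without downloading NLTK data. Converting the stopwords to a set and using isdisjoint avoids scanning the whole list for every gram.

```python
def drop_stopword_grams(grams, stops):
    """Keep only the n-grams that share no token with the stopword collection."""
    stop_set = set(stops)  # set lookups are fast even for long stopword lists
    return [gram for gram in grams if stop_set.isdisjoint(gram)]

# Hypothetical hand-written list so the sketch runs on its own.
# With NLTK installed and the stopwords corpus downloaded, it could instead be:
#     from nltk.corpus import stopwords
#     stops = stopwords.words("english")
stops = ["a", "and", "it", "the"]

grams = [
    ("i", "love"),
    ("when", "it"),
    ("rain", "a", "lot"),
    ("bring", "temperatur", "down"),
]

clean = drop_stopword_grams(grams, stops)
print(clean)
# [('i', 'love'), ('bring', 'temperatur', 'down')]
```

The full NLTK English list is much larger than these four words (it includes "i", "when" and "down", for example), so it would drop more grams than shown here. Because the grams in the question are built from stemmed tokens, a stemmed form may no longer match its entry in the stopword list, so it may be worth checking a few cases before relying on the result.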