Removing single letter stopwords without removing the letter from words containing it

Question

I am trying to remove stop words from my text.

I tried the code below.

import re
from nltk.corpus import stopwords
sw = stopwords.words("english")
my_text = 'I love coding'
my_text = re.sub("|".join(sw), "", my_text)
print(my_text)

Expected result: love coding. Actual result: I l cng (because 'o' and 've' are both found in the stop word list "sw").

How can I get the expected result?

Answer 1

You need to replace words, not characters:

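from itertools import filterfalse
from nltk.corpus import stopwords
sw = stopwords.words("english")
my_text = 'I love coding'
my_words = my_text.split()  # naive whitespace split into words
# Drop every word found in the stop word list. The match is case-sensitive and
# the list is all lowercase, so the capitalized 'I' is kept here.
no_stopwords = ' '.join(filterfalse(sw.__contains__, my_words))
print(no_stopwords)  # I love coding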

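You should also think about sentence splitting, case sensitivity, and similar issues.

There are libraries that handle this properly, since it is a common and important problem.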
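
As a rough sketch of what those caveats involve, the snippet below tokenizes with a regular expression instead of a plain split and compares words in lowercase. The function name remove_stopwords and the tiny STOP_WORDS set are made up for illustration; the set only stands in for stopwords.words("english") so the example runs without the NLTK corpus.

```python
import re

# Hypothetical stand-in for set(stopwords.words("english")); kept tiny so the
# sketch runs without downloading the NLTK corpus.
STOP_WORDS = {"i", "o", "ve", "d", "the", "a", "is"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop whole-word stop words, ignoring case and punctuation."""
    # A token is a run of letters, optionally with one internal apostrophe.
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    # Compare in lowercase so 'I' and 'The' match the lowercase list.
    return " ".join(t for t in tokens if t.lower() not in stop_words)

print(remove_stopwords("I love coding"))
print(remove_stopwords("The code is a mess, I know."))
```

Because whole tokens are compared, the 'o' and 've' entries never touch the inside of "love". Note that this simple tokenizer discards punctuation, which may or may not be what you want.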

Answer 2

Split the sentence into words before removing the stop words, then filter them:

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes membership tests fast
sentence = 'I love coding'
# Lowercase first so that 'I' matches the lowercase entry 'i'
words = [w for w in sentence.lower().split() if w not in stop]
print(words)            # ['love', 'coding']
print(" ".join(words))  # love coding
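
If you would rather stay with re.sub as in the question, one possible variant is to anchor every stop word to word boundaries. This is only a sketch: the short sw list below is a made-up stand-in for stopwords.words("english"), and strip_stopwords is a hypothetical helper name.

```python
import re

# Hypothetical stand-in for stopwords.words("english").
sw = ["i", "o", "ve", "d", "the", "don", "t", "don't"]

# Longest entries first, so "don't" is tried before "don" and "t".
alternatives = "|".join(re.escape(w) for w in sorted(sw, key=len, reverse=True))

# \b anchors each alternative to word boundaries, so "o" no longer matches
# inside "love"; IGNORECASE lets the lowercase list match "I".
pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

def strip_stopwords(text):
    without = pattern.sub("", text)
    # Collapse the whitespace left behind by the removed words.
    return " ".join(without.split())

print(strip_stopwords("I love coding"))
print(strip_stopwords("I don't love the code"))
```

Unlike the split-based answers, this keeps the original casing of the remaining words, at the cost of a large alternation pattern when the full stop word list is used.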

python

stop-words