Filtering stopwords near punctuation

Question

I am trying to filter stopwords out of my text like this:

clean = ' '.join([word for word in text.split() if word not in (stopwords)])

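The problem is that text.split() produces elements like 'word.' that do not match the stopword 'word'.

I later use clean in sent_tokenize(clean), so I don't want to strip the punctuation entirely.

How can I filter out stopwords, including tokens like 'word.', while retaining the punctuation?

I thought I could pad the punctuation with spaces: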

text = text.replace('.',' . ')

and then

clean = ' '.join([word for word in text.split() if word not in stopwords or word == "."])

But is there a better way?

Answer 1

You could use something like this:

import re

clean = ' '.join([word for word in text.split() if re.match('([a-z]|[A-Z])+', word).group().lower() not in (stopwords)])

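This extracts the leading run of lowercase and uppercase ASCII letters from each token, dropping everything else such as trailing punctuation, and checks it against the words in your stopwords set or list. It also assumes that every word in stopwords is lowercase, which is why I convert each word to lowercase. If that assumption is too strong, take the .lower() call out.

Also, I'm not proficient with regex, so my apologies if there is a cleaner or more robust way to do this.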
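One thing to watch for with the one-liner above: re.match returns None for a token that does not start with an ASCII letter (for example '--' or '(the'), so calling .group() on it raises AttributeError. It also drops the whole token, so the '.' in 'it.' disappears together with the stopword. A minimal sketch of one possible variant follows. It swaps the regex for str.strip(string.punctuation), and the function name remove_stopwords is made up for illustration, not taken from the answer.

```python
import string


def remove_stopwords(text, stopwords):
    """Drop stopwords from whitespace-separated tokens, keeping their punctuation."""
    # Assumption: stopwords should match case-insensitively.
    stop = {w.lower() for w in stopwords}
    kept = []
    for token in text.split():
        # Strip leading/trailing punctuation only for the comparison.
        core = token.strip(string.punctuation)
        if core.lower() in stop:
            # Keep whatever punctuation surrounded the stopword, e.g. 'it.' -> '.'
            leftover = token.replace(core, '', 1)
            if leftover:
                kept.append(leftover)
        else:
            # Non-stopwords (and punctuation-only tokens) pass through unchanged.
            kept.append(token)
    return ' '.join(kept)


print(remove_stopwords("you have to work for it. Now quiet!", ["to", "for", "it"]))
```

With this approach a sentence-final period survives as its own token even when the last word is a stopword, which is the property the question asks for before passing the result to sent_tokenize.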

Answer 2

Tokenize the text first, then remove the stopwords from the tokens. Tokenizers usually recognize punctuation.

import nltk

text = 'Son, if you really want something in this life,\
        you have to work for it. Now quiet! They are about\
        to announce the lottery numbers.'

stopwords = ['in', 'to', 'for', 'the']

sents = []

for sent in nltk.sent_tokenize(text):

    tokens = nltk.word_tokenize(sent)
    sents.append(' '.join([w for w in tokens if w not in stopwords]))

print(sents)

['Son , if you really want something this life , you have work it .', 'Now quiet !', 'They are about announce lottery numbers .']
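Note that the membership test in the loop above is case-sensitive, so a capitalized 'The' would not be removed by a lowercase stopword list. The following is a rough standard-library sketch of the same tokenize-then-filter idea with case-insensitive matching. The regex tokenizer is only a crude stand-in for nltk.word_tokenize and will split some inputs differently, and the names tokenize and filter_stopwords are hypothetical.

```python
import re


def tokenize(sentence):
    # Runs of word characters/apostrophes, or single punctuation marks.
    # This is an approximation, not NLTK's tokenizer.
    return re.findall(r"[\w']+|[^\w\s]", sentence)


def filter_stopwords(sentence, stopwords):
    # Lowercase both sides so 'The' matches the stopword 'the'.
    stop = {w.lower() for w in stopwords}
    return ' '.join(t for t in tokenize(sentence) if t.lower() not in stop)


stopwords = ['in', 'to', 'for', 'the']
print(filter_stopwords('They are about to announce the lottery numbers.', stopwords))
```

Because punctuation marks become separate tokens, they can never collide with a stopword and are always kept.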

python

nlp

nltk