NLTK Replacing the stopwords

I am using NLTK to replace all stopwords with the string "QQQQQ". The problem is that if the input text (the one I remove stopwords from) has more than one sentence, it does not work properly.

I have the following code:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

ex_text='This is an example list that has no special keywords to sum up the list, but it will do. Another list is a very special one this I like very much.'

tokenized=word_tokenize(ex_text)

stop_words=set(stopwords.words('english'))
stop_words.add(".")  #Since I do not need punctuation, I added . and ,
stop_words.add(",")

stopword_pos=[]

# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w in stop_words:
        stopword_pos.append(tokenized.index(w))

# Replacing stopwords with "QQQQQ"
for i in range(len(stopword_pos)):
    tokenized[stopword_pos[i]]='QQQQQ'  

print(tokenized)

This code gives the following output:

['This', 'QQQQQ', 'QQQQQ', 'example', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'special', 'keywords', 'QQQQQ', 'sum', 'QQQQQ', 'QQQQQ', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'Another', 'list', 'is', 'QQQQQ', 'QQQQQ', 'special', 'one', 'QQQQQ', 'I', 'like', 'very', 'much', '.']

You might notice that it does not replace stopwords such as 'is' and '.'. (I added the full stop to the set because I do not want punctuation.)

But mind you, 'is' and '.' do get replaced in the first sentence; it is the 'is' and '.' in the second sentence that do not.

Another strange thing is that when I print stopword_pos, I get the following output:

[0, 1, 2, 5, 6, 7, 10, 12, 13, 15, 16, 17, 18, 19, 20, 1, 24, 25, 0, 29, 25, 20]

You might notice that the numbers seem to be in ascending order, but then suddenly there is a '1' after the '20', where the position of a stopword is supposed to be. There is also a '0' after the '29' and a '20' after the '25'. Maybe that gives a clue about what is going wrong.

So, the problem is that after the first sentence, the stopwords do not get replaced with 'QQQQQ'. Why is that?

Any pointers in the right direction are greatly appreciated. I cannot figure out how to solve this.

The culprit is tokenized.index(w): it gives you the index of the first occurrence of that item in the list.
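For example, with a made-up list:

words = ['is', 'a', 'is', 'b']
for w in words:
    if w == 'is':
        print(words.index(w))  # prints 0 both times -- .index always returns the first match

That is exactly where the stray '1' after the '20' in your stopword_pos comes from: when the loop reaches the 'is' of the second sentence, tokenized.index('is') still returns the position of the 'is' in the first sentence, and the same happens for every repeated stopword.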

So instead of working with indices, you can try an alternative way of replacing the stopwords:

tokenized_new = [ word if word not in stop_words else 'QQQQQ' for word in tokenized ]
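Note that this one-liner replaces every stopword in a single pass, but it does not record the positions; since the comment in your code says you need the stopword positions for later use, you would still have to collect them separately, as below.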

The problem is that .index does not return all the indices, so you will need something along the lines of what is mentioned in this other question:

stopword_pos_set = set() # creating set so that index is not added twice
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w.lower() in stop_words: 
        indices = [i for i, x in enumerate(tokenized) if x == w]
        stopword_pos_set.update(indices)

stopword_pos = list(stopword_pos_set) # convert to list

In the above, I create stopword_pos_set so that the same index is not added twice. With a plain list the replacement step would merely assign the same value twice, but if you print a stopword_pos built without a set you will see the duplicate values.
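Here is a hypothetical mini-example (tokens made up for illustration) showing the duplicates a plain list would accumulate:

tokens = ['is', 'a', 'is']
positions = []
for w in tokens:
    if w in {'is', 'a'}:
        positions.extend(i for i, x in enumerate(tokens) if x == w)
print(positions)  # [0, 2, 1, 0, 2] -- the indices of 'is' are collected twice, once per visit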

One suggestion: in the code above I changed the check to if w.lower() in stop_words: so that the stopword lookup is case-insensitive; otherwise 'This' and 'this' would be treated as different words.
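You can see why with a quick check against the NLTK list (its entries are all lowercase):

print('This' in stop_words)          # False
print('This'.lower() in stop_words)  # True -- 'this' is in the stopword list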

Another suggestion is to add several items to stop_words in one go with the .update method, stop_words.update([".", ","]), instead of calling .add repeatedly.


Putting it all together, you can try it like this:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

ex_text='This is an example list that has no special keywords to sum up the list, but it will do. Another list is a very special one this I like very much.'

tokenized = word_tokenize(ex_text)
stop_words = set(stopwords.words('english'))
stop_words.update([".", ","])  #Since I do not need punctuation, I added . and ,

stopword_pos_set = set()
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w.lower() in stop_words: 
        indices = [i for i, x in enumerate(tokenized) if x == w]
        stopword_pos_set.update(indices)

stopword_pos = sorted(stopword_pos_set) # set to sorted list of positions

# Replacing stopwords with "QQQQQ"
for pos in stopword_pos:
    tokenized[pos] = 'QQQQQ'

print(tokenized)
print(stopword_pos)
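As a side note, if the positions and the replacement are all you need, a single pass with enumerate sidesteps .index entirely. A minimal sketch, assuming a freshly tokenized list and the same stop_words set as above:

stopword_pos = []
for i, w in enumerate(tokenized):
    if w.lower() in stop_words:
        stopword_pos.append(i)   # enumerate hands us each position directly
        tokenized[i] = 'QQQQQ'

print(tokenized)
print(stopword_pos)

Because every position comes straight from enumerate, there are no duplicates to de-duplicate and nothing to sort.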