NLTK Replacing the stopwords

I am using NLTK to replace all stopwords with the string "QQQQQ". The problem is that if the input text (the one I remove stopwords from) has more than one sentence, it does not work properly.

I have the following code:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

ex_text='This is an example list that has no special keywords to sum up the list, but it will do. Another list is a very special one this I like very much.'

tokenized=word_tokenize(ex_text)

stop_words=set(stopwords.words('english'))
stop_words.add(".")  #Since I do not need punctuation, I added . and ,
stop_words.add(",")

stopword_pos=[]

# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w in stop_words:
        stopword_pos.append(tokenized.index(w))

# Replacing stopwords with "QQQQQ"
for i in range(len(stopword_pos)):
    tokenized[stopword_pos[i]]='QQQQQ'  

print(tokenized)

This code gives the following output:

['This', 'QQQQQ', 'QQQQQ', 'example', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'special', 'keywords', 'QQQQQ', 'sum', 'QQQQQ', 'QQQQQ', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'Another', 'list', 'is', 'QQQQQ', 'QQQQQ', 'special', 'one', 'QQQQQ', 'I', 'like', 'very', 'much', '.']

You might notice that it does not replace stopwords such as 'is' and '.'. (I added the full stop to the set because I do not want punctuation.)

But mind you, 'is' and '.' do get replaced in the first sentence; it is the 'is' and '.' in the second sentence that do not.

Another strange thing is that when I print stopword_pos, I get the following output:

[0, 1, 2, 5, 6, 7, 10, 12, 13, 15, 16, 17, 18, 19, 20, 1, 24, 25, 0, 29, 25, 20]

You might notice that the numbers seem to be in ascending order, but then suddenly there is a '1' after the '20', where the position of a stopword is supposed to be. There is also a '0' after the '29' and a '20' after the '25'. Maybe that gives a clue about what is going wrong.

So, the problem is that after the first sentence, the stopwords do not get replaced with 'QQQQQ'. Why is that?

Any pointers in the right direction are greatly appreciated. I cannot figure out how to solve this.

The culprit is tokenized.index(w): it gives you the index of the first occurrence of that item in the list.
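For example, with a made-up list:

words = ['is', 'a', 'is', 'b']
for w in words:
    if w == 'is':
        print(words.index(w))  # prints 0 both times -- .index always returns the first match

That is exactly where the stray '1' after the '20' in your stopword_pos comes from: when the loop reaches the 'is' of the second sentence, tokenized.index('is') still returns the position of the 'is' in the first sentence, and the same happens for every repeated stopword.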

So instead of working with indices, you can try an alternative way of replacing the stopwords:

tokenized_new = [ word if word not in stop_words else 'QQQQQ' for word in tokenized ]
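Note that this one-liner replaces every stopword in a single pass, but it does not record the positions; since the comment in your code says you need the stopword positions for later use, you would still have to collect them separately, as below.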

The problem is that .index does not return all the indices, so you will need something along the lines of what is mentioned in this other question:

stopword_pos_set = set() # creating set so that index is not added twice
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w.lower() in stop_words: 
        indices = [i for i, x in enumerate(tokenized) if x == w]
        stopword_pos_set.update(indices)

stopword_pos = list(stopword_pos_set) # convert to list

In the above, I create stopword_pos_set so that the same index is not added twice. With a plain list the replacement step would merely assign the same value twice, but if you print a stopword_pos built without a set you will see the duplicate values.
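Here is a hypothetical mini-example (tokens made up for illustration) showing the duplicates a plain list would accumulate:

tokens = ['is', 'a', 'is']
positions = []
for w in tokens:
    if w in {'is', 'a'}:
        positions.extend(i for i, x in enumerate(tokens) if x == w)
print(positions)  # [0, 2, 1, 0, 2] -- the indices of 'is' are collected twice, once per visit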

One suggestion: in the code above I changed the check to if w.lower() in stop_words: so that the stopword lookup is case-insensitive; otherwise 'This' and 'this' would be treated as different words.
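You can see why with a quick check against the NLTK list (its entries are all lowercase):

print('This' in stop_words)          # False
print('This'.lower() in stop_words)  # True -- 'this' is in the stopword list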

Another suggestion is to add several items to stop_words in one go with the .update method, stop_words.update([".", ","]), instead of calling .add repeatedly.


Putting it all together, you can try it like this:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

ex_text='This is an example list that has no special keywords to sum up the list, but it will do. Another list is a very special one this I like very much.'

tokenized = word_tokenize(ex_text)
stop_words = set(stopwords.words('english'))
stop_words.update([".", ","])  #Since I do not need punctuation, I added . and ,

stopword_pos_set = set()
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w.lower() in stop_words: 
        indices = [i for i, x in enumerate(tokenized) if x == w]
        stopword_pos_set.update(indices)

stopword_pos = sorted(stopword_pos_set) # set to sorted list of positions

# Replacing stopwords with "QQQQQ"
for pos in stopword_pos:
    tokenized[pos] = 'QQQQQ'

print(tokenized)
print(stopword_pos)
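As a side note, if the positions and the replacement are all you need, a single pass with enumerate sidesteps .index entirely. A minimal sketch, assuming a freshly tokenized list and the same stop_words set as above:

stopword_pos = []
for i, w in enumerate(tokenized):
    if w.lower() in stop_words:
        stopword_pos.append(i)   # enumerate hands us each position directly
        tokenized[i] = 'QQQQQ'

print(tokenized)
print(stopword_pos)

Because every position comes straight from enumerate, there are no duplicates to de-duplicate and nothing to sort.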