NLTK Replacing the stopwords
I am using NLTK to replace all stopwords with the string "QQQQQ". The problem is that if the input text (the one I remove the stopwords from) contains more than one sentence, the replacement stops working correctly.
I have the following code:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

ex_text='This is an example list that has no special keywords to sum up the list, but it will do. Another list is a very special one this I like very much.'
tokenized=word_tokenize(ex_text)
stop_words=set(stopwords.words('english'))
stop_words.add(".") #Since I do not need punctuation, I added . and ,
stop_words.add(",")
stopword_pos=[]
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w in stop_words:
        stopword_pos.append(tokenized.index(w))
# Replacing stopwords with "QQQQQ"
for i in range(len(stopword_pos)):
    tokenized[stopword_pos[i]]='QQQQQ'
print(tokenized)
The code gives the following output:
['This', 'QQQQQ', 'QQQQQ', 'example', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'special', 'keywords', 'QQQQQ', 'sum', 'QQQQQ', 'QQQQQ', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'Another', 'list', 'is', 'QQQQQ', 'QQQQQ', 'special', 'one', 'QQQQQ', 'I', 'like', 'very', 'much', '.']
You may notice that it does not replace stopwords such as 'is' and '.' (I added the period to the set because I do not want punctuation). To be precise, 'is' and '.' are replaced in the first sentence, but the 'is' and '.' in the second sentence are not.
Another strange thing happens: when I print stopword_pos, I get the following output:
[0, 1, 2, 5, 6, 7, 10, 12, 13, 15, 16, 17, 18, 19, 20, 1, 24, 25, 0, 29, 25, 20]
You may notice that the numbers appear to be in ascending order, but then suddenly there is a '1' after the '20', where a stopword position should be. There is also a '0' after the '29' and a '20' after the '25'. Maybe that hints at what the problem is.
So, the problem is that after the first sentence, the stopwords are not replaced with 'QQQQQ'. Why is that?
Any pointer in the right direction would be greatly appreciated; I have no idea how to fix this.
tokenized.index(w) gives you the index of the first occurrence of that item in the list.
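A quick sketch (with a made-up token list, not the asker's actual data) of why that matters: index() always reports the first occurrence, so repeated stopwords in later sentences map back to positions in the first sentence.

```python
# Hypothetical token list for illustration.
tokens = ['This', 'is', 'a', 'list', '.', 'Another', 'list', 'is', 'here', '.']

# index() always returns the FIRST occurrence, so the second 'is' and '.'
# are recorded at positions 1 and 4 again, instead of 7 and 9.
positions = [tokens.index(w) for w in tokens if w in ('is', '.')]
print(positions)  # [1, 4, 1, 4]
```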
So, instead of working with indices, you could try an alternative way of replacing the stopwords:
tokenized_new = [ word if word not in stop_words else 'QQQQQ' for word in tokenized ]
The problem is that .index does not return all the indices, so you will need something similar to what is mentioned in this other question:
stopword_pos_set = set() # creating set so that index is not added twice
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w.lower() in stop_words:
        indices = [i for i, x in enumerate(tokenized) if x == w]
        stopword_pos_set.update(indices)
stopword_pos = list(stopword_pos_set) # convert to list
Above, I created stopword_pos_set so that the same index is not added twice; re-adding a value to a set has no effect, whereas if you printed stopword_pos built without the set, you would see duplicate values.
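A minimal illustration of that deduplication, with made-up indices:

```python
stopword_pos_set = set()
stopword_pos_set.update([1, 4])
stopword_pos_set.update([1, 4])  # re-adding the same indices has no effect
print(sorted(stopword_pos_set))  # [1, 4]

stopword_pos_list = []
stopword_pos_list.extend([1, 4])
stopword_pos_list.extend([1, 4])  # a plain list keeps the duplicates
print(stopword_pos_list)          # [1, 4, 1, 4]
```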
One suggestion: in the code above, I changed the check to if w.lower() in stop_words: so that the stopword lookup is case-insensitive; otherwise 'This' would not match 'this'.
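For example, since the NLTK stopword list contains only lowercase entries, a membership test on a capitalized token fails unless you lowercase it first (illustrated with a hand-written set so it runs without the NLTK data):

```python
# Stand-in for stopwords.words('english'), which is all lowercase like this.
stop_words = {'this', 'is', 'a'}

print('This' in stop_words)          # False -- case-sensitive miss
print('This'.lower() in stop_words)  # True
```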
Another suggestion is to use the .update method to add multiple items to stop_words at once, as in stop_words.update([".", ","]), instead of calling .add multiple times.
You can try the following:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

ex_text = 'This is an example list that has no special keywords to sum up the list, but it will do. Another list is a very special one this I like very much.'
tokenized = word_tokenize(ex_text)
stop_words = set(stopwords.words('english'))
stop_words.update([".", ","]) #Since I do not need punctuation, I added . and ,
stopword_pos_set = set()
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w.lower() in stop_words:
        indices = [i for i, x in enumerate(tokenized) if x == w]
        stopword_pos_set.update(indices)
stopword_pos = sorted(stopword_pos_set) # set to sorted list
# Replacing stopwords with "QQQQQ"
for i in range(len(stopword_pos)):
    tokenized[stopword_pos[i]] = 'QQQQQ'
print(tokenized)
print(stopword_pos)
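If you only need the replaced tokens and not the positions, both suggestions also fold into the earlier one-liner (a sketch with a hand-written stopword set so it runs without the NLTK data):

```python
# Stand-in for the NLTK stopword set plus the punctuation entries.
stop_words = {'this', 'is', 'a', '.'}
tokenized = ['This', 'is', 'a', 'list', '.']

# Case-insensitive check, no index bookkeeping needed.
tokenized_new = ['QQQQQ' if w.lower() in stop_words else w for w in tokenized]
print(tokenized_new)  # ['QQQQQ', 'QQQQQ', 'QQQQQ', 'list', 'QQQQQ']
```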