Error when extract noun-phrases from the training corpus and remove stop words using NLTK
I am new to both Python and NLTK. I need to extract noun phrases from a corpus and then remove the stop words using NLTK. I have written the code below, but it still raises an error. Can anyone help me fix this, or recommend a better solution? Thanks.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
docid='19509'
title='Example noun-phrase and stop words'
print('Document id:'),docid
print('Title:'),title
#list noun phrase
content='This is a sample sentence, showing off the stop words filtration.'
is_noun = lambda pos: pos[:2] == 'NN'
tokenized = nltk.word_tokenize(content)
nouns = [word for (word,pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print('All Noun Phrase:'),nouns
#remove stop words
stop_words = set(stopwords.words("english"))
example_words = word_tokenize(nouns)
filtered_sentence = []
for w in example_words:
    if w not in stop_words:
        filtered_sentence.append(w)
print('Without stop words:'),filtered_sentence
I get the following error:
Traceback (most recent call last):
  File "C:\Users\User\Desktop\NLP\stop_word.py", line 20, in <module>
    example_words = word_tokenize(nouns)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
You are getting this error because the function word_tokenize expects a string as its argument, and you are giving it a list of strings.
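For illustration, here is a minimal sketch of the difference (assuming the NLTK punkt tokenizer data is installed; the exact error message varies across Python and NLTK versions):

from nltk.tokenize import word_tokenize

# Tokenizing a plain string works:
print(word_tokenize('a sample sentence'))   # ['a', 'sample', 'sentence']

# Tokenizing a list of strings fails, reproducing the TypeError above:
# word_tokenize(['a', 'sample', 'sentence'])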
As far as I understand what you are trying to achieve, you do not need tokenization at this point. By the time you reach print('All Noun Phrase:'),nouns, you already have all the nouns of the sentence. To remove the stop words, you can use:
### remove stop words ###
stop_words = set(stopwords.words("english"))
# find the nouns that are not in the stopwords
nouns_without_stopwords = [noun for noun in nouns if noun not in stop_words]
# your sentence is now clear
print('Without stop words:',nouns_without_stopwords)
Of course, in this case the result is the same as nouns, since none of the nouns is a stop word.
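Putting it together, here is a minimal end-to-end sketch of the corrected script (a sketch written for Python 3, assuming the punkt, averaged_perceptron_tagger, and stopwords resources have been downloaded with nltk.download()):

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

content = 'This is a sample sentence, showing off the stop words filtration.'

# POS-tag the tokens and keep only words whose tag starts with 'NN' (nouns)
is_noun = lambda pos: pos[:2] == 'NN'
tokenized = word_tokenize(content)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print('All nouns:', nouns)

# nouns is already a list of words, so filter it directly, with no second tokenization
stop_words = set(stopwords.words('english'))
nouns_without_stopwords = [noun for noun in nouns if noun not in stop_words]
print('Without stop words:', nouns_without_stopwords)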
Hope this helps.