spaCy is_stop doesn't identify stop words?
When I use spaCy to identify stop words, it doesn't work with the en_core_web_lg model, but it does work with en_core_web_sm. Is this a bug, or am I doing something wrong?
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp(u'The cat ran over the hill and to my lap')
# Print each token next to its is_stop flag
for word in doc:
    print(f' {word} | {word.is_stop}')
Result:
The | False
cat | False
ran | False
over | False
the | False
hill | False
and | False
to | False
my | False
lap | False
However, when I change this line to use the en_core_web_sm model, I get different results:
nlp = spacy.load('en_core_web_sm')
The | False
cat | False
ran | False
over | True
the | True
hill | False
and | True
to | True
my | True
lap | False
Try from spacy.lang.en.stop_words import STOP_WORDS, and then you can explicitly check whether each word is in the set:
from spacy.lang.en.stop_words import STOP_WORDS
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp(u'The cat ran over the hill and to my lap')
for word in doc:
    # Have to convert Token type to String, otherwise types won't match
    print(f' {word} | {str(word) in STOP_WORDS}')
Output:
The | False
cat | False
ran | False
over | True
the | True
hill | False
and | True
to | True
my | True
lap | False
This looks like a bug to me. However, this approach also gives you the flexibility to add words to the STOP_WORDS set if you need to.
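As a rough sketch of that flexibility: the membership check above is case-sensitive, which is why "The" comes out False. The snippet below lower-cases each word before the lookup and adds one custom entry. The names custom_stop_words and is_stop are made up for illustration, "hill" is an arbitrary extra word, and the whitespace split stands in for spaCy's tokenizer so that no model is needed. The except branch is only a tiny fallback subset so the sketch still runs without spaCy installed.

```python
try:
    from spacy.lang.en.stop_words import STOP_WORDS
except ImportError:
    # Fallback so the sketch runs without spaCy; tiny illustrative subset only.
    STOP_WORDS = {"the", "over", "and", "to", "my"}

# Copy instead of mutating the shared set; "hill" is an arbitrary custom addition.
custom_stop_words = set(STOP_WORDS) | {"hill"}


def is_stop(text, stop_words=custom_stop_words):
    """Case-insensitive membership test against a stop word set."""
    return text.lower() in stop_words


sentence = "The cat ran over the hill and to my lap"
# Whitespace split is an assumption here; with spaCy you would iterate over the Doc.
for word in sentence.split():
    print(f"{word} | {is_stop(word)}")
```

With this check, "The" and "the" are treated the same, and "hill" counts as a stop word only because it was added by hand.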
The problem you are running into is a documented bug. The suggested workaround is as follows:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_lg')
# Flag each stop word, its capitalized first letter, and its upper-case form
for word in STOP_WORDS:
    for w in (word, word[0].capitalize(), word.upper()):
        lex = nlp.vocab[w]
        lex.is_stop = True
doc = nlp(u'The cat ran over the hill and to my lap')
for word in doc:
    print('{} | {}'.format(word, word.is_stop))
Output:
The | False
cat | False
ran | False
over | True
the | True
hill | False
and | True
to | True
my | True
lap | False
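Note that "The" is still False in this output. A likely reason is that word[0].capitalize() returns only the first character (for example "T"), so the title-cased form of each stop word is never flagged. A minimal sketch of building the full case variants follows; case_variants is a hypothetical helper name, not part of spaCy.

```python
def case_variants(word):
    """Return the lower-case, title-case and upper-case forms of a word."""
    return (word.lower(), word.capitalize(), word.upper())


# Compare the helper with the expression used in the workaround.
print(case_variants("the"))
print("the"[0].capitalize())
```

Iterating over case_variants(word) in the inner loop of the workaround should then also flag forms such as "The", assuming a spaCy version where is_stop is stored per lexeme as in the code above.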