Sentence tokenization in spaCy is bad (?)
Why is the sentence splitter/tokenizer in spaCy so bad? nltk seems to handle this fine. Here is my short experiment:
import spacy
nlp = spacy.load('fr')
import nltk
text_fr = u"Je suis parti a la boulangerie. J'ai achete trois croissants. C'etait super bon."
nltk.sent_tokenize(text_fr)
# [u'Je suis parti a la boulangerie.',
# u"J'ai achete trois croissants.",
# u"C'etait super bon."]
doc = nlp(text_fr)
for s in doc.sents: print(s)
# Je suis parti
# a la boulangerie. J'ai
# achete trois croissants. C'
# etait super bon.
I notice the same behavior with English. For this text:
text = u"I went to the library. I did not know what book to buy, but then the lady working there helped me. It was cool. I discovered a lot of new things."
I get with spacy (after `nlp = spacy.load('en')`):
I
went to the library. I
did not know what book to buy, but
then the lady working there helped me. It was cool. I discovered a
lot of new things.
Compare with nltk, which looks fine:
[u'I went to the library.',
u'I did not know what book to buy, but then the lady working there helped me.',
u'It was cool.',
u'I discovered a lot of new things.']
I don't know how I missed it, but it turns out I was using an old version of spaCy (v0.100). After installing the latest spaCy (v2.0.4), sentence segmentation is now much more coherent.
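For context: old spaCy derived sentence boundaries from its dependency parse, so a weak parse produced the broken splits above, whereas a simple punctuation rule already handles text like this. Below is a minimal, rule-based sketch of such a splitter (the function `naive_sentencize` is my own toy illustration, not a spaCy or nltk API):

```python
import re

def naive_sentencize(text):
    """Toy rule-based sentence splitter: break after ., ! or ?
    when followed by whitespace. Roughly the idea behind a
    rule-based sentencizer, not spaCy's actual implementation."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

text_fr = ("Je suis parti a la boulangerie. "
           "J'ai achete trois croissants. C'etait super bon.")
print(naive_sentencize(text_fr))
# ['Je suis parti a la boulangerie.',
#  "J'ai achete trois croissants.",
#  "C'etait super bon."]
```

This matches the nltk output above for simple text; of course it breaks on abbreviations like "Dr.", which is why statistical segmenters exist.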