spaCy sentence tokenization error on Hebrew
I am trying to use spaCy sentence tokenization on Hebrew:
import spacy
nlp = spacy.load('he')
doc = nlp(text)  # text is a Hebrew string
sents = list(doc.sents)
I get:
Warning: no model found for 'he'
Only loading the 'he' tokenizer.
Traceback (most recent call last):
...
sents = list(doc.sents)
File "spacy/tokens/doc.pyx", line 438, in __get__ (spacy/tokens/doc.cpp:9707)
    raise ValueError(
ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation: https://spacy.io/docs/usage
What should I do?
spaCy's Hebrew coverage is currently quite minimal. It only has word tokenization for Hebrew, which roughly splits on whitespace with some extra rules and exceptions. The sentence tokenization/boundary detection that you want requires a more sophisticated grammatical parse of the sentence in order to determine where one sentence ends and another begins. Such models require a large amount of labeled training data, so they are available for a smaller number of languages than tokenization is (list here).
The initial message tells you that it can do tokenization, which does not require a model; the error you then get says that without a model it cannot split sentences, do NER or POS tagging, etc.
You might look at this list for other resources for Hebrew NLP. If you find enough labeled data in the right format and you're feeling ambitious, you could train your own Hebrew spaCy model using the overview described here.
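As a possible workaround, a sketch assuming a newer spaCy version (v3+, where pipeline components are added by name): the rule-based `sentencizer` component splits sentences on punctuation alone, so it works without a dependency parser or any trained model. The Hebrew sample string here is just an illustrative placeholder.

```python
# Sketch: punctuation-based sentence splitting for Hebrew without a
# trained model, using spaCy's rule-based "sentencizer" component.
# Assumes spaCy v3+; no Hebrew statistical model is required.
import spacy

nlp = spacy.blank("he")       # tokenizer-only Hebrew pipeline
nlp.add_pipe("sentencizer")   # rule-based boundaries on '.', '!', '?', etc.

doc = nlp("שלום עולם. מה שלומך?")
sents = [sent.text for sent in doc.sents]
print(sents)
```

This will be less accurate than a real parse (it cannot handle abbreviations or unpunctuated boundaries), but it avoids the `ValueError` entirely because `doc.sents` no longer depends on the dependency parser.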