Splitting text into sentences: NLTK vs spaCy

Question

I want to split text into sentences.

Searching Stack Overflow, I found:

Using NLTK

from nltk.tokenize import sent_tokenize

# sent_tokenize needs NLTK's Punkt tokenizer data, downloaded beforehand via nltk.download
text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)

Using spaCy

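from spacy.lang.en import English # updated

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
# spaCy v2 API; in spaCy v3 this becomes nlp.add_pipe('sentencizer')
nlp.add_pipe(nlp.create_pipe('sentencizer')) # updated
doc = nlp(raw_text)
# sent.text works in both spaCy v2 and v3; sent.string was removed in v3
sentences = [sent.text.strip() for sent in doc.sents]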

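My question is what happens behind the scenes in spaCy that makes it necessary to do this differently, with a so-called create_pipe. Sentences are important for training your own word embeddings for NLP. There should be a reason why spaCy does not include a sentence tokenizer directly out of the box.

Thanks.

NOTE: Be aware that a simple .split('.') does not work. The text contains several decimal numbers and other kinds of tokens that contain '.'.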
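
To illustrate the note above, here is a minimal sketch of why a plain split on '.' falls short. The sample text is made up for illustration and does not come from the original question.

```python
# Hypothetical sample text containing a decimal number and an abbreviation.
text = "The price rose 3.5 percent. Mr. Smith was pleased."

# Naive approach: split on every period and drop empty fragments.
naive = [part.strip() for part in text.split(".") if part.strip()]

print(naive)
# The decimal "3.5" and the abbreviation "Mr." are both cut apart,
# so two sentences come out as four fragments.
```

This is the kind of failure that a dedicated sentence tokenizer (NLTK's sent_tokenize or a spaCy pipeline component) is meant to avoid.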

Answer 1

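The processing pipeline in spaCy has a modular setup; more information is available here: https://spacy.io/usage/processing-pipelines. You define which parts you want by defining the pipeline. Some use cases may not need sentences at all, for example when you only want a bag-of-words representation. I guess that is why a sentencizer is not always included automatically, but it is there if you need it.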
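
As a rough sketch of that modular setup, the snippet below starts from an empty English pipeline and adds only the sentencizer. It assumes spaCy v3, where add_pipe takes the component name as a string; in spaCy v2 the call looks like the one in the question.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
names_before = list(nlp.pipe_names)

# Opt in to sentence splitting by adding the rule-based sentencizer (spaCy v3 syntax).
nlp.add_pipe("sentencizer")
names_after = list(nlp.pipe_names)

doc = nlp("Hello, world. Here are two sentences.")
sentences = [sent.text.strip() for sent in doc.sents]

print(names_before, names_after)
print(sentences)
```

Comparing the pipeline names before and after shows that sentence splitting is something you switch on, not something every pipeline carries by default.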

Note that English() is a very generic model. You can find some more useful pre-trained statistical models here: https://spacy.io/models/en

Answer 2

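By default, spaCy uses its dependency parser to do sentence segmentation, which requires loading a statistical model. The sentencizer is a rule-based sentence segmenter that you can use to define your own sentence segmentation rules without loading a model.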
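
One way to define your own rules is to pass the sentencizer a custom set of boundary characters. This is a minimal sketch assuming spaCy v3 and its punct_chars config option; the choice of characters and the sample text are hypothetical.

```python
import spacy

nlp = spacy.blank("en")

# Assumption for illustration: treat both "." and ";" as sentence boundaries.
nlp.add_pipe("sentencizer", config={"punct_chars": [".", ";"]})

doc = nlp("First clause; second clause. Third one.")
sentences = [sent.text.strip() for sent in doc.sents]

print(sentences)
```

No statistical model is loaded here; the split is driven purely by the listed characters.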

If you don't mind leaving the parser activated, you can use the following code:

import spacy
nlp = spacy.load('en_core_web_sm') # or whatever model you have installed
raw_text = 'Hello, world. Here are two sentences.'
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]
