Recognize newline (\n) in text as end of sentence in spaCy

Question

I want newlines in my text to be recognized as sentence ends. I tried passing a config to the nlp object like this:

import spacy

text = 'Guest Blogging\nGuest Blogging allows the user to collect backlinks'
nlp = spacy.load("en_core_web_lg")
config = {"punct_chars": ['\n']}
nlp.add_pipe("sentencizer", config=config)
for sent in nlp(text).sents:
    print('next sentence:')
    print(sent)

The output of this is:

next sentence:
Guest Blogging
Guest Blogging allows the user to collect backlinks

I don't understand why spaCy doesn't recognize the newline as a sentence end. My desired output is:

next sentence:
Guest Blogging
next sentence:
Guest Blogging allows the user to collect backlinks

Does anyone know how to achieve this?

Answer 1

The sentencizer does nothing here because the parser runs first and has already set every sentence boundary, and the sentencizer does not modify existing sentence boundaries.

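The sentencizer with \n is only the right choice if you know that your input text has exactly one sentence per line. Otherwise, a custom component that adds sentence starts after newlines (but does not set all sentence boundaries) is probably what you want.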
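For the one-sentence-per-line case, here is a minimal sketch of the sentencizer working on its own. It assumes a blank English pipeline (`spacy.blank("en")`) so that no parser sets boundaries first; that choice is mine for illustration, not something from the question.

```python
import spacy

# Assumption: a blank pipeline, so the sentencizer is the only component
# that sets sentence boundaries. With a trained pipeline, loading it with
# spacy.load("en_core_web_lg", exclude=["parser"]) should play the same role.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer", config={"punct_chars": ["\n"]})

text = "Guest Blogging\nGuest Blogging allows the user to collect backlinks"

# The newline is its own whitespace token and stays attached to the end of
# the first sentence, so strip it before printing.
sentences = [sent.text.strip() for sent in nlp(text).sents]
for sentence in sentences:
    print("next sentence:")
    print(sentence)
```

Note that without the parser you lose the dependency parse, and that a run of several newlines is tokenized as a single whitespace token, so it would probably need its own entry in punct_chars.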

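If you want to set some custom sentence boundaries before the parser runs, you need to make sure your custom component is added before the parser in the pipeline: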

nlp.add_pipe("my_component", before="parser")

Your custom component would set token.is_sent_start = True for the token right after each newline and leave all other tokens unmodified.

Check out the second example here: https://spacy.io/usage/processing-pipelines#custom-components-simple
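A minimal sketch of such a component, under some assumptions: the component name `newline_sentence_starts` and the helper `add_newline_component` are made up for illustration, and the demo runs on a blank English pipeline so it works without downloading a model. With a trained pipeline such as en_core_web_lg, the helper places the component before the parser.

```python
import spacy
from spacy.language import Language


@Language.component("newline_sentence_starts")  # hypothetical name
def newline_sentence_starts(doc):
    # Mark the token right after a newline token as a sentence start.
    # All other tokens are left untouched, so a parser that runs later
    # can still decide the remaining boundaries.
    for token in doc[:-1]:
        if "\n" in token.text:
            doc[token.i + 1].is_sent_start = True
    return doc


def add_newline_component(nlp):
    # Hypothetical helper: put the component before the parser if the
    # pipeline has one, otherwise append it at the end.
    if "parser" in nlp.pipe_names:
        nlp.add_pipe("newline_sentence_starts", before="parser")
    else:
        nlp.add_pipe("newline_sentence_starts")
    return nlp


# Assumption: blank pipeline for the demo; swap in
# spacy.load("en_core_web_lg") to combine this with the parser.
nlp = add_newline_component(spacy.blank("en"))
text = "Guest Blogging\nGuest Blogging allows the user to collect backlinks"
for sent in nlp(text).sents:
    print("next sentence:")
    print(sent.text.strip())
```

The component has to run before the parser because setting is_sent_start on a doc that has already been parsed is not allowed.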
