Python spaCy sentence splitter
I would like to use spaCy to extract sentences from a text.
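from spacy.lang.en import English

nlp = English()  # just the language with no model
# spaCy v2 API; in spaCy v3 this is a single call: nlp.add_pipe("sentencizer")
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)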
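Is it possible to make the sentence splitter's rules more robust, for example by never splitting after an abbreviation such as "no."?
Assume, of course, that I have a whole set of very specialized and unusual abbreviations.
How would you proceed?
You can write a custom function that overrides the default behavior with a rule-based approach to sentence splitting. For example: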
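import spacy

text = "The formula is no. 45. This num. represents the chemical properties."
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("Before:", [sent.text for sent in doc.sents])

def set_custom_boundaries(doc):
    # Abbreviations after which a period must not end the sentence
    pattern_a = ['no', 'num']
    # Stop two tokens early so doc[token.i + 2] always exists
    for token in doc[:-2]:
        if token.text in pattern_a and doc[token.i + 1].text == '.':
            doc[token.i + 2].is_sent_start = False
    return doc

# Run before the parser so it respects the preset boundaries.
# This is the spaCy v2 call style; spaCy v3 expects the function to be
# registered with @Language.component and added by name.
nlp.add_pipe(set_custom_boundaries, before="parser")
doc = nlp(text)
print("After:", [sent.text for sent in doc.sents])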
This gives you the desired sentence split:
Before: ['The formula is no.', '45.', 'This num.', 'represents the chemical properties.']
After: ['The formula is no. 45.', 'This num. represents the chemical properties.']
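Since the question mentions a whole set of specialized abbreviations, it may help to isolate the rule from the pipeline so it can be checked against a larger list before wiring it in. The following is only a sketch of that idea in plain Python, operating on token strings rather than a spaCy Doc. The names ABBREVIATIONS and suppressed_starts are hypothetical and not part of spaCy, and the case-insensitive match is an assumption that differs from the exact-match check above.

```python
# Hypothetical helper, not part of spaCy: it works on plain token strings.
# Assumed abbreviation list; extend it with domain-specific entries.
ABBREVIATIONS = {"no", "num"}


def suppressed_starts(tokens, abbreviations=ABBREVIATIONS):
    """Return the indices of tokens that must not start a sentence."""
    suppressed = []
    # Stop two tokens before the end so tokens[i + 2] always exists.
    for i in range(len(tokens) - 2):
        # Assumption: match case-insensitively, so "No" and "no" both count.
        if tokens[i].lower() in abbreviations and tokens[i + 1] == ".":
            suppressed.append(i + 2)
    return suppressed


tokens = ["The", "formula", "is", "no", ".", "45", ".",
          "This", "num", ".", "represents"]
print(suppressed_starts(tokens))  # prints [5, 10]
```

Inside a pipeline component such as set_custom_boundaries, each returned index i would be applied as doc[i].is_sent_start = False. Note that the case-insensitive match also suppresses a split after a sentence-initial "No.", which may or may not be wanted for a given corpus.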