NLP: Create a spaCy Doc object based on a delimiter, or combine multiple Doc objects into a single object

I am trying to create a spaCy Doc object (spacy.tokens.doc.Doc) using the make_doc() function. This is what I did:

import spacy
nlp = spacy.load('en')

a = nlp.make_doc("Sam, Software Engineer")
print(list(a)) # [Sam, ,, Software, Engineer]

But the result I want is:

print(list(a)) # [Sam, Software Engineer]

Is there a way to create a spaCy Doc object based on a delimiter (a comma, in my case)? Or is there a way to combine two spaCy Doc objects into a single object? For example:

a = nlp.make_doc("Sam")
b = nlp.make_doc("Software Engineer")
# c = <combine a and b into a single Doc object> (pseudocode)
print(list(c)) # [Sam, Software Engineer]

You can build the document with the Doc class after splitting the string on commas:

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Sam, Software Engineer"

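# Split on the comma, strip surrounding whitespace, and skip empty pieces
words_t = [t.strip() for t in text.split(',') if t.strip()]
# One flag per word: True if the word is followed by a space in a.text
whitespaces_t = [i < len(words_t) - 1 for i in range(len(words_t))]
a = spacy.tokens.Doc(nlp.vocab, words=words_t, spaces=whitespaces_t)
print(list(a))
# => [Sam, Software Engineer]

The words_t line collects the words, so each comma-separated piece becomes a single token. The whitespaces_t line builds the list of booleans that Doc expects in spaces: one flag per word, saying whether that word is followed by a space (not whether a space precedes it). Here every word except the last gets a trailing space, so a.text is "Sam Software Engineer". The comma itself is dropped, so the original string cannot be recovered from the Doc.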
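
For reuse, the same idea can be wrapped in a few small helpers. This is only a minimal sketch, not part of spaCy: the names split_on_delimiter, doc_from_delimited and combine_as_tokens are hypothetical, and it assumes that each delimited piece (or each input Doc) should become exactly one token, with tokens joined by single spaces. The last helper is one possible take on the second half of the question: it turns every input Doc into a single token of a new Doc.

```python
def split_on_delimiter(text, delimiter=","):
    """Return (words, spaces) in the form accepted by spacy.tokens.Doc.

    Hypothetical helper: each non-empty delimited piece becomes one word.
    """
    # Strip whitespace around each piece and drop empty pieces
    # (e.g. from a trailing or doubled delimiter).
    words = [piece.strip() for piece in text.split(delimiter) if piece.strip()]
    # spaces[i] is True when words[i] is followed by a space in doc.text;
    # every word except the last gets one.
    spaces = [i < len(words) - 1 for i in range(len(words))]
    return words, spaces


def doc_from_delimited(nlp, text, delimiter=","):
    """Build a Doc whose tokens are the delimited pieces of `text`."""
    from spacy.tokens import Doc  # imported lazily; requires spaCy

    words, spaces = split_on_delimiter(text, delimiter)
    return Doc(nlp.vocab, words=words, spaces=spaces)


def combine_as_tokens(nlp, docs):
    """Build a new Doc in which each input Doc becomes a single token."""
    from spacy.tokens import Doc  # imported lazily; requires spaCy

    # Assumption: the full text of each input Doc is used as one word.
    words = [doc.text.strip() for doc in docs if doc.text.strip()]
    spaces = [i < len(words) - 1 for i in range(len(words))]
    return Doc(nlp.vocab, words=words, spaces=spaces)
```

With nlp loaded as above, list(doc_from_delimited(nlp, "Sam, Software Engineer")) and list(combine_as_tokens(nlp, [nlp.make_doc("Sam"), nlp.make_doc("Software Engineer")])) should both contain the two tokens Sam and Software Engineer. spaCy v3 also provides Doc.from_docs for concatenating Docs, but it keeps the original tokens of each Doc, so "Software Engineer" would stay as two tokens there.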