Difference between spaCy's (v3.0) `nlp.make_doc(text)` and `nlp(text)`? Why should we use `nlp.make_doc(text)` when training?

I know we should create Example objects and pass them to the nlp.update() method. According to the example in the docs, we have:

for raw_text, entity_offsets in train_data:
    doc = nlp.make_doc(raw_text)
    example = Example.from_dict(doc, {"entities": entity_offsets})
    nlp.update([example], sgd=optimizer)

And looking at the source code of the make_doc() method, it seems that all we do is tokenize the input text.
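In other words, unless I'm misreading the source, make_doc() appears to boil down to just running the tokenizer (a minimal check with a blank English pipeline):

```python
import spacy

nlp = spacy.blank("en")

# make_doc() only runs the tokenizer; no pipeline components are applied,
# so the resulting Doc carries no entity annotations.
doc = nlp.make_doc("I work at Berlin .")
assert len(doc.ents) == 0

# It appears to be equivalent to calling the tokenizer directly.
tokenized = nlp.tokenizer("I work at Berlin .")
assert [t.text for t in doc] == [t.text for t in tokenized]
```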

But the Example object is supposed to hold both the reference/"gold-standard" values and the predictions. How does that information end up in the Doc when we call nlp.make_doc()?

Also, when I try to get the predicted entity labels from the Example object (using a trained nlp pipeline), I don't get any entities (although I do if I use nlp(text)). And if I use nlp(text) instead of nlp.make_doc(text), training crashes with:

    ...
    >>> spacy.pipeline._parser_internals.ner.BiluoPushDown.set_costs()
    ValueError()

Feel free to ask these sorts of questions on the GitHub discussion board as well. And thanks for taking the time to think this through and read some of the code before asking. I wish every question were like this.

Anyway. I think the Example.from_dict() constructor might be getting in the way of understanding how the class works. Does this make it clearer for you?

from spacy.tokens import Doc, Span
from spacy.training import Example
import spacy
nlp = spacy.blank("en")

# Build a reference Doc object, representing the gold standard.
y = Doc(
    nlp.vocab,
    words=["I", "work", "at", "Berlin!", ".", "It", "'s", "a", "hipster", "bar", "."]
)
# There are other ways we could set up the Doc object, including just passing
# stuff into the constructor. I wanted to show modifying the Doc to set annotations.
ent_start = y.text.index("Berlin!")
assert ent_start != -1
ent_end = ent_start + len("Berlin!")
y.ents = [y.char_span(ent_start, ent_end, label="ORG")]
# Okay, so we have our gold-standard, aka reference aka y, Doc object.
# Now, at runtime we won't necessarily be tokenizing that input text that way.
# It's a weird entity. If we only learn from the gold tokens, we can never learn
# to tag this correctly, no matter how many examples we see, if the predicted tokens
# don't match this tokenization. Because we'll always be learning from "Berlin!" but
# seeing "Berlin", "!" at runtime. We'll have train/test skew. Since spaCy cares how
# it does on actual text, not just on the benchmark (which is usually run with 
# gold tokens), we want to train from samples that have the runtime tokenization. So
# the Example object holds a pair (x, y), where the x is the input.
x = nlp.make_doc(y.text)
example = Example(x, y)
# Show the aligned gold-standard NER tags. These should have the entity as B-ORG L-ORG.
print(example.get_aligned_ner())
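To make the pairing explicit: the Example keeps both Docs around, exposed as predicted (the x) and reference (the y). A small self-contained sketch of the same idea:

```python
import spacy
from spacy.tokens import Doc
from spacy.training import Example

nlp = spacy.blank("en")
# Gold-standard doc, with the entity given as per-token IOB strings.
y = Doc(nlp.vocab, words=["I", "work", "at", "Berlin!", "."],
        ents=["O", "O", "O", "B-ORG", "O"])
# Prediction-side doc: runtime tokenization, no annotations yet.
x = nlp.make_doc(y.text)
example = Example(x, y)
# The reference side keeps the gold entity; the predicted side has none.
assert [e.text for e in example.reference.ents] == ["Berlin!"]
assert len(example.predicted.ents) == 0
```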

Another piece of information that might explain this is that the pipeline components try to work with partial annotations, so you can have rules that preset some of the entities. That's what happens when you pass in a fully annotated Doc as x: the component takes those annotations as part of the input, and when it tries to work out the optimal parse, there's no valid sequence of actions left for the model to learn from. The usability of this situation could be improved.
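To illustrate the partial-annotation mechanism (a sketch, not the full training story): on the prediction side you can preset some entities as constraints and mark the remaining tokens as missing, which leaves a statistical NER component free to predict over them:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp.make_doc("I work at Berlin .")
# Preset one entity as a constraint, and mark all other tokens as
# "missing" rather than "outside", so an NER component could still
# predict entities over the remaining tokens.
berlin = Span(doc, 3, 4, label="ORG")
doc.set_ents([berlin], default="missing")
# Token 3 now starts an ORG entity; the other tokens have an empty
# IOB string (missing), not "O" (outside).
print([t.ent_iob_ for t in doc])
```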