Explanation/interpretation of the parameters in the spaCy config file

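I have a few questions about the parameters we define in the config.cfg file. Although spaCy's documentation does try to explain them, I find the explanations insufficiently descriptive, and the material is scattered across the docs, which makes it hard to find exactly what you need. This is especially true for spaCy v3 (unless I'm looking at the wrong section of the website), which is recent and so has very few questions and answers on the forums. I'm basically building a named entity recognition (NER) model with a transformer component. My questions are as follows: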

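  1. In the section below (same question for corpora.train), what is the difference between max_length and limit?

    For max_length, the docs say "Limitations on training document length".
    For limit, the docs say "Limitation on number of training examples".

    Aren't they more or less the same thing? I mean, I can limit the number of training examples by limiting the length of the document itself, right?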

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
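  2. In the snippet below, what is the meaning of one 'step'? I understand that max_steps = 0 means infinite steps. But how do I know how many such 'steps' make one epoch? Also, how many example sentences are covered in one such step?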
[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 10
max_steps = 0
eval_frequency = 200
frozen_components = []
before_to_disk = null
  3. How exactly is learn_rate modified during training in the snippet below? More specifically, what do total_steps and warmup_steps mean?
[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 200
initial_rate = 0.00005
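  4. Finally, in the CLI output of the training process, what exactly is this '#'? One of the GitHub discussions mentions that "The # column is the number of optimization steps (= batches processed)", but what exactly is one batch or 'optimization step'? If the training process shows me the scores after 200 such 'batches', how do I interpret that (as in, how many example sentences have been processed up to that point)?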

In the following part (same question for corpora.train also), what is the difference between max_length and limit? For max_length, the docs say "Limitations on training document length". For limit, the docs say "Limitation on number of training examples". Aren't they both more or less the same thing? I mean I can limit the number of training examples by limiting the document's length itself, right?

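These are different things, and you seem to be confused about what a "document" is. You can think of a "document" as a single object in spaCy. Different documents know nothing about each other. A document is based on a single string. Take ordinary Python strings as an example: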

["cat", "dog", "fish"] # this is three strings
["cat dog fish"] # this is one string

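You can see that "take three strings from the list" and "take strings no longer than three characters" are completely different things. The values in spaCy work the same way: limit caps how many documents are used, while max_length caps how long each document may be.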
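To make the distinction concrete, here is a minimal sketch in plain Python. It is not spaCy's reader code: the helper names apply_limit and apply_max_length are hypothetical, length is counted in whitespace-separated tokens purely for illustration, and over-long documents are simply dropped, which is a simplification. As in the config, 0 is treated as "no limit".

```python
def apply_limit(docs, limit=0):
    """Keep at most `limit` documents; 0 means keep them all (hypothetical helper)."""
    return docs if limit == 0 else docs[:limit]


def apply_max_length(docs, max_length=0):
    """Keep documents with at most `max_length` tokens; 0 means no cap (hypothetical helper)."""
    if max_length == 0:
        return docs
    return [d for d in docs if len(d.split()) <= max_length]


docs = ["cat dog fish bird", "cat", "dog fish"]

# limit counts documents, regardless of how long each one is.
limited = apply_limit(docs, limit=2)

# max_length looks at each document's own length, regardless of how many there are.
short_only = apply_max_length(docs, max_length=2)

print(limited)
print(short_only)
```

The two calls keep different documents from the same list, which is why the two settings are not interchangeable.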

In the below snippet, what is the meaning of one 'step'? I understand max_steps=0 means infinite steps. But how do I know how many such 'steps' make one epoch? Also how many example sentences are covered in 1 such step?

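One "step" is one "batch". A "batch" means running training on some examples and updating the model weights once. You control the batch size, so it can be any number of examples. One "epoch" is the time it takes for training to see every example once, so with 5 documents per batch and 30 training documents, 6 steps make one epoch.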
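The arithmetic can be sketched as below. This assumes every batch holds the same number of documents; if your config uses a batcher that sizes batches by word count instead, the number of documents per step varies and the result is only an estimate. The function name steps_per_epoch is hypothetical.

```python
import math


def steps_per_epoch(n_train_docs, docs_per_batch):
    """Number of weight updates needed to see every training doc once.

    Assumes a fixed number of docs per batch; the last batch may be smaller,
    hence the ceiling.
    """
    return math.ceil(n_train_docs / docs_per_batch)


steps = steps_per_epoch(n_train_docs=30, docs_per_batch=5)
print(steps)  # 6, matching the example in the text
```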

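spaCy doesn't necessarily know about "sentences" in training; docs are the basic unit of a batch. Your training examples may all be single sentences, but that isn't required.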

These terms aren't specific to spaCy; they're widely used in machine learning.

How exactly is the learn_rate being modified in the below snippet of code, during the training process? More specifically, what do total_steps and warmup_steps mean?

This comes from Thinc; see the docs there.

Quoting:

Generate a series, starting from an initial rate, and then with a warmup period, and then a linear decline. Used for learning rates.

Once total_steps is reached, the learning rate stops changing.
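As a rough illustration of that description, the sketch below reimplements the schedule as a standalone generator. It reflects my reading of how Thinc's warmup_linear behaves (a linear ramp from 0 up to initial_rate over warmup_steps, then a linear decline to 0 at total_steps), not the library's own code, so check the Thinc docs for the authoritative definition. The name warmup_linear_sketch and the numbers used are made up for readability.

```python
from itertools import islice


def warmup_linear_sketch(initial_rate, warmup_steps, total_steps):
    """Yield one learning rate per optimization step (illustrative sketch)."""
    step = 0
    while True:
        if step < warmup_steps:
            # Warmup: scale linearly from 0 up towards initial_rate.
            factor = step / max(1, warmup_steps)
        else:
            # Decline: scale linearly from initial_rate down to 0 at total_steps,
            # then stay at 0 for every later step.
            factor = max(0.0, (total_steps - step) / max(1.0, total_steps - warmup_steps))
        yield factor * initial_rate
        step += 1


# Small illustrative values so the shape is easy to read.
rates = list(islice(warmup_linear_sketch(0.001, warmup_steps=4, total_steps=12), 15))
for step, rate in enumerate(rates):
    print(step, rate)
```

In this sketch the rate peaks at initial_rate exactly when the warmup ends and is flat at zero from total_steps onward, which is the sense in which it "stops changing".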

Finally, in the CLI output of the training process, what exactly is this '#'? It was mentioned in one of the GitHub discussions that "The # column is the number of optimization steps (= batches processed)", but what exactly is this 1 batch or 'optimization step'? If the training process shows me the scores after 200 such 'batches', how do I interpret it (as in how many example sentences have been processed till that point)?

A step is one batch, the same as in #2. Batch size is expressed in documents, not sentences.
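To turn a value from the # column into a rough count of documents processed, something like the following works. It is only a sketch: it assumes a fixed number of documents per batch, and the figures (8 docs per batch, 400 training docs) are hypothetical, not taken from the question. The function names are hypothetical too.

```python
def docs_seen(step, docs_per_batch):
    """Approximate number of docs processed after `step` optimization steps."""
    return step * docs_per_batch


def epochs_completed(step, docs_per_batch, n_train_docs):
    """Approximate number of passes over the training data after `step` steps."""
    return docs_seen(step, docs_per_batch) / n_train_docs


# Hypothetical setup: 8 docs per batch, 400 training docs, scores reported at step 200.
seen = docs_seen(200, docs_per_batch=8)
epochs = epochs_completed(200, docs_per_batch=8, n_train_docs=400)
print(seen, epochs)
```

Since the unit is documents, the number of sentences processed depends on how many sentences each of your documents contains.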