Re-Training spaCy's NER v1.8.2 - Training Volume and Mix of Entity Types

I am (re-)training spaCy's named entity recognizer and have a few questions that I hope a more experienced researcher/practitioner can help me clear up:

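  1. If a few hundred examples are considered 'a good starting point', then what would be a reasonable number to aim for? Is 100 000 entities per label excessive?
  2. If I introduce a new label, is it best if the number of entities of that label is roughly the same (balanced) during training?
  3. Regarding mixing in 'examples of other entity types':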

    • Do I just add random known categories/labels to my training set, e.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG')], )?

    • Can I use the same text for various labels? E.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(55,64, 'COMMODITY')], )?

    • On a similar note, suppose I want spaCy to also recognize a second COMMODITY: could I then just use the same sentence and label a different region, e.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(69,80, 'COMMODITY')], )? Is that how it is supposed to be done?

    • What ratio between new and other (old) labels is considered reasonable?

Thanks

PS: I am using spaCy 1.8.2

on Ubuntu 16.04 with Python 2.7

For a full answer by Matthew Honnibal, check out issue 1054 on spaCy's GitHub page. Below are the points most relevant to my questions:

Question(Q) 1: If a few hundred examples are considered 'a good starting point', then what would be a reasonable number to aim for? Is 100 000 entity/label excessive?

Answer(A): Every machine learning problem will have a different examples/accuracy curve. You can get an idea for this by training with less data than you have, and seeing what the curve looks like. If you have 1,000 examples, then try training with 500, 750, etc, and see how that affects your accuracy.
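The subset experiment described above can be sketched roughly as follows. This is only a minimal illustration, not spaCy code: `learning_curve` and its `train_and_score` argument are hypothetical names, and `train_and_score` is assumed to be a function you write yourself that trains a fresh model on the given subset and returns its accuracy on a fixed held-out set.

```python
import random


def learning_curve(examples, train_and_score,
                   fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Train on growing subsets and return [(n_examples, score), ...].

    `train_and_score` is a hypothetical callable (an assumption of this
    sketch): it takes a list of examples, trains a fresh model on them and
    returns a score measured on a held-out set that never changes.
    """
    shuffled = list(examples)
    # Shuffle once with a fixed seed so each smaller subset is a prefix of
    # the next larger one and the run is repeatable.
    random.Random(seed).shuffle(shuffled)
    results = []
    for fraction in fractions:
        n = max(1, int(len(shuffled) * fraction))
        results.append((n, train_and_score(shuffled[:n])))
    return results
```

If the score is still climbing noticeably between the last two subset sizes, more annotated data is likely to help; if it has flattened out, additional examples of the same kind will probably add little.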

Q 2: If I introduce a new label, is it best if the number of entities of that label is roughly the same (balanced) during training?

A: There's trade-off between making the gradients too sparse, and making the learning problem too unrepresentative of what the actual examples will look like.

Q 3: Regarding mixing in 'examples of other entity types':

  • do I just add random known categories/labels to my training set?:

A: No, all entities in that text should be annotated, so the example above, ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG')], ), should be ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG'), (55,64, 'COMMODITY'), (69,80, 'COMMODITY')], )
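If the training data already exists as separate per-label examples like the ones in my question, they can be combined into fully annotated examples before training. The following is a small sketch in plain Python; `merge_annotations` is a hypothetical helper name, not part of spaCy, and it assumes examples are `(text, [(start, end, label), ...])` tuples as shown above.

```python
from collections import OrderedDict


def merge_annotations(examples):
    """Combine examples that share the same text into one example each.

    Hypothetical helper (not a spaCy function). Each input example is a
    (text, [(start, end, label), ...]) tuple.
    """
    merged = OrderedDict()
    for text, spans in examples:
        # A set drops exact duplicate spans for the same text.
        merged.setdefault(text, set()).update(spans)
    return [(text, sorted(spans)) for text, spans in merged.items()]


TEXT = ('The Business Standard published in its recent issue on '
        'crude oil and natural gas ...')

# The three partially annotated examples from the question.
partial = [
    (TEXT, [(4, 21, 'ORG')]),
    (TEXT, [(55, 64, 'COMMODITY')]),
    (TEXT, [(69, 80, 'COMMODITY')]),
]

train_data = merge_annotations(partial)

# Sanity check: print the text each character offset actually covers.
for text, spans in train_data:
    for start, end, label in spans:
        print('%s %r' % (label, text[start:end]))
```

Printing the slices is a cheap way to catch off-by-one offsets before they end up in the training set.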

  • can I use the same text for various labels?:

A: Not in the way the examples were given. See the previous answer.

  • what ratio between new and other (old) labels is considered reasonable?:

A: See the answer to Q 2.


PS: The double-quoted passages are direct quotes from the answers in the GitHub issue.