NLP data preparation and sorting for text-classification task

I have read many tutorials online and related threads on Stack Overflow, but one question is still unclear to me. Considering only the data-collection stage for multi-label training, which of the approaches below is better, and are both acceptable and valid?

  1. Hunt for 'pure' single-label examples at all costs.
  2. Allow every example to carry multiple labels.

For example, I have articles about war, politics, economy, and culture. Politics is usually tied to the economy, war is tied to politics, economic issues come up in culture articles, and so on. I could either strictly assign exactly one topic to each example and discard the ambiguous pieces, or assign two or three topics.

I will be training on this data with spaCy, with roughly 5-10,000 examples per topic.
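For what it's worth, spaCy's multi-label text categorizer expects each example to carry a 0/1 score for every topic, so both collection strategies end up in the same format; single-label data is just the special case where exactly one score is 1.0. A minimal sketch (the texts and topic names are invented for illustration):

```python
# Multi-label training examples in the shape spaCy's textcat_multilabel
# component consumes: every example scores EVERY topic, so an article can
# belong to one topic or to several.
train_data = [
    ("Parliament passed the new budget bill.",
     {"cats": {"WAR": 0.0, "POLITICS": 1.0, "ECONOMY": 1.0, "CULTURE": 0.0}}),
    ("The museum opened a retrospective of wartime photography.",
     {"cats": {"WAR": 1.0, "POLITICS": 0.0, "ECONOMY": 0.0, "CULTURE": 1.0}}),
]

# A strictly single-label example is the same structure with one 1.0 score.
for text, annotations in train_data:
    labels = [t for t, v in annotations["cats"].items() if v == 1.0]
    print(text, "->", labels)
```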

I would appreciate any explanation and/or links to relevant discussions.

You can try the OneVsAll / OneVsRest strategy. It lets you do both: accurately predict a single class, without having to strictly assign one label per example.

Also known as one-vs-all, this strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy for multiclass classification and is a fair default choice.

This strategy can also be used for multilabel learning, where a classifier is used to predict multiple labels for instance, by fitting on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise.

Link to the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
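The quoted strategy can be sketched end to end with scikit-learn; the toy corpus and topic labels below are invented for illustration, and `MultiLabelBinarizer` builds exactly the 2-d matrix the documentation describes (cell [i, j] is 1 if sample i has label j):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus: each article may carry one or several topic labels.
texts = [
    "troops advanced after the ceasefire talks collapsed",
    "the central bank raised interest rates again",
    "the minister defended the new tax law in parliament",
    "a film festival celebrated post-war cinema",
]
labels = [
    {"war", "politics"},
    {"economy"},
    {"politics", "economy"},
    {"culture", "war"},
]

# Turn the label sets into a binary indicator matrix, one column per topic.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One logistic-regression classifier is fitted per topic (one-vs-rest).
clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression()),
)
clf.fit(texts, Y)

# Predictions come back as the same kind of binary matrix;
# inverse_transform maps each row back to a set of topic labels.
pred = clf.predict(texts)
print(mlb.inverse_transform(pred))
```

With this setup an article can be assigned zero, one, or several topics at prediction time, which is exactly the flexibility the question asks about.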