Deep Learning - How to prepare the training data for a large classification set?

Question

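So I have a large set of classes (say 500 for now, and the number may grow over time). These classes can be thought of as different domain-specific rules.

Each rule has a specific type of text associated with it. My data looks like this: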

Some text regarding Rule 1 ------> Rule 1
Some other text for Rule 1 ------> Rule 1
Some other other text for Rule 1 -----> Rule 1

Text regarding Rule 2 ----> Rule 2
Some other text regarding Rule 2 ----> Rule 2

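You get the idea. I have a lot of text that needs to be classified into rules. One approach I started with is to use one-hot encoded data for the rule classification.

These are the steps I followed: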

1. Create a lexicon from all my Rule texts.
2. For each line of text, create an array of 0s (the size of the lexicon) and set the index to 1 for every lexicon word that appears in the line.
3. Create a one-hot encoded array (size = length(Rules)) with the index corresponding to the Rule set to 1.
4. Feed this data to TensorFlow.
5. Test it out. I get a prediction vector of size = length(Rules),
   which gives me 1 for the index corresponding to the Rule the text was classified
   into. I used tf.argmax()
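
Steps 1 to 3 above can be sketched roughly as follows. This is only a minimal illustration, assuming whitespace tokenization and a tiny made-up dataset; the helper names (`build_lexicon`, `encode_text`, `encode_label`) are hypothetical and do not come from the original post.

```python
import numpy as np

# Toy data: (text, rule index) pairs, made up for illustration only.
samples = [
    ("some text regarding rule one", 0),
    ("some other text for rule one", 0),
    ("text regarding rule two", 1),
]
num_rules = 2

def build_lexicon(texts):
    """Step 1: map every distinct word to a fixed index."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def encode_text(text, lexicon):
    """Step 2: binary vector with a 1 for each lexicon word present in the text."""
    vec = np.zeros(len(lexicon), dtype=np.float32)
    for w in text.lower().split():
        if w in lexicon:
            vec[lexicon[w]] = 1.0
    return vec

def encode_label(rule_index, num_rules):
    """Step 3: one-hot vector with a 1 at the rule's index."""
    vec = np.zeros(num_rules, dtype=np.float32)
    vec[rule_index] = 1.0
    return vec

lexicon = build_lexicon([t for t, _ in samples])
X = np.stack([encode_text(t, lexicon) for t, _ in samples])   # shape (num_samples, lexicon size)
Y = np.stack([encode_label(r, num_rules) for _, r in samples])  # shape (num_samples, num_rules)
```

Note that `Y` grows with the number of rules: every example carries a vector of length `num_rules`, which is what the question is worried about.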

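This has worked well so far. My question is whether this approach will still work when the number of classes grows to 1,000, 10,000, and so on. Do I still need to pass a one-hot encoded vector as the actual classification?

Is there an alternate way?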

Answer 1

Is there an alternate way?

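Yes, you can use a sparse representation. Your labels will be integers in the range [0, num_classes-1] rather than one-hot vectors, and you need to apply the tf.nn.sparse_softmax_cross_entropy_with_logits loss function.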
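
To illustrate why integer labels are sufficient, here is a small NumPy sketch (not the TensorFlow implementation) that computes softmax cross-entropy once from one-hot targets and once from integer labels. The function names and the logits are made up for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def dense_cross_entropy(logits, one_hot):
    """Cross-entropy with one-hot targets of shape (batch, num_classes)."""
    return -(one_hot * log_softmax(logits)).sum(axis=-1)

def sparse_cross_entropy(logits, labels):
    """Cross-entropy with integer labels of shape (batch,); no one-hot matrix is built."""
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])      # integers in [0, num_classes - 1]
one_hot = np.eye(3)[labels]    # the equivalent dense targets

dense_loss = dense_cross_entropy(logits, one_hot)
sparse_loss = sparse_cross_entropy(logits, labels)
```

Both losses are the same per example, but the sparse form stores one integer per label instead of a vector of length num_classes, which matters more as the class count grows.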

Answer 2

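Classification tasks with 1,000 classes (4,000 if you consider the full ILSVRC dataset) are common in image recognition (ILSVRC) and have been shown to work well given enough training data.

Even so, at least one paper shows a noticeable drop in classification accuracy on ILSVRC data when going from 1K to 4K classes with the same model design (97% -> 95%, possibly).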

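Face recognition research provides an example in which increasing the number of classes (along with the number of training examples) actually improved classification accuracy. They tested it on up to 10,000 distinct classes.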

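If you go beyond 10K, it is time to write a paper of your own.

One-hot encoding

The syntactic sugar below may save you from building one-hot vectors by hand, but in essence a one-hot vector will still exist as the input to the cross-entropy loss function. Syntactic sugar: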

tf.nn.sparse_softmax_cross_entropy_with_logits

or

import numpy as np

def to_one_hot(index, num_classes):
  # Vector of zeros with a single 1 at the class index.
  res = np.zeros(num_classes)
  res[index] = 1
  return res

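The one-hot vector is used in the cross-entropy loss to compute the classification error. The advantage of using this sparse target vector is that even when a training example is already classified correctly, with output [0.7, 0.1, 0.1, 0.1] -> 0, the loss is still computed against the vector [1.0, 0., 0., 0.]. This allows gradient updates even when they do not improve classification accuracy (i.e. when the training classification error is small, <1%, a plain classification error would produce gradients from only 100 or fewer examples).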
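
A small NumPy sketch of this point, using the output from the paragraph above. For softmax cross-entropy, the gradient with respect to the logits is the predicted probabilities minus the one-hot target, so it stays nonzero even though the example is already classified correctly. Treating [0.7, 0.1, 0.1, 0.1] as softmax probabilities is an assumption made for illustration.

```python
import numpy as np

probs = np.array([0.7, 0.1, 0.1, 0.1])    # model output, assumed to be softmax probabilities
target = np.array([1.0, 0.0, 0.0, 0.0])   # one-hot target for class 0

predicted_class = int(np.argmax(probs))                     # already the correct class
misclassified = predicted_class != int(np.argmax(target))   # a 0/1 error gives no signal here

loss = -np.sum(target * np.log(probs))    # cross-entropy, still positive
grad_wrt_logits = probs - target          # gradient of the loss w.r.t. the logits
```

The 0/1 error for this example is zero, yet the cross-entropy gradient still pushes the probability of class 0 up and the others down.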

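You can always put in more work: cluster the inputs, train a classifier for each cluster, and so on. It may or may not work for you. Here is an example where a similar approach actually improved accuracy: link. However, there seems to be no consensus on whether training should be made more complicated. With neural networks, it is probably a better idea to invest more time in model design than in data engineering and to try to let the network handle everything for you.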

classification

machine-learning

text-classification

deep-learning

tensorflow

one-hot-encoding