One Hot Encoding in Tensorflow

Question

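I have been following the TensorFlow walkthrough here to create my own categorical OHE layer. The suggested layer is below, and I have followed the earlier steps of the guide very closely: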

def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
  # Create a StringLookup layer which will turn strings into integer indices
  if dtype == 'string':
    index = preprocessing.StringLookup(max_tokens=max_tokens)
  else:
    index = preprocessing.IntegerLookup(max_tokens=max_tokens)

  # Prepare a Dataset that only yields our feature
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the set of possible values and assign them a fixed integer index.
  index.adapt(feature_ds)

  # Create a Discretization for our integer indices.
  encoder = preprocessing.CategoryEncoding(num_tokens=index.vocabulary_size())

  # Apply one-hot encoding to our indices. The lambda function captures the
  # layer so we can use them, or include them in the functional model later.
  return lambda feature: encoder(index(feature))

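However, the output does not match the guide. When my input to the layer is a list of n strings, instead of an output of shape (n, vocabulary size), I receive an output of shape (1, vocabulary size) in which multiple categories are incorrectly marked as "1". For example, with n=2 and vocabulary size=3, instead of the OHE [[1, 0, 0], [0, 1, 0]] I get [1, 1, 0].

My code is identical to the guide's, but it looks like the layer is "merging" the encodings of each element of my input. Is there a problem with the layer they provide, or can someone point me to what I could test?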

Answer 1

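By default, CategoryEncoding uses output_mode="multi_hot". In that mode the last dimension of the input is treated as a single sample, so the indices of all n strings are merged into one vector with a 1 for every category present. That is why you get an output of size (1, vocab_size). To get an OHE of size (n, vocab_size), make this change in your code: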

encoder = preprocessing.CategoryEncoding(num_tokens=index.vocabulary_size(), output_mode='one_hot')
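
To see the difference between the two modes in isolation, here is a minimal sketch. It assumes a TensorFlow version (2.6 or later) in which the layer is available as tf.keras.layers.CategoryEncoding, and the indices tensor is a made-up stand-in for what the lookup layer would return for two strings:

```python
import tensorflow as tf

# Hypothetical lookup output for n=2 input strings, vocabulary size 3.
indices = tf.constant([0, 1])

# Default mode: the whole 1-D input is treated as one sample.
multi_hot = tf.keras.layers.CategoryEncoding(num_tokens=3, output_mode="multi_hot")

# One-hot mode: every element of the input gets its own row.
one_hot = tf.keras.layers.CategoryEncoding(num_tokens=3, output_mode="one_hot")

merged = multi_hot(indices)
separate = one_hot(indices)

print("multi_hot:", merged.shape, merged.numpy())
print("one_hot:  ", separate.shape, separate.numpy())
```

The multi_hot result should show both categories set in a single vector, which matches the "merged" output described in the question, while the one_hot result has one row per input element.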
