Apply feature columns without tf.Estimator (Tensorflow 2.0.0-rc0)

Question

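The Tensorflow tf.Estimator and tf.feature_column docs describe in detail how to use feature columns together with an Estimator, e.g. in order to one-hot encode the categorical features in the dataset being used.

However, I would like to "apply" my feature columns directly to a tf.dataset that I create from a .csv file (with two columns: UserID, MovieID), without even defining a model or an Estimator. (Reason: I want to check what exactly is happening in my data pipeline, i.e. I'd like to be able to run a batch of samples through my pipeline and then see in the output how the features got encoded.)

This is what I have tried so far: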

import tensorflow as tf

column_names = ['UserID', 'MovieID']

user_col = tf.feature_column.categorical_column_with_hash_bucket(key='UserID', hash_bucket_size=1000)
movie_col = tf.feature_column.categorical_column_with_hash_bucket(key='MovieID', hash_bucket_size=1000)
feature_columns = [tf.feature_column.indicator_column(user_col), tf.feature_column.indicator_column(movie_col)]

feature_layer = tf.keras.layers.DenseFeatures(feature_columns=feature_columns)

def process_csv(line):
  fields = tf.io.decode_csv(line, record_defaults=[tf.constant([], dtype=tf.int32)]*2, field_delim=";")
  features = dict(zip(column_names, fields))

  return features 

ds = tf.data.TextLineDataset(csv_filepath)
ds = ds.map(process_csv, num_parallel_calls=4)
ds = ds.batch(10)
ds.map(lambda x: feature_layer(x))

However, the last line with the map call raises the following error:

ValueError: Column dtype and SparseTensors dtype must be compatible. key: MovieID, column dtype: <dtype: 'string'>, tensor dtype: <dtype: 'int32'>

I'm not sure what this error means... I also tried to define a tf.keras model containing only the feature_layer I defined and then run .predict() on my dataset, instead of using ds.map(lambda x: feature_layer(x)):

model = tf.keras.Sequential([feature_layer])
model.compile()
model.predict(ds)

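However, this results in exactly the same error as above. Does anybody have an idea what is going wrong? Is there maybe an easier way to achieve this?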

Answer 1

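Just found the issue: tf.feature_column.categorical_column_with_hash_bucket() takes an optional argument dtype, which is set to tf.dtypes.string by default. However, the data type of my columns is numerical (tf.dtypes.int32), so the column expected string tensors while decode_csv produced int32 tensors. Passing the matching dtype solved the issue (the same applies to the MovieID column):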

tf.feature_column.categorical_column_with_hash_bucket(key='UserID', hash_bucket_size=1000, dtype=tf.dtypes.int32)
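To get a feel for what the encoded output should look like once the dtype is fixed, here is a minimal plain-Python sketch of the idea behind a hash-bucket column wrapped in an indicator column. It is only an illustration, not TensorFlow's implementation: the md5-based stand_in_bucket hash is an assumption and will not give the same bucket ids as TensorFlow, the sample UserID/MovieID values are made up, and the column order in the real DenseFeatures output may differ.

```python
import hashlib

HASH_BUCKET_SIZE = 1000  # same bucket count as in the question
COLUMN_NAMES = ["UserID", "MovieID"]


def stand_in_bucket(value, num_buckets=HASH_BUCKET_SIZE):
    """Map a value to a bucket id in [0, num_buckets).

    Assumption: md5 is a stand-in hash for illustration only;
    TensorFlow uses its own hash and yields different bucket ids.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def indicator(bucket_id, num_buckets=HASH_BUCKET_SIZE):
    """One-hot vector with a single 1.0 at position bucket_id."""
    vec = [0.0] * num_buckets
    vec[bucket_id] = 1.0
    return vec


def encode_row(row, keys=COLUMN_NAMES):
    """Concatenate the per-column indicator vectors for one sample."""
    encoded = []
    for key in keys:
        encoded.extend(indicator(stand_in_bucket(row[key])))
    return encoded


# Hypothetical sample rows, not taken from the asker's .csv file.
batch = [
    {"UserID": 1, "MovieID": 1193},
    {"UserID": 2, "MovieID": 661},
]
encoded_batch = [encode_row(row) for row in batch]

for row, vec in zip(batch, encoded_batch):
    hot_positions = [i for i, v in enumerate(vec) if v == 1.0]
    print(row, "-> length", len(vec), "hot positions", hot_positions)
```

Each sample becomes one row of width 2 * hash_bucket_size with exactly one 1.0 per feature column, which is the shape to expect when inspecting a batch from the real pipeline.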

python

tensorflow

tensorflow-datasets