Perform lookup in TensorFlow map

Question

I have a TF dataset with the following schema:

tf_features = {
 'searched_destination_ufi': tf.io.FixedLenFeature([], tf.int64, default_value=0),
 'booked_hotel_ufi': tf.io.FixedLenFeature([], dtype=tf.int64, default_value=0),
 'user_id': tf.io.FixedLenFeature([], dtype=tf.int64, default_value=0),
}

I also have a dictionary like this:

candidates = {'111': [123, 444, ...], '222': [555, 888, ...]...}

I want to perform a map operation like this:


ds.map(lambda x, y: {**x, 'candidates': candidates[x['searched_destination_ufi'].numpy()]})

But I always get: AttributeError: 'Tensor' object has no attribute 'numpy'

When I remove .numpy(), I get: TypeError: Tensor is unhashable. Instead, use tensor.ref() as the key.

Do you have any suggestions for a solution?

Answer 1

The function passed to dataset.map runs in graph mode, where you cannot call .numpy() on a tensor. You can try using tf.py_function to bring the candidates dict into your dataset:

import tensorflow as tf

tf_features = {
 'searched_destination_ufi': ['111', '222'],
 'booked_hotel_ufi': [2, 4],
 'user_id': [3, 2]
}

ds = tf.data.Dataset.from_tensor_slices(tf_features)

candidates = {'111': [123, 444], '222': [555, 888]}

def py_func(x):
  x = x.numpy().decode('utf-8')
  return candidates[x]


ds = ds.map(lambda x: {**x, 'candidates': tf.py_function(py_func, [x['searched_destination_ufi']], [tf.int32]*2)})
for x in ds:
  print(x)

{'searched_destination_ufi': <tf.Tensor: shape=(), dtype=string, numpy=b'111'>, 'booked_hotel_ufi': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'user_id': <tf.Tensor: shape=(), dtype=int32, numpy=3>, 'candidates': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([123, 444], dtype=int32)>}
{'searched_destination_ufi': <tf.Tensor: shape=(), dtype=string, numpy=b'222'>, 'booked_hotel_ufi': <tf.Tensor: shape=(), dtype=int32, numpy=4>, 'user_id': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'candidates': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([555, 888], dtype=int32)>}

Note that [tf.int32]*2 corresponds to the length of the lists in candidates.
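If tying Tout to the list length is inconvenient, one possible variation is to return a single tensor from the Python function and declare a single output dtype. This is only a sketch under assumptions: the helper names lookup_candidates and add_candidates are made up for illustration, and tf.ensure_shape is used to restore the static shape, which still assumes every list has two entries.

```python
import tensorflow as tf

candidates = {'111': [123, 444], '222': [555, 888]}

def lookup_candidates(key):
    # Inside tf.py_function the key is an eager scalar string tensor.
    py_key = key.numpy().decode('utf-8')
    # Return one tensor (not a Python list) so a single Tout dtype is enough.
    return tf.constant(candidates[py_key], dtype=tf.int32)

def add_candidates(x):
    cands = tf.py_function(
        lookup_candidates, [x['searched_destination_ufi']], tf.int32)
    # py_function loses static shape information; assume length-2 lists here.
    cands = tf.ensure_shape(cands, [2])
    return {**x, 'candidates': cands}

ds = tf.data.Dataset.from_tensor_slices({
    'searched_destination_ufi': ['111', '222'],
    'booked_hotel_ufi': [2, 4],
    'user_id': [3, 2],
})
ds = ds.map(add_candidates)

rows = [x['candidates'].numpy().tolist() for x in ds]
print(rows)
```

Keep in mind that tf.py_function still executes Python code for every element, so the table-based approach is likely the better fit for large datasets.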

For a more sophisticated approach, you can use tf.lookup.StaticHashTable and tf.gather, both of which work in graph mode:

import tensorflow as tf

tf_features = {
 'searched_destination_ufi': ['111', '222'],
 'booked_hotel_ufi': [2, 4],
 'user_id': [3, 2]
}

ds = tf.data.Dataset.from_tensor_slices(tf_features)

candidates = {'111': [123, 444], '222': [555, 888]}
keys = list(candidates.keys())
values = tf.constant(list(candidates.values()))

table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(tf.constant(keys), tf.range(len(keys))),
    default_value=-1)

ds = ds.map(lambda x: {**x, 'candidates': tf.gather(values, [table.lookup(x['searched_destination_ufi'])])})
for x in ds:
  print(x)

{'searched_destination_ufi': <tf.Tensor: shape=(), dtype=string, numpy=b'111'>, 'booked_hotel_ufi': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'user_id': <tf.Tensor: shape=(), dtype=int32, numpy=3>, 'candidates': <tf.Tensor: shape=(1, 2), dtype=int32, numpy=array([[123, 444]], dtype=int32)>}
{'searched_destination_ufi': <tf.Tensor: shape=(), dtype=string, numpy=b'222'>, 'booked_hotel_ufi': <tf.Tensor: shape=(), dtype=int32, numpy=4>, 'user_id': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'candidates': <tf.Tensor: shape=(1, 2), dtype=int32, numpy=array([[555, 888]], dtype=int32)>}
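The schema in the question declares searched_destination_ufi as tf.int64, whereas the example above uses string keys. Below is a rough sketch of the same table idea with integer keys. It assumes a hypothetical int-keyed version of candidates, and it appends a fallback row so that an unknown key maps to a valid index instead of -1. The names fallback_row, fallback_index and add_candidates are illustrative, not part of the original answer.

```python
import tensorflow as tf

# Hypothetical integer-keyed variant of the candidates dict.
candidates = {111: [123, 444], 222: [555, 888]}

keys = tf.constant(list(candidates.keys()), dtype=tf.int64)
# Extra last row that is returned for keys missing from the dict.
fallback_row = [-1, -1]
values = tf.constant(list(candidates.values()) + [fallback_row], dtype=tf.int64)
fallback_index = len(candidates)

# Map each key to the row index of its candidate list in `values`.
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys, tf.range(len(candidates), dtype=tf.int64)),
    default_value=fallback_index)

ds = tf.data.Dataset.from_tensor_slices({
    'searched_destination_ufi': tf.constant([111, 222, 999], dtype=tf.int64),
    'user_id': tf.constant([3, 2, 7], dtype=tf.int64),
})

def add_candidates(x):
    row = table.lookup(x['searched_destination_ufi'])
    # A scalar index yields shape (2,) rather than (1, 2).
    return {**x, 'candidates': tf.gather(values, row)}

ds = ds.map(add_candidates)

rows = [x['candidates'].numpy().tolist() for x in ds]
print(rows)
```

Here the key 999 is absent from the dict, so it resolves to the fallback row.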

If the candidates lists have variable length, use a ragged tensor with the second approach; the rest of the code stays the same:

candidates = {'111': [123, 444], '222': [555, 888, 323]}
keys = list(candidates.keys())
values = tf.ragged.constant(list(candidates.values()))

{'searched_destination_ufi': <tf.Tensor: shape=(), dtype=string, numpy=b'111'>, 'booked_hotel_ufi': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'user_id': <tf.Tensor: shape=(), dtype=int32, numpy=3>, 'candidates': <tf.RaggedTensor [[123, 444]]>}
{'searched_destination_ufi': <tf.Tensor: shape=(), dtype=string, numpy=b'222'>, 'booked_hotel_ufi': <tf.Tensor: shape=(), dtype=int32, numpy=4>, 'user_id': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'candidates': <tf.RaggedTensor [[555, 888, 323]]>}
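For completeness, here is a sketch of the ragged variant written out end to end, with one extra step that is not in the answer above: taking row 0 of the gathered result so each element holds a 1-D tensor, then using padded_batch to pad the lists to a common length. It assumes a TensorFlow version in which padded_batch can infer the padded shapes, and the helper name add_candidates is illustrative.

```python
import tensorflow as tf

candidates = {'111': [123, 444], '222': [555, 888, 323]}
keys = list(candidates.keys())
values = tf.ragged.constant(list(candidates.values()))

table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(tf.constant(keys), tf.range(len(keys))),
    default_value=-1)

ds = tf.data.Dataset.from_tensor_slices(
    {'searched_destination_ufi': ['111', '222']})

def add_candidates(x):
    # Same gather as in the answer: a RaggedTensor with a single row.
    gathered = tf.gather(values, [table.lookup(x['searched_destination_ufi'])])
    # Take that single row to get a dense 1-D tensor of variable length.
    return {**x, 'candidates': gathered[0]}

# Pad the variable-length lists with zeros up to the longest one in the batch.
batched = ds.map(add_candidates).padded_batch(2)

padded = next(iter(batched))['candidates'].numpy().tolist()
print(padded)
```

Zero is the default padding value, so pick a different padding_values argument if 0 could be a real candidate id.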

python

tensorflow

tensorflow-datasets

tensorflow2.0