Correct usage of TensorFlow Transform apply_buckets

Question

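This is on TensorFlow 1.11.0. The documentation for tft.apply_buckets is not very descriptive. Specifically, I read: "bucket_boundaries: The bucket boundaries represented as a rank 2 Tensor."

I assumed this meant pairs of bucket index and bucket boundary?

When I try that with the toy example below: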

import tensorflow as tf
import tensorflow_transform as tft
import numpy as np

tf.enable_eager_execution()

x = np.array([-1,9,19, 29, 39])
xt = tf.cast(
        tf.convert_to_tensor(x),
        tf.float32
        )

boundaries = tf.cast(
                tf.transpose(
                    tf.convert_to_tensor([[0, 1, 2, 3], [10, 20, 30, 40]])
                    ),
                tf.float32
                )

buckets = tft.apply_buckets(xt, boundaries)

I get:

InvalidArgumentError: Expected sorted boundaries [Op:BucketizeWithInputBoundaries] name: assign_buckets

Note that in this example, x and the bucket_boundaries argument are:

tf.Tensor([-1.  9. 19. 29. 39.], shape=(5,), dtype=float32)
tf.Tensor(
[[ 0. 10.]
 [ 1. 20.]
 [ 2. 30.]
 [ 3. 40.]], shape=(4, 2), dtype=float32)

So it seems bucket_boundaries is not supposed to hold indices and boundaries. Does anyone know how to use this method correctly?

Answer 1

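After some experimenting, I found that bucket_boundaries should be a 2-D array whose entries are the bucket boundaries, wrapped so that the array has two columns. See the example below: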

import tensorflow as tf
import tensorflow_transform as tft
import numpy as np

tf.enable_eager_execution()

x = np.array([-1,9,19, 29, 39])
xt = tf.cast(
        tf.convert_to_tensor(x),
        tf.float32
        )

boundaries = tf.cast(
                tf.transpose(
                    tf.convert_to_tensor([[0, 20, 40, 60], [10, 30, 50, 70]])
                    ),
                tf.float32
                )

buckets = tft.apply_buckets(xt, boundaries)

The input, the resulting bucket assignments, and the boundaries are then:

print (xt)
print (buckets)
print (boundaries)

tf.Tensor([-1.  9. 19. 29. 39.], shape=(5,), dtype=float32)
tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int64)
tf.Tensor(
[[ 0. 10.]
 [20. 30.]
 [40. 50.]
 [60. 70.]], shape=(4, 2), dtype=float32)
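
One way to read the error in the question and the output above is that the boundaries are consumed in row-major order and must be ascending in that order. That is an assumption inferred from the observed behavior, not something confirmed from the TFT documentation. The sketch below is plain Python (no TensorFlow), and the helpers flatten_row_major and is_sorted are hypothetical names used only for illustration:

```python
import bisect

def flatten_row_major(matrix):
    # Read a list of rows left to right, top to bottom.
    return [v for row in matrix for v in row]

def is_sorted(seq):
    # True when every element is <= its successor.
    return all(a <= b for a, b in zip(seq, seq[1:]))

# Boundaries from the question (after the transpose) and from this answer.
question_boundaries = [[0, 10], [1, 20], [2, 30], [3, 40]]
answer_boundaries = [[0, 10], [20, 30], [40, 50], [60, 70]]

flat_q = flatten_row_major(question_boundaries)  # [0, 10, 1, 20, 2, 30, 3, 40]
flat_a = flatten_row_major(answer_boundaries)    # [0, 10, 20, 30, 40, 50, 60, 70]

print(is_sorted(flat_q))  # False, consistent with "Expected sorted boundaries"
print(is_sorted(flat_a))  # True

# Assumed bucketing rule: index = number of boundaries <= value.
xs = [-1, 9, 19, 29, 39]
buckets = [bisect.bisect_right(flat_a, x) for x in xs]
print(buckets)  # [0, 1, 2, 3, 4]
```

Under that assumption the two-column layout is incidental: only the flattened order of the boundary values matters.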

Answer 2

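Wanted to add to this, since it is the only Google search result for "tft.apply_buckets" :)

The example above did not work for me with the latest version of TFT. The following code worked for me.

Note that the buckets are specified as a rank 2 tensor, but with only one element in the inner dimension.

(I am probably using the wrong words, but hopefully the example below makes it clear.)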

import tensorflow as tf
import tensorflow_transform as tft
import numpy as np

tf.enable_eager_execution()

xt = tf.cast(tf.convert_to_tensor(np.array([-1,9,19, 29, 39])),tf.float32)
bds = [[0],[10],[20],[30],[40]]
boundaries = tf.cast(tf.convert_to_tensor(bds),tf.float32)
buckets = tft.apply_buckets(xt, boundaries)

Thanks for the help; the answer above got me most of the way there!

The rest I found in the TFT source code: https://github.com/tensorflow/transform/blob/deb198d59f09624984622f7249944cdd8c3b733f/tensorflow_transform/mappers.py#L1697-L1698

Answer 3

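I like this answer and just wanted to add a few simplifications, since enabling eager execution, casting, and numpy are not actually needed. Note that the conversion to float below is done by making one of the scalars a float; TensorFlow then standardizes on the highest-fidelity data type.

The code below shows how the mapping works. The number of buckets created is the length of the bucket boundaries vector + 1, or (I think) more intuitively, the number of commas in the boundary list + 2. The 2 is added because there is one bucket from negative infinity to the smallest boundary and one from the largest boundary to infinity. If a value falls exactly on a bucket boundary, it goes into the bucket representing the larger numbers. What happens when the bucket boundaries are not sorted is left as an exercise for the reader :)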

import tensorflow as tf
import tensorflow_transform as tft
xt = tf.constant([-1., 9, 19, 29, 39, float('nan'), float('-inf'), float('inf')])
bucket_boundaries = tf.constant([[0], [10], [20], [30], [40]])
bucketed_floats = tft.apply_buckets(xt, bucket_boundaries)

for scalar, index in zip(xt, range(len(xt))):
    print(f"{scalar} was mapped to bucket {bucketed_floats[index]}.")

-1.0 was mapped to bucket 0.
9.0 was mapped to bucket 1.
19.0 was mapped to bucket 2.
29.0 was mapped to bucket 3.
39.0 was mapped to bucket 4.
nan was mapped to bucket 5.
-inf was mapped to bucket 0.
inf was mapped to bucket 5.

xt_int = tf.constant([-1, 9, 19, 29, 39, 41])
bucketed_ints = tft.apply_buckets(xt_int, bucket_boundaries)

for scalar, index in zip(xt_int, range(len(xt_int))):
    print(f"{scalar} was mapped to bucket {bucketed_ints[index]}.")

-1 was mapped to bucket 0.
9 was mapped to bucket 1.
19 was mapped to bucket 2.
29 was mapped to bucket 3.
39 was mapped to bucket 4.
41 was mapped to bucket 5.
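
The counting and tie-breaking rules described in this answer can be mimicked in plain Python with the standard bisect module. This is only an emulation of the observed behavior for finite values, not the TFT implementation; assign_buckets is a hypothetical helper, and the ValueError is a stand-in for the InvalidArgumentError shown in the question:

```python
import bisect

def assign_buckets(values, boundaries):
    # Reject unsorted boundaries, mirroring the error seen in the question.
    if any(a > b for a, b in zip(boundaries, boundaries[1:])):
        raise ValueError("Expected sorted boundaries")
    # bisect_right counts the boundaries <= value, so a value that sits
    # exactly on a boundary lands in the higher bucket.
    return [bisect.bisect_right(boundaries, v) for v in values]

boundaries = [0, 10, 20, 30, 40]

# Five boundaries produce six buckets: (-inf, 0), [0, 10), ..., [40, inf).
num_buckets = len(boundaries) + 1
print(num_buckets)  # 6

print(assign_buckets([-1, 9, 19, 29, 39, 41], boundaries))  # [0, 1, 2, 3, 4, 5]

# Values exactly on a boundary go to the bucket for the larger numbers.
print(assign_buckets([0, 10, 40], boundaries))  # [1, 2, 5]
```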

Note that there is also a function called tft.bucketize, which appears to require a full pass over the data. I'm not 100% clear on the nuances between tft.apply_buckets and tft.bucketize.
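
As far as I understand it, the difference is where the boundaries come from: tft.bucketize derives quantile-based boundaries from the whole dataset during the analysis phase and then applies them, while tft.apply_buckets takes boundaries you already have. The sketch below is a plain-Python analogy of that split, not TFT code. fit_boundaries and apply_boundaries are hypothetical names, and statistics.quantiles (Python 3.8+) is an exact stand-in for whatever quantile computation TFT performs:

```python
import bisect
import statistics

def fit_boundaries(values, num_buckets):
    # "Analyze" step: needs the full dataset to choose the cut points.
    # statistics.quantiles returns num_buckets - 1 ascending cut points.
    return statistics.quantiles(values, n=num_buckets)

def apply_boundaries(values, boundaries):
    # "Apply" step: a per-value lookup against boundaries that are already known.
    return [bisect.bisect_right(boundaries, v) for v in values]

data = [-1, 9, 19, 29, 39, 41]

# bucketize-like: boundaries are learned from the data, then applied.
learned = fit_boundaries(data, num_buckets=3)
learned_buckets = apply_boundaries(data, learned)
print(learned, learned_buckets)

# apply_buckets-like: boundaries are supplied by the caller.
fixed_buckets = apply_boundaries(data, [0, 10, 20, 30, 40])
print(fixed_buckets)
```

If the boundaries are known up front, the second form is enough and no pass over the data is needed.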


Tags: python, tensorflow, tensorflow-transform