How to get a histogram of PyTorch tensors in batches?

Question

Is there a way to compute histograms of torch tensors in batches?

For example, x is a tensor of shape (64, 224, 224):

# x will have shape of (64, 256)
x = batch_histogram(x, bins=256, min=0, max=255)

Answer 1

I am not sure, but this looks hard to me, and PyTorch offers nothing out of the box for it.

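A histogram is a statistical operation. It is inherently discrete and non-differentiable. In my view it also does not vectorize naturally, so I do not think there is anything simpler than a plain loop-based solution.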

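import torch
X = torch.rand(64, 224, 224) * 255  # scale to [0, 255) to cover the histogram range
# stack (not cat) keeps one row per sample, giving shape (64, 256)
h = torch.stack([torch.histc(x, bins=256, min=0, max=255) for x in X], 0)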

If anyone has a better solution, feel free to post it.
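To match the call shown in the question, the loop can be wrapped in a small function. This is only a sketch: the name batch_histogram_loop is hypothetical, and it assumes integer pixel values in [0, 255] that are cast to float before being passed to torch.histc.

```python
import torch

def batch_histogram_loop(x, bins=256, min=0, max=255):
    """Hypothetical wrapper: one torch.histc call per batch element."""
    # torch.histc flattens its input, so each (224, 224) sample yields one
    # 1-D histogram. The cast to float is a precaution, since histc may
    # reject integer tensors.
    rows = [torch.histc(sample.float(), bins=bins, min=min, max=max) for sample in x]
    return torch.stack(rows)

x = torch.randint(0, 256, (64, 224, 224))
h = batch_histogram_loop(x)
print(h.shape)  # torch.Size([64, 256])
```

The cost is still one Python-level iteration per batch element, which is usually acceptable for a batch of 64 but grows linearly with the batch size.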

Answer 2

This can be done in one line with torch.nn.functional.one_hot:

torch.nn.functional.one_hot(data_tensor, num_classes).sum(dim=-2)

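The rationale is that one_hot respects batches: for each value v in the last dimension of the given tensor, it creates a tensor filled with 0s except for the v-th component, which is 1. Summing all these one-hot encodings over the second-to-last dimension (which was the last dimension of data_tensor) gives the number of times v appears in each row of data.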
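The two steps can be followed on a tiny input. The sketch below reuses the 2 x 4 data from the tests further down, with num_classes fixed at 6; the variable names are only illustrative.

```python
import torch
import torch.nn.functional as F

data = torch.tensor([[2, 5, 1, 1],
                     [3, 0, 3, 1]])

# Step 1: every value becomes a length-6 indicator vector.
one_hot = F.one_hot(data, num_classes=6)
print(one_hot.shape)  # torch.Size([2, 4, 6])

# Step 2: summing over the original last dimension (now dim=-2)
# counts how often each class occurs in each row.
hist = one_hot.sum(dim=-2)
print(hist)
# tensor([[0, 2, 1, 0, 0, 1],
#         [1, 1, 0, 2, 0, 0]])
```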

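A possibly serious drawback of this method is memory usage, because every value is expanded into a tensor of size num_classes (so the size of data_tensor is multiplied by num_classes). This memory usage is temporary, however, because sum collapses the extra dimension again, and the result is usually smaller than data_tensor. I say "usually" because if num_classes is much larger than the size of the last dimension of data_tensor, the result will be correspondingly larger.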
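For the shape in the question, the intermediate one-hot tensor would hold 64 x 50176 x 256 int64 values, roughly 6.6 GB. One way to avoid that expansion, which is a different technique from the one in this answer, is to accumulate counts directly with scatter_add_. The following is a minimal sketch under the assumption that the input is a torch.long tensor with values in [0, num_classes); the name batch_histogram_scatter is hypothetical.

```python
import torch

def batch_histogram_scatter(data_tensor, num_classes):
    """Count integer values along the last dimension without a one-hot expansion.

    Hypothetical helper, not part of PyTorch. Assumes data_tensor is a
    torch.long tensor whose values lie in [0, num_classes).
    """
    out_shape = (*data_tensor.shape[:-1], num_classes)
    hist = torch.zeros(out_shape, dtype=torch.long, device=data_tensor.device)
    # For each element, add 1 to the bin selected by that element's value.
    hist.scatter_add_(-1, data_tensor, torch.ones_like(data_tensor))
    return hist

# The shape from the question: 64 images of 224 x 224 integer pixel values.
x = torch.randint(0, 256, (64, 224, 224))
hist = batch_histogram_scatter(x.flatten(start_dim=1), num_classes=256)
print(hist.shape)  # torch.Size([64, 256])
```

Here the only temporary is a tensor of ones with the same size as the input, so peak memory no longer scales with num_classes.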

Here is the code with documentation, followed by pytest tests:

import pytest
import torch


def batch_histogram(data_tensor, num_classes=-1):
    """
    Computes histograms of integral values, even if in batches (as opposed to torch.histc and torch.histogram).
    Arguments:
        data_tensor: a D_1 x ... x D_n torch.LongTensor
        num_classes (optional): the number of classes present in data.
                                If not provided, data_tensor.max() + 1 is used (an error is thrown if data_tensor is empty).
    Returns:
        A D_1 x ... x D_{n-1} x num_classes 'result' torch.LongTensor,
        containing histograms of the last dimension D_n of data_tensor,
        that is, result[d_1,...,d_{n-1}, c] = number of times c appears in data_tensor[d_1,...,d_{n-1}].
    """
    return torch.nn.functional.one_hot(data_tensor, num_classes).sum(dim=-2)

def test_batch_histogram():
    data = [2, 5, 1, 1]
    expected = [0, 2, 1, 0, 0, 1]
    run_test(data, expected)

    data = [
        [2, 5, 1, 1],
        [3, 0, 3, 1],
    ]
    expected = [
        [0, 2, 1, 0, 0, 1],
        [1, 1, 0, 2, 0, 0],
    ]
    run_test(data, expected)

    data = [
        [[2, 5, 1, 1], [2, 4, 1, 1], ],
        [[3, 0, 3, 1], [2, 3, 1, 1], ],
    ]
    expected = [
        [[0, 2, 1, 0, 0, 1], [0, 2, 1, 0, 1, 0], ],
        [[1, 1, 0, 2, 0, 0], [0, 2, 1, 1, 0, 0], ],
    ]
    run_test(data, expected)


def test_empty_data():
    data = []
    num_classes = 2
    expected = [0, 0]
    run_test(data, expected, num_classes)

    data = [[], []]
    num_classes = 2
    expected = [[0, 0], [0, 0]]
    run_test(data, expected, num_classes)

    data = [[], []]
    run_test(data, expected=None, exception=RuntimeError)  # num_classes not provided for empty data


def run_test(data, expected, num_classes=-1, exception=None):
    data_tensor = torch.tensor(data, dtype=torch.long)

    if exception is None:
        expected_tensor = torch.tensor(expected, dtype=torch.long)
        actual = batch_histogram(data_tensor, num_classes)
        assert torch.equal(actual, expected_tensor)
    else:
        with pytest.raises(exception):
            batch_histogram(data_tensor, num_classes)

