How to get a histogram of PyTorch tensors in batches?
Is there a way to get histograms of torch tensors in batches?
For example, if x is a tensor of shape (64, 224, 224):
# x will have shape of (64, 256)
x = batch_histogram(x, bins=256, min=0, max=255)
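I'm not sure, but this looks like a hard thing to do, and PyTorch has nothing for it out of the box.
A histogram is a statistical operation. It is inherently discrete and non-differentiable. Also, histograms are not inherently vectorizable. So I don't think there is anything simpler than a plain loop-based solution.
import torch

# Scale to [0, 255) so the values spread across the 256 bins instead of all landing in the first one.
X = torch.rand(64, 224, 224) * 255
# torch.histc returns a 1-D tensor of 256 counts per image; stack (not cat) them to get shape (64, 256).
h = torch.stack([torch.histc(x, bins=256, min=0, max=255) for x in X], 0)
If anyone has a better solution, feel free to post it.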
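The loop-based approach can be wrapped into a helper with the signature used in the question. This is only a sketch: the name batch_histogram_loop is made up for illustration, and it assumes a floating-point input whose first dimension is the batch dimension.

```python
import torch


def batch_histogram_loop(x, bins=256, min=0, max=255):
    # Hypothetical helper, not part of PyTorch.
    # torch.histc flattens its input, so each x[i] yields one 1-D tensor of `bins` counts.
    rows = [torch.histc(sample, bins=bins, min=min, max=max) for sample in x]
    # stack keeps one row per batch element: shape (batch, bins)
    return torch.stack(rows, dim=0)


x = torch.rand(64, 224, 224) * 255  # float values in [0, 255)
h = batch_histogram_loop(x)
print(h.shape)  # torch.Size([64, 256])
```

Every value lies inside [min, max], so each row of h should sum to 224 * 224.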
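This can be done in a single line of code with torch.nn.functional.one_hot:
torch.nn.functional.one_hot(data_tensor, num_classes).sum(dim=-2)
The rationale is that one_hot already respects batches: for each value v in the last dimension of the given tensor, it creates a tensor filled with 0s except for the v-th component, which is 1. Summing all of these one-hot encodings over the second-to-last dimension of the result (which was the last dimension of data_tensor) gives the number of times v appears in each row of data.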
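To make the two steps concrete, here is a small worked example on a 2 x 4 tensor with six classes. The input values are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn.functional as F

data_tensor = torch.tensor([[2, 5, 1, 1],
                            [3, 0, 3, 1]])

# Step 1: every value becomes a length-6 one-hot vector, adding a trailing dimension.
encoded = F.one_hot(data_tensor, num_classes=6)
print(encoded.shape)  # torch.Size([2, 4, 6])

# Step 2: summing over dim=-2 (the original last dimension) counts occurrences per row.
histograms = encoded.sum(dim=-2)
print(histograms)
# tensor([[0, 2, 1, 0, 0, 1],
#         [1, 1, 0, 2, 0, 0]])
```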
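A possibly serious drawback of this method is memory usage, since each value is expanded into a tensor of size num_classes (so the size of data_tensor is multiplied by num_classes). This memory usage is temporary, however, because sum collapses the extra dimension again, and the result will typically be smaller than data_tensor. I say "typically" because if num_classes is much larger than the size of the last dimension of data_tensor, the result will be correspondingly larger.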
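To put a number on this: if the question's input were flattened to shape (64, 50176) and encoded with num_classes=256, the intermediate one-hot tensor would hold 64 × 50176 × 256 int64 values, roughly 6.6 GB. One way to avoid that intermediate is to accumulate the counts directly with scatter_add_. Note that this is a different technique from the one_hot approach, not the method described in this answer. The name batch_histogram_scatter is hypothetical, and the sketch assumes an integer (torch.long) tensor with all values in [0, num_classes).

```python
import torch


def batch_histogram_scatter(data_tensor, num_classes):
    # Hypothetical helper: counts values along the last dimension without
    # materializing a one-hot tensor. Assumes 0 <= value < num_classes.
    counts = torch.zeros(*data_tensor.shape[:-1], num_classes,
                         dtype=torch.long, device=data_tensor.device)
    # For every element, add 1 at the position given by its value.
    counts.scatter_add_(-1, data_tensor, torch.ones_like(data_tensor))
    return counts


data_tensor = torch.tensor([[2, 5, 1, 1],
                            [3, 0, 3, 1]])
print(batch_histogram_scatter(data_tensor, num_classes=6))
# tensor([[0, 2, 1, 0, 0, 1],
#         [1, 1, 0, 2, 0, 0]])
```

Unlike one_hot, this sketch cannot infer num_classes, so it has to be passed explicitly.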
Here is the code with documentation, followed by pytest tests:
import torch


def batch_histogram(data_tensor, num_classes=-1):
    """
    Computes histograms of integral values, even if in batches (as opposed to torch.histc and torch.histogram).
    Arguments:
        data_tensor: a D1 x ... x D_n torch.LongTensor
        num_classes (optional): the number of classes present in data.
                                If not provided, data_tensor.max() + 1 is used (an error is thrown if data_tensor is empty).
    Returns:
        A D1 x ... x D_{n-1} x num_classes 'result' torch.LongTensor,
        containing histograms of the last dimension D_n of data_tensor,
        that is, result[d_1,...,d_{n-1}, c] = number of times c appears in data_tensor[d_1,...,d_{n-1}].
    """
    return torch.nn.functional.one_hot(data_tensor, num_classes).sum(dim=-2)
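As a usage sketch for the shape in the question: the function histograms only the last dimension, so an image batch has to be flattened per sample first. The example below uses a smaller stand-in batch to keep the temporary one-hot tensor small, and it assumes integer pixel values in 0..255.

```python
import torch


def batch_histogram(data_tensor, num_classes=-1):
    # Same one-liner as above, repeated so this snippet runs on its own.
    return torch.nn.functional.one_hot(data_tensor, num_classes).sum(dim=-2)


# Small stand-in for the question's (64, 224, 224) batch of 8-bit images.
images = torch.randint(0, 256, (4, 32, 32))  # torch.int64 by default
flat = images.flatten(start_dim=1)           # shape (4, 1024): one row per image
hist = batch_histogram(flat, num_classes=256)
print(hist.shape)  # torch.Size([4, 256])
```

Passing num_classes=256 explicitly keeps the output width fixed even when a batch happens not to contain the value 255.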
import pytest


def test_batch_histogram():
    data = [2, 5, 1, 1]
    expected = [0, 2, 1, 0, 0, 1]
    run_test(data, expected)

    data = [
        [2, 5, 1, 1],
        [3, 0, 3, 1],
    ]
    expected = [
        [0, 2, 1, 0, 0, 1],
        [1, 1, 0, 2, 0, 0],
    ]
    run_test(data, expected)

    data = [
        [[2, 5, 1, 1], [2, 4, 1, 1], ],
        [[3, 0, 3, 1], [2, 3, 1, 1], ],
    ]
    expected = [
        [[0, 2, 1, 0, 0, 1], [0, 2, 1, 0, 1, 0], ],
        [[1, 1, 0, 2, 0, 0], [0, 2, 1, 1, 0, 0], ],
    ]
    run_test(data, expected)


def test_empty_data():
    data = []
    num_classes = 2
    expected = [0, 0]
    run_test(data, expected, num_classes)

    data = [[], []]
    num_classes = 2
    expected = [[0, 0], [0, 0]]
    run_test(data, expected, num_classes)

    data = [[], []]
    run_test(data, expected=None, exception=RuntimeError)  # num_classes not provided for empty data


def run_test(data, expected, num_classes=-1, exception=None):
    data_tensor = torch.tensor(data, dtype=torch.long)
    if exception is None:
        expected_tensor = torch.tensor(expected, dtype=torch.long)
        actual = batch_histogram(data_tensor, num_classes)
        assert torch.equal(actual, expected_tensor)
    else:
        with pytest.raises(exception):
            batch_histogram(data_tensor, num_classes)
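The tests above use hand-written expectations. As an additional cross-check, the result can be compared against torch.bincount applied row by row. This is a sketch rather than part of the original test suite; the test name and the random input are made up for illustration.

```python
import torch


def batch_histogram(data_tensor, num_classes=-1):
    # Repeated here so the snippet runs on its own.
    return torch.nn.functional.one_hot(data_tensor, num_classes).sum(dim=-2)


def test_matches_bincount():
    torch.manual_seed(0)
    num_classes = 10
    data_tensor = torch.randint(0, num_classes, (5, 20))
    actual = batch_histogram(data_tensor, num_classes)
    # torch.bincount only accepts 1-D input, so use it per row as a reference.
    expected = torch.stack(
        [torch.bincount(row, minlength=num_classes) for row in data_tensor]
    )
    assert torch.equal(actual, expected)
```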