Faster solution for sampling an index by value of ndarray

Question

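I have some fairly large arrays to process. By large, I mean shapes on the order of (514, 514, 374). I want to randomly pick an index based on its pixel value. For example, I need the 3-d index of a pixel whose value equals 1. So I list all the candidates with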

indices = np.asarray(np.where(img_arr == 1)).T

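This works correctly, but it is unbearably slow because the array is so large. So my question is: is there a better way to do this? It would be even better if I could pass in a list of pixel values and get back a list of corresponding indices. For example, I want to sample one index for each of the pixel values [0, 1, 2] and get back a list of indices such as [[1, 2, 3], [53, 215, 11], [223, 42, 113]].
Since I am working with medical images, solutions using SimpleITK are also welcome. Feel free to leave your comments, thanks.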

Answer 1

import numpy as np

value = 1
# value_list = [1, 3, 5]  # you can also use a list of values -> *
n_samples = 3
n_subset = 500

# Create an example array
img_arr = np.random.randint(low=0, high=5, size=(10, 30, 20))

# Randomly choose n_subset candidate positions (one row per position, one column per axis)
idx_subset = np.array([np.random.randint(low=0, high=s, size=n_subset) for s in img_arr.shape]).T
# Get the values at the sampled positions (index with a tuple holding one array per axis)
values_subset = img_arr[tuple(idx_subset.T)]
# Check which values match
idx_subset_matching_temp = np.where(values_subset == value)[0]
# idx_subset_matching_temp = np.argwhere(np.isin(values_subset, value_list)).ravel()  -> *
# Get all the indices of the subset with the correct value(s)
idx_subset_matching = idx_subset[idx_subset_matching_temp, :]
# Shuffle the array of indices (in place, along the first axis)
np.random.shuffle(idx_subset_matching)
# Only keep as many as you need
idx_subset_matching = idx_subset_matching[:n_samples, :]

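This gives you the samples you need. Their distribution should be the same as with your approach of looking at every match in the array: in both cases you get a uniform distribution over all positions that hold a matching value. Note that the candidate positions are drawn with replacement, so the same index can occasionally appear more than once.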

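You have to be careful when choosing the size of the subset and the number of samples you want. The subset must be large enough to contain enough matching values, otherwise you end up with fewer than n_samples indices. A similar problem arises when the values you want to sample are very sparse: the subset then has to be very large (in the extreme case, the whole array) and you gain nothing over scanning the full array.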
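One way to guard against a subset that turns out to be too small is to draw candidates in batches until enough matches have been collected, and to fall back to a full scan if the value is too sparse. The following is only a sketch of that idea, not part of the original answer: the function name `sample_matching_indices` and the parameters `max_tries` and `rng` are hypothetical, and unlike the snippet above it removes duplicate positions before choosing the final samples.

```python
import numpy as np


def sample_matching_indices(img_arr, values, n_samples,
                            n_subset=500, max_tries=20, rng=None):
    """Return up to n_samples distinct indices whose value is in `values`.

    Hypothetical helper: tries random batches first, then a full scan.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.atleast_1d(values)
    hits = np.empty(0, dtype=np.intp)  # distinct flat indices found so far

    for _ in range(max_tries):
        # Draw n_subset flat positions uniformly (with replacement)
        flat = rng.integers(0, img_arr.size, size=n_subset)
        coords = np.unravel_index(flat, img_arr.shape)
        # Keep only the positions that hold one of the requested values
        mask = np.isin(img_arr[coords], values)
        hits = np.unique(np.concatenate([hits, flat[mask]]))
        if hits.size >= n_samples:
            break
    else:
        # Too sparse for random probing: scan the whole array once
        hits = np.flatnonzero(np.isin(img_arr, values))

    n_take = min(n_samples, hits.size)
    if n_take == 0:
        return np.empty((0, img_arr.ndim), dtype=np.intp)
    chosen = rng.choice(hits, size=n_take, replace=False)
    # Convert flat indices back to one row of coordinates per sample
    return np.stack(np.unravel_index(chosen, img_arr.shape), axis=1)


# Example usage on a small random array
example = np.random.default_rng(0).integers(0, 5, size=(10, 30, 20))
sampled = sample_matching_indices(example, [1, 3], n_samples=3)
```

As a rough guide, if a fraction p of the pixels matches, one batch is expected to contain about n_subset * p matches, so n_subset should be comfortably larger than n_samples / p.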

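If you sample from the same array often, it is also a good idea to store the indices for each value i once

indices_i = np.asarray(np.where(img_arr == i)).T

and reuse them for your further calculations.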
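If indices are needed for many different values, the lookup for all values can be built in a single pass instead of calling np.where once per value. The sketch below is one possible way to do this, assuming the array fits in memory several times over; `build_index_table` and `sample_from_table` are hypothetical helper names, not functions from NumPy or SimpleITK.

```python
import numpy as np


def build_index_table(img_arr):
    """Map each distinct value to the flat indices where it occurs."""
    flat = img_arr.ravel()
    # Sorting groups equal values together; `order` holds their flat indices
    order = np.argsort(flat, kind="stable")
    uniq, starts = np.unique(flat[order], return_index=True)
    # Split the sorted index array at the start of each new value
    groups = np.split(order, starts[1:])
    return dict(zip(uniq.tolist(), groups))


def sample_from_table(table, shape, value, n_samples, rng=None):
    """Draw n_samples distinct indices for `value` from a prebuilt table."""
    rng = np.random.default_rng() if rng is None else rng
    # Raises ValueError if fewer than n_samples positions hold `value`
    flat_idx = rng.choice(table[value], size=n_samples, replace=False)
    return np.stack(np.unravel_index(flat_idx, shape), axis=1)


# Example usage: build once, then sample repeatedly
example = np.random.default_rng(0).integers(0, 5, size=(10, 30, 20))
table = build_index_table(example)
sampled = sample_from_table(table, example.shape, value=1, n_samples=3)
```

Storing flat indices keeps the table compact, but it is still one integer per pixel: for a (514, 514, 374) array that is roughly 0.8 GB with 64-bit indices, so this trade-off only pays off when the same array is sampled many times.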

numpy

simpleitk