How to use np.unique on big arrays?

Question

I work with geospatial images in TIFF format. Thanks to the rasterio library, I can handle these images as NumPy arrays of dimension (nb_bands, x, y). Here I am working with an image that contains patches of unique values that I would like to count (they were generated with the scipy.ndimage.label function).

My idea was to use NumPy's unique function to retrieve the information from these patches, as follows:

# identify the clumps
with rio.open(mask) as f:
    mask_raster = f.read(1)

class_, indices, count = np.unique(mask_raster, return_index=True, return_counts=True) 
del mask_raster
        
# identify the value
with rio.open(src) as f:
    src_raster = f.read(1)

src_flat = src_raster.flatten()
del src_raster 
    
values = [src_flat[index] for index in indices]
    
df = pd.DataFrame({'patchId': indices, 'nb_pixel': count, 'value': values})

My problem is that for an image of shape (69940, 70936) (84.7 MB on my disk), np.unique tries to allocate an array of the same dimensions in int64, and I get the following error:

Unable to allocate 37.0 GiB for an array with shape (69940, 70936) and data type uint64

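Is it normal for unique to recast my array to int64?
Is it possible to force a more compact dtype? (Even if every patch were a single pixel, np.int32 would be sufficient.)
Is there another solution using a function I don't know about?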

Answer 1

The uint64 array is probably allocated during the argsort step in the np.unique source code.
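
As a quick sanity check on the size in the error message, the arithmetic below reproduces the 37.0 GiB figure for one 8-byte value per pixel. It only uses the shape and dtype quoted in the question and says nothing about where exactly NumPy allocates the array.

```python
import numpy as np

# Shape and dtype taken from the error message in the question.
shape = (69940, 70936)
n_pixels = shape[0] * shape[1]

# One 8-byte (uint64) entry per pixel, expressed in GiB.
gib = n_pixels * np.dtype(np.uint64).itemsize / 2**30
print(f"{gib:.1f} GiB")  # 37.0 GiB
```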

Since the labels from scipy.ndimage.label are consecutive integers starting at zero, you can use numpy.bincount:

num_features = np.max(mask_raster)
# np.bincount only accepts 1-D input, so flatten the 2-D raster first
count = np.bincount(mask_raster.ravel(), minlength=num_features+1)

To get the values from src, you can do the following assignment. It is really inefficient, but I think it won't allocate too much memory.

values = np.zeros(num_features+1, dtype=src_raster.dtype)
values[mask_raster] = src_raster
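
To see what these two steps produce, here is a toy run on a tiny labelled array. The arrays `mask` and `src` are made up for illustration, and the sketch assumes every pixel of a patch carries the same value in `src`, so it does not matter which pixel's value ends up stored for a label.

```python
import numpy as np

# Hypothetical 2x3 label image (0 = background) and matching source values.
mask = np.array([[0, 1, 1],
                 [2, 2, 2]], dtype=np.int32)
src = np.array([[0, 5, 5],
                [7, 7, 7]], dtype=np.uint8)

num_features = int(mask.max())

# Pixel count per label; bincount needs a 1-D array.
count = np.bincount(mask.ravel(), minlength=num_features + 1)

# Scatter the source values into one slot per label.
values = np.zeros(num_features + 1, dtype=src.dtype)
values[mask] = src

print(count)   # [1 2 3]
print(values)  # [0 5 7]
```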

Maybe scipy.ndimage has a function better suited to this use case.

Answer 2

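I think splitting the NumPy array into smaller chunks and accumulating the unique:count values would be a memory-efficient solution, along with changing the data type to int16 or a similar smaller type.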
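
A minimal sketch of the chunked idea, assuming the labels are non-negative integers in a dtype that casts safely to int64 (for example int32). The function name `chunked_label_counts` and the `rows_per_chunk` parameter are hypothetical. It counts with np.bincount on blocks of rows, so any temporary array is only as large as one block.

```python
import numpy as np

def chunked_label_counts(labels, rows_per_chunk=1024):
    """Count pixels per label by processing the raster in blocks of rows."""
    num_features = int(labels.max())
    counts = np.zeros(num_features + 1, dtype=np.int64)
    for start in range(0, labels.shape[0], rows_per_chunk):
        block = labels[start:start + rows_per_chunk]
        # minlength keeps every partial result the same length
        counts += np.bincount(block.ravel(), minlength=num_features + 1)
    return counts

# Toy example: 3x3 label image processed two rows at a time.
labels = np.array([[0, 1, 1],
                   [2, 2, 2],
                   [0, 0, 2]], dtype=np.int32)
result = chunked_label_counts(labels, rows_per_chunk=2)
print(result)  # [3 2 4]
```

The block size trades speed for memory: larger blocks mean fewer Python-level iterations but bigger temporaries.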

Answer 3

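I dug into the scipy.ndimage library and did find a solution that avoids the memory blow-up. Because it slices the initial raster, it is faster than I expected: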

from scipy import ndimage
import numpy as np
import pandas as pd
import rasterio as rio

# open the files
with rio.open(mask) as f_mask, rio.open(src) as f_src:
    mask_raster = f_mask.read(1)
    src_raster = f_src.read(1)

# use patches as slicing material
# find_objects returns one bounding-box slice per label, from 1 to max inclusive
indices = list(range(1, np.max(mask_raster) + 1))
counts = []
values = []
for loc in ndimage.find_objects(mask_raster):
    loc_values, loc_counts = np.unique(mask_raster[loc], return_counts=True)

    # the value of the patch is the value with the highest count
    idx = np.argmax(loc_counts)
    counts.append(loc_counts[idx])
    values.append(loc_values[idx])

df = pd.DataFrame({'patchId': indices, 'nb_pixel': counts, 'value': values})
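
As written, the loop above takes the most frequent value of mask_raster inside each bounding box and never reads src_raster. A possible variant, sketched below, counts only the pixels whose label equals the patch id and reads the value from the source raster. The helper name `patch_stats` is hypothetical, and the sketch assumes all pixels of a patch share one source value, so the first one is representative.

```python
import numpy as np
from scipy import ndimage

def patch_stats(mask_raster, src_raster):
    """Return (patch_ids, pixel_counts, values) for each labelled patch."""
    patch_ids, counts, values = [], [], []
    # find_objects yields one bounding-box slice per label 1..max,
    # or None when a label does not occur in the array.
    for label, loc in enumerate(ndimage.find_objects(mask_raster), start=1):
        if loc is None:
            continue
        inside = mask_raster[loc] == label  # boolean mask within the box only
        patch_ids.append(label)
        counts.append(int(inside.sum()))
        values.append(src_raster[loc][inside][0])
    return patch_ids, counts, values

# Toy example with two patches.
mask_demo = np.array([[1, 1, 0],
                      [0, 2, 2],
                      [0, 0, 2]], dtype=np.int32)
src_demo = np.array([[5, 5, 9],
                     [9, 7, 7],
                     [9, 9, 7]], dtype=np.uint8)
ids, counts, values = patch_stats(mask_demo, src_demo)
```

The three lists can then be passed to pd.DataFrame in the same way as in the answer. Each comparison only touches one bounding box, so the temporaries stay small as long as individual patches are small.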

python

numpy

scipy

rasterio