How to use np.unique on big arrays?

I work with geospatial images in tif format. Thanks to the rasterio lib I can exploit these images as numpy arrays of dimension (nb_bands, x, y). Here I manipulate an image that contains patches of unique values that I would like to count (they were generated with the scipy.ndimage.label function).

My idea was to use the numpy unique method to retrieve the information from these patches, as follows:

import numpy as np
import pandas as pd
import rasterio as rio

# identify the clumps
with rio.open(mask) as f:
    mask_raster = f.read(1)

class_, indices, count = np.unique(mask_raster, return_index=True, return_counts=True)
del mask_raster

# identify the value
with rio.open(src) as f:
    src_raster = f.read(1)

src_flat = src_raster.flatten()
del src_raster

values = [src_flat[index] for index in indices]

df = pd.DataFrame({'patchId': indices, 'nb_pixel': count, 'value': values})

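My problem is the following: for an image of shape (69940, 70936) (84.7 MB on my disk), np.unique tries to allocate an array of the same dimensions with a 64-bit integer dtype, and I get the following error: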

Unable to allocate 37.0 GiB for an array with shape (69940, 70936) and data type uint64

The uint64 array is probably allocated during the argsort step in the np.unique source code.
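The figure in the error message can be checked by hand. The sketch below only redoes the arithmetic for the shape and dtype quoted in the error; it assumes 8 bytes per uint64 element and does not touch the raster itself.

```python
# Shape and element size taken from the error message above.
shape = (69940, 70936)
bytes_per_element = 8  # one uint64 value

n_elements = shape[0] * shape[1]
n_bytes = n_elements * bytes_per_element
gib = n_bytes / 2**30  # bytes -> GiB

print(f"{n_elements} elements, {gib:.1f} GiB")
```

Any intermediate array with one 64-bit entry per pixel costs this much, whatever the size of the file on disk.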

Since the labels from scipy.ndimage.label are consecutive integers starting at zero, you can use numpy.bincount:

num_features = int(np.max(mask_raster))
# np.bincount only accepts 1-D input, so pass a flattened view of the raster
count = np.bincount(mask_raster.ravel(), minlength=num_features + 1)

To get the values from src, you can do the following assignment. It is really inefficient, but I think it doesn't allocate too much memory.

values = np.zeros(num_features+1, dtype=src_raster.dtype)
values[mask_raster] = src_raster
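To see what these two steps produce, here is a minimal sketch on toy data. The two small arrays are hypothetical stand-ins for the real rasters, and the example assumes every pixel of a patch carries the same source value.

```python
import numpy as np

# Hypothetical toy label raster: 0 is background, labels 1..3 are patches.
mask_raster = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 2],
    [3, 0, 2, 2],
])
# Hypothetical source raster holding one value per patch.
src_raster = np.array([
    [0, 5, 5, 0],
    [0, 5, 0, 7],
    [9, 0, 7, 7],
], dtype=np.uint8)

num_features = int(mask_raster.max())

# Pixel count per label; bin 0 is the background.
count = np.bincount(mask_raster.ravel(), minlength=num_features + 1)

# One slot per label; every pixel writes its source value into its label's slot.
values = np.zeros(num_features + 1, dtype=src_raster.dtype)
values[mask_raster] = src_raster

print(count.tolist())   # pixel counts for labels 0..3
print(values.tolist())  # source value stored for labels 0..3
```

Both result arrays have one entry per label rather than one per pixel, which is why this route should stay small in memory.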

Maybe scipy.ndimage has a function that fits the use case better.

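I think splitting the NumPy array into smaller chunks and yielding unique:count values would be a memory-efficient solution, along with changing the data type to int16 or similar.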
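A minimal sketch of that chunking idea, under some assumptions: the helper names `chunked_unique_counts` and `count_values` and the `rows_per_chunk` parameter are made up for illustration, and the array is split into blocks of rows that are each passed to np.unique separately.

```python
from collections import Counter

import numpy as np


def chunked_unique_counts(arr, rows_per_chunk=1024):
    """Yield (values, counts) for successive blocks of rows of a 2-D array."""
    for start in range(0, arr.shape[0], rows_per_chunk):
        block = arr[start:start + rows_per_chunk]  # a view, not a copy
        yield np.unique(block, return_counts=True)


def count_values(arr, rows_per_chunk=1024):
    """Merge the per-block counts into a single value -> count mapping."""
    totals = Counter()
    for values, counts in chunked_unique_counts(arr, rows_per_chunk):
        for value, n in zip(values.tolist(), counts.tolist()):
            totals[value] += n
    return totals


# Hypothetical toy raster, processed two rows at a time.
toy = np.array([
    [0, 1, 1],
    [2, 2, 2],
    [0, 1, 0],
])
totals = count_values(toy, rows_per_chunk=2)
print(dict(totals))
```

With this layout np.unique only ever sorts one block at a time, so its temporary arrays scale with `rows_per_chunk` instead of the full raster. The same loop could read each block from disk with rasterio's windowed reads rather than slice an in-memory array. As for int16, it only works while the largest label stays below 32768.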

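I dug into the scipy.ndimage library and did find a solution that avoids the memory blow-up. Because it slices the initial raster, it is faster than I expected: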

import numpy as np
import pandas as pd
import rasterio as rio
from scipy import ndimage

# open the files
with rio.open(mask) as f_mask, rio.open(src) as f_src:
    mask_raster = f_mask.read(1)
    src_raster = f_src.read(1)

# use patches as slicing material; labels run from 1 to the max label inclusive
indices = list(range(1, int(np.max(mask_raster)) + 1))
counts = []
values = []
for label, loc in zip(indices, ndimage.find_objects(mask_raster)):
    # keep only the source pixels that belong to this patch inside its bounding box
    patch = mask_raster[loc] == label
    loc_values, loc_counts = np.unique(src_raster[loc][patch], return_counts=True)

    # the value of the patch is the value with the highest count
    idx = np.argmax(loc_counts)
    counts.append(loc_counts[idx])
    values.append(loc_values[idx])

df = pd.DataFrame({'patchId': indices, 'nb_pixel': counts, 'value': values})
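As a sanity check, the bounding-box approach can be compared with np.bincount on a toy raster. This is only a sketch: the small array is a hypothetical stand-in for the real image, and the value of a patch is taken from its first pixel, which assumes each patch is uniform in the source raster.

```python
import numpy as np
from scipy import ndimage

# Hypothetical toy source raster: three patches with values 5, 7 and 9.
src_raster = np.array([
    [0, 0, 5, 5],
    [0, 0, 5, 5],
    [7, 7, 0, 0],
    [7, 0, 0, 9],
])
# Label the non-zero regions (default connectivity: no diagonal neighbours).
mask_raster, num_features = ndimage.label(src_raster)

counts, values = [], []
for label, loc in enumerate(ndimage.find_objects(mask_raster), start=1):
    patch = mask_raster[loc] == label              # pixels of this patch in its bounding box
    counts.append(int(patch.sum()))                # patch size in pixels
    values.append(int(src_raster[loc][patch][0]))  # first pixel of the patch

# Reference counts from bincount over the whole raster, background bin dropped.
reference = np.bincount(mask_raster.ravel(), minlength=num_features + 1)[1:]

print(counts, values, reference.tolist())
```

Each iteration only touches the bounding box of one patch, so the temporary arrays stay proportional to the patch size.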