Extracting indices of feature from multidimensional array

Question

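I have an array, space, representing cloud water concentration values in 3D space. Wherever the cloud water concentration is above a certain threshold, I say there is cloud (see the cross-section below). Most of the domain is dry, but a stratocumulus layer with its base at around 400 m covers most of the horizontal extent.

What I want to do is extract the (x,y,z) coordinates of the cloud base and the cloud top. I then want to use these coordinates on a different 3D array, representing the vertical component of the wind speed, to get the updraft at cloud base.

What I am doing right now works, but it is slow. I feel there must be a way to make use of NumPy to speed it up.

This is what I am doing at the moment: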

# 3d array representing cloud water at a particular timestep t
qc = QC(t)

# get the coordinates where there is cloud
cloud_coords = argwhere( qc > qc_thresh )

# Arrays to hold the z values of cloud base (cb) and cloud top (ct)
zcb = zeros((nx,ny))
zct = zeros((nx,ny))

# Since each coordinate (x,y) will in general have multiple z values
# for cloud I have to loop over all (x,y) and
# pull out max and min height for each point (x,y)
for x in range(nx):
    # Pull out all the coordinates with a given x value
    xslice = cloud_coords[ where(cloud_coords[:,0] == x) ]

    for y in range(ny):       
        # for the given x value select a particular y value
        column = xslice[ where(xslice[:,1] == y) ]

        try:
            zcb[x,y] = min( column[:,2] )
            zct[x,y] = max( column[:,2] )
        except ValueError:
            # Because there may not be any cloud at all
            # (a "hole") we fill the array with an average value
            zcb[x,y] = mean(zcb[zcb.nonzero()])
            zct[x,y] = mean(zct[zct.nonzero()])


# Because I intend to use these as indices I need them to be ints
zcb = array(zcb, dtype='int')
zct = array(zct, dtype='int')

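The output is a 2D array containing the z coordinates of the cloud base (and of the cloud top).

I then use these indices on another array to get variables such as the wind speed at cloud base: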

wind = W(t)
j,i = meshgrid(arange(ny),arange(nx))
wind_base = wind[i,j,zcb]

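I do this for many timesteps in a simulation, and the slowest part is the Python loop over all (x,y) coordinates. Any help with extracting these values faster using NumPy would be much appreciated!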

Answer 1

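You are right to suspect that numpy can solve your problem nicely. As it stands, you are doing several inefficient things, such as looping over every (x,y) pair in Python and explicitly creating new arrays with np.array() at the end just to convert the dtype to int (allocating zcb and zct with an integer dtype in the first place would avoid that extra copy).

You can do most of the work in a few lines of vectorized numpy. The idea is that it is enough to find the indices (along the height axis) where the cloud first appears and where it ends. We can do this in a vectorized way using numpy.argmax. This is really the core of an efficient solution: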

import numpy as np
import matplotlib.pyplot as plt

# generate dummy data
qc_thresh = 0.6
nx,ny,nz = 400,400,100
qc = np.zeros((nx,ny,nz))
# insert random cloud layer
qc[...,50:80] = np.random.rand(nx,ny,30)
# insert holes in clouds for completeness
qc[np.random.randint(nx,size=2*nx),np.random.randint(ny,size=2*nx),:] = 0

def compute_cloud_boundaries():
    cloud_arr = qc > qc_thresh

    # find boundaries by making use of np.argmax returning first maximum
    zcb = np.argmax(cloud_arr,axis=-1)
    zct = nz - 1 - np.argmax(cloud_arr[...,::-1],axis=-1)

    # logical (nx,ny)-shaped array where there's a cloud
    cloud_inds = (zcb != 0) | (zct != nz - 1)
    # argmax returns 0 for an all-False column, so cloud-free columns
    # end up with zcb == 0 and zct == nz-1

    # fill the rest with the mean
    zcb[np.logical_not(cloud_inds)] = zcb[cloud_inds].mean()
    zct[np.logical_not(cloud_inds)] = zct[cloud_inds].mean()

    return zcb,zct

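I checked the above against your approach (on a suitably smaller example), and it gives exactly the same results. As I said, the idea is that cloud_arr = qc > qc_thresh is a logical array telling us where the cloud water is large enough to count as cloud. We then look for the maximum of this (essentially binary!) array along the last (height) axis. Because np.argmax returns the index of the first maximum, it gives us the first (bottom-most) cloudy height index for each horizontal (x,y) position. To get the cloud top, we reverse the array along the height axis and do the same from the other side, taking care to convert the result back to an index into the original array. Reversing the array creates a view rather than a copy, so this is efficient too. Finally, we fix up the columns where there is no cloud at all. For an all-False column argmax returns 0, so in lieu of a better criterion we check where the returned indices correspond to the edge points (zcb equal to 0 and zct equal to nz-1). With real weather data we can be fairly sure that the bottom-most and top-most levels are not cloudy, so this should be a safe criterion.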
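To see the argmax trick in isolation, here is a minimal sketch on a tiny hand-made boolean array. The array `cloud` and its values are made up for illustration and are not taken from the question's data. The last line also shows `any(axis=-1)` as one possible alternative way to flag cloudy columns; that is a swapped-in technique, not the edge-point criterion used in the function above.

```python
import numpy as np

# Hypothetical 2x2 horizontal grid with 6 vertical levels; True marks cloud.
cloud = np.zeros((2, 2, 6), dtype=bool)
cloud[0, 0, 2:5] = True   # cloud on levels 2..4
cloud[0, 1, 1:4] = True   # cloud on levels 1..3
cloud[1, 0, 3] = True     # a single cloudy level
# column (1, 1) is left cloud-free

nz = cloud.shape[-1]

# argmax returns the index of the first maximum, i.e. the first True
base = np.argmax(cloud, axis=-1)
# base is [[2, 1], [3, 0]]; the 0 comes from the all-False column

# Reversing the last axis gives a view; map the index back afterwards
flipped = cloud[..., ::-1]
top = nz - 1 - np.argmax(flipped, axis=-1)
# top is [[4, 3], [3, 5]]; the 5 (= nz-1) comes from the all-False column

# The reversed array shares memory with the original (no copy is made)
is_view = np.shares_memory(cloud, flipped)

# Alternative mask that does not rely on the edge levels being cloud-free
has_cloud = cloud.any(axis=-1)
```

The cloud-free column comes back as base 0 and top nz-1, which is exactly the pair of edge values the function above tests for.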

Here is a cross-section of the dummy data:

Unrepresentative timings for the 400x400x100 case above:

In [24]: %timeit compute_cloud_boundaries()
10 loops, best of 3: 29.1 ms per loop

In [25]: %timeit orig() # original loopy version from the question
1 loop, best of 3: 9.37 s per loop

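That looks like a speed-up of more than 300 times. Of course your actual use case will be the proper test of this approach, but it should be fine.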
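To repeat this kind of comparison outside IPython, something along the following lines could be used. It is only a sketch under stated assumptions: `boundaries_loop` is a simplified per-column reference rather than the question's original code, the grid size and seed are arbitrary, and both functions mark cloud-free columns with -1 instead of filling them with a mean so that the results can be compared exactly.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, arbitrary choice
nx, ny, nz = 20, 20, 30             # small hypothetical grid
qc_thresh = 0.6
qc = np.zeros((nx, ny, nz))
qc[..., 10:20] = rng.random((nx, ny, 10))                        # cloud layer
qc[rng.integers(nx, size=40), rng.integers(ny, size=40), :] = 0  # holes

def boundaries_loop(qc, thresh):
    """Plain per-column reference; -1 marks cloud-free columns."""
    nx, ny, _ = qc.shape
    zcb = np.full((nx, ny), -1)
    zct = np.full((nx, ny), -1)
    for x in range(nx):
        for y in range(ny):
            levels = np.flatnonzero(qc[x, y] > thresh)
            if levels.size:
                zcb[x, y], zct[x, y] = levels[0], levels[-1]
    return zcb, zct

def boundaries_vec(qc, thresh):
    """argmax-based version; -1 marks cloud-free columns."""
    cloud = qc > thresh
    nz = cloud.shape[-1]
    zcb = np.argmax(cloud, axis=-1)
    zct = nz - 1 - np.argmax(cloud[..., ::-1], axis=-1)
    empty = ~cloud.any(axis=-1)
    zcb[empty] = -1
    zct[empty] = -1
    return zcb, zct

loop_result = boundaries_loop(qc, qc_thresh)
vec_result = boundaries_vec(qc, qc_thresh)
same = all(np.array_equal(a, b) for a, b in zip(loop_result, vec_result))
print("results agree:", same)

# Timings depend on the machine and grid size, so none are quoted here
t_loop = timeit.timeit(lambda: boundaries_loop(qc, qc_thresh), number=5)
t_vec = timeit.timeit(lambda: boundaries_vec(qc, qc_thresh), number=5)
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s")
```

Checking agreement before timing is the important part: a fast version is only useful once it reproduces the reference on data that includes cloud-free columns.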

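As for the indexing step, you can save some memory by using an open mesh for the indices and relying on array broadcasting. Not having to allocate the two extra (nx,ny)-shaped index arrays might also speed up this step: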

wind = W(t)
i,j = np.ogrid[:nx,:ny]
wind_base = wind[i,j,zcb]

As you can see, np.ogrid creates an open mesh of two arrays with shapes (nx,1) and (1,ny), which broadcast together to the equivalent of what the meshgrid call produces.
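The equivalence is easy to confirm on a small example. In the sketch below, `wind` and `zcb` are random stand-ins for W(t) and the cloud-base indices (assumptions for illustration only). The last step uses `np.take_along_axis`, which is not part of the answer above but is another NumPy call that should give the same selection without building any horizontal index arrays.

```python
import numpy as np

rng = np.random.default_rng(1)              # fixed seed, arbitrary choice
nx, ny, nz = 4, 5, 6                        # small hypothetical grid
wind = rng.random((nx, ny, nz))             # stand-in for W(t)
zcb = rng.integers(nz, size=(nx, ny))       # stand-in cloud-base indices

# Full (nx, ny) index arrays, as in the question
j_full, i_full = np.meshgrid(np.arange(ny), np.arange(nx))
base_full = wind[i_full, j_full, zcb]

# Open grid: i has shape (nx, 1), j has shape (1, ny);
# they broadcast against the (nx, ny)-shaped zcb
i, j = np.ogrid[:nx, :ny]
base_open = wind[i, j, zcb]

# Alternative: pick one level per column along the last axis
base_take = np.take_along_axis(wind, zcb[..., None], axis=-1)[..., 0]

print(i.shape, j.shape)                      # (4, 1) (1, 5)
print(np.array_equal(base_full, base_open))  # True
print(np.array_equal(base_full, base_take))  # True
```

All three results have shape (nx, ny); the open-grid and `take_along_axis` variants simply avoid materialising the two full index arrays.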
