Removing non-triple rows en masse in numpy array

Question

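I have an array A with 1 million rows and 3 columns. The last column holds integer IDs that identify the data in the other two columns. I want to keep only the rows whose ID occurs exactly three times, and delete every row whose ID occurs any other number of times (for example once, twice, or four times). Below is the function I wrote to handle this, remove_loose_ends. However, this function is called many times and is the bottleneck of the whole program. Is there any way to remove the loop from this operation, or otherwise reduce its run time?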

import numpy as np
import time


def remove_loose_ends(A):
    # get unique counts
    unique_id, unique_counter = np.unique(A[:, 2], return_counts=True)
    # initialize outgoing index mask
    good_index = np.array([[True] * (A.shape[0])])
    # loop through all unique ids and flip their rows to False if the id is not a triplet
    for i in range(0, len(unique_id)):
        if unique_counter[i] != 3:
            good_index = good_index ^ (A[:, 2] == unique_id[i])
    # return incoming array with mask applied
    return A[np.squeeze(good_index), :]

# example array A
A = np.random.rand(1000000,3)
# making last column "unique" integers
A[:,2] = (A[:,2] * 1e6).astype(np.int64)

# timing function call
start = time.time()
B = remove_loose_ends(A)
print(time.time() - start)

Answer 1

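The main problem is that you are effectively looping over all the values twice: the Python loop visits every unique ID, and each comparison A[:, 2] == unique_id[i] inside it scans the whole column again. That makes it roughly an n² operation. What you can do instead is build a boolean array directly from the output of numpy.unique and let it do the indexing for you.

For example, like this: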

import numpy as np
import time


def remove_loose_ends(A):
    # get unique counts
    _, unique_inverse, unique_counter = np.unique(A[:, 2], return_inverse=True, return_counts=True)
    # Obtain boolean array of which integers occurred 3 times
    idx = unique_counter == 3

    # Obtain boolean array of which rows have integers that occurred 3 times
    row_idx = idx[unique_inverse]

    # return incoming array with mask applied
    return A[row_idx, :]

# example array A
A = np.random.rand(1000000,3)
# making last column "unique" integers
A[:,2] = (A[:,2] * 1e6).astype(np.int64)

# timing function call
start = time.time()
B = remove_loose_ends(A)
print(time.time() - start)
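
To make the indexing step easier to follow, here is a minimal sketch on a toy array. The six-row input is made up purely for illustration and is not from the question. It shows what unique_inverse and unique_counter contain and how idx[unique_inverse] expands the per-ID answer into a per-row mask.

```python
import numpy as np

# Hypothetical toy input: the last column holds the identifying integer.
A = np.array([
    [0.1, 0.2, 7.0],
    [0.3, 0.4, 5.0],
    [0.5, 0.6, 7.0],
    [0.7, 0.8, 5.0],
    [0.9, 1.0, 7.0],
    [1.1, 1.2, 9.0],
])

unique_id, unique_inverse, unique_counter = np.unique(
    A[:, 2], return_inverse=True, return_counts=True
)
# unique_id      holds the sorted distinct ids: 5, 7, 9
# unique_inverse maps each row to its position in unique_id: 1, 0, 1, 0, 1, 2
# unique_counter holds the count per id: 2, 3, 1

# One boolean per distinct id: does it occur exactly 3 times?
idx = unique_counter == 3

# Fancy indexing with the inverse map turns that into one boolean per row.
row_idx = idx[unique_inverse]

B = A[row_idx, :]
print(B)  # only the three rows whose id is 7 remain
```

Because unique_inverse has one entry per row of A, idx[unique_inverse] is a single vectorized lookup rather than one full-column comparison per ID.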

I tried timing both versions. I stopped the function you posted after 15 minutes, whereas the one I provided takes about 0.15 seconds on my PC.
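
Since the loop-based approach is too slow to finish on the full array, one way to gain confidence in the vectorized version is to compare it against a deliberately simple reference on a much smaller input. The sketch below is an assumption-laden check, not part of the original answer: the function names, the array size (2000 rows), the ID range (500), and the seed are all arbitrary choices for illustration.

```python
import numpy as np


def remove_loose_ends_reference(A):
    # Slow but obvious reference: count each row's id across the whole column.
    ids = A[:, 2]
    keep = np.array([np.count_nonzero(ids == v) == 3 for v in ids], dtype=bool)
    return A[keep, :]


def remove_loose_ends_vectorized(A):
    # Same idea as the answer: per-id counts expanded to a per-row mask.
    _, inverse, counts = np.unique(A[:, 2], return_inverse=True, return_counts=True)
    return A[(counts == 3)[inverse], :]


rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
A_small = rng.random((2000, 3))
A_small[:, 2] = (A_small[:, 2] * 500).astype(np.int64)

B_ref = remove_loose_ends_reference(A_small)
B_vec = remove_loose_ends_vectorized(A_small)

# Both versions preserve row order, so the results can be compared directly.
same = np.array_equal(B_ref, B_vec)
print(same)
```

Every ID that survives should appear exactly three times in the output, which can be confirmed with another call to np.unique on B_vec[:, 2].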
