How to keep a fixed size of unique values in random positions in an array while replacing others with a mask?

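This may be a very simple question, as I am still exploring Python. I am using numpy for this. Update, 30 Sep 2021: I adopted and modified the code as shown below, for future reference. I also added an elif to the loop for classes whose count is smaller than the required size. Some of the code may be unnecessary.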

import numpy as np

# test_array is the 2D array defined in the original question below
new_array = test_array.copy()
uniques, counts = np.unique(new_array, return_counts=True)
print("classes:", uniques, "counts:", counts)
for unique, count in zip(uniques, counts):
    if unique != 0 and count > 3:
        # class has more than 3 entries: zero out a random selection of count-3 of them
        ids = np.random.choice(count, count-3, replace=False)
        new_array[tuple(i[ids] for i in np.where(new_array == unique))] = 0
    elif unique != 0 and count <= 3:
        # class has 3 or fewer entries: keep them all (this assignment leaves the array unchanged)
        ids = np.random.choice(count, count, replace=False)
        new_array[tuple(i[ids] for i in np.where(new_array == unique))] = unique
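
For reference, the same idea can be sketched with flat indices and a seeded Generator, which makes a separate branch for small classes unnecessary. This is only a sketch under my own assumptions: the function name `mask_all_but_n` and its parameters `n_keep`, `mask_value` and `seed` are made up for illustration and do not appear in the code above.

```python
import numpy as np

def mask_all_but_n(arr, n_keep=3, mask_value=0, seed=None):
    """Hypothetical helper: keep at most n_keep entries per class, mask the rest."""
    rng = np.random.default_rng(seed)
    out = arr.copy()              # ndarray.copy() is C-contiguous by default
    flat = out.reshape(-1)        # 1-D view onto out, so writes go through
    for klass in np.unique(arr):
        if klass == mask_value:
            continue              # the mask value itself is not a class
        # flat positions of every element of this class
        pos = np.flatnonzero(arr == klass)
        n_drop = pos.size - n_keep
        if n_drop > 0:
            # choose which positions to mask, without replacement
            drop = rng.choice(pos, size=n_drop, replace=False)
            flat[drop] = mask_value
        # classes with n_keep or fewer entries are left untouched
    return out

demo = np.array([[1, 1, 1, 1, 1],
                 [2, 2, 2, 4, 4],
                 [4, 4, 4, 2, 2]])
print(mask_all_but_n(demo, n_keep=3, seed=0))
```

Passing the same `seed` twice gives the same result, while leaving it as `None` gives a different selection on each run.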

The original question follows.

Suppose I have a 2D array like this:

test_array = np.array([[0,0,0,0,0],
                      [1,1,1,1,1],
                      [0,0,0,0,0],
                      [2,2,2,4,4],
                      [4,4,4,2,2],
                      [0,0,0,0,0]])
print("existing classes:", np.unique(test_array))
# "existing classes: [0 1 2 4]"

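Now I want to keep a fixed number (e.g. 2) of the non-zero values in each class (in this case two 1s, two 2s and two 4s) and replace the rest with 0. Which values get replaced should be random on each run (or determined by a seed).

For example, run 1 might give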

([[0,0,0,0,0],
[1,0,0,1,0],
[0,0,0,0,0],
[2,0,0,0,4],
[4,0,0,2,0],
[0,0,0,0,0]])

while another run might give

([[0,0,0,0,0],
[1,1,0,0,0],
[0,0,0,0,0],
[2,0,2,0,4],
[4,0,0,0,0],
[0,0,0,0,0]])

and so on. Can anyone help me with this?

Here is my not-so-elegant solution:

def unique(arr, num=2, seed=None):
    np.random.seed(seed)
    vals = {}
    for i, row in enumerate(arr):
        for j, val in enumerate(row):
            if val in vals and val != 0:
                vals[val].append((i, j))
            elif val != 0:
                vals[val] = [(i, j)]
    new = np.zeros_like(arr)
    for val in vals:
        np.random.shuffle(vals[val])
        while len(vals[val]) > num:
            vals[val].pop()
        for row, col in vals[val]:
            new[row,col] = val
    return new

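My strategy is

  1. Create a new array initialized to all zeros
  2. Find the elements in each class
  3. For each class
    • Randomly sample two elements to keep
    • Set those elements of the new array to the class value

The trick is to keep the indices in the right form, so that you can preserve the shape of the original array.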

import numpy as np
test_array = np.array([[0,0,0,0,0],
                      [1,1,1,1,1],
                      [0,0,0,0,0],
                      [2,2,2,4,4],
                      [4,4,4,2,2],
                      [0,0,0,0,0]])

def sample_classes(arr, n_keep=2, random_state=42):
    # Use the array passed in, not the global test_array
    classes, counts = np.unique(arr, return_counts=True)
    rng = np.random.default_rng(random_state)
    out = np.zeros_like(arr)
    # Class 0 is processed too, but writing 0 into the zero-filled output changes nothing
    for klass, count in zip(classes, counts):
        # Find locations of the class elements
        indexes = np.nonzero(arr == klass)
        # Sample up to n_keep elements of the class (all of them if the class is smaller)
        keep_idx = rng.choice(count, min(n_keep, count), replace=False)
        # Select the kept elements and reformat for indexing the output array and retaining its shape
        keep_idx_reshape = tuple(ind[keep_idx] for ind in indexes)
        out[keep_idx_reshape] = klass
    return out

You can use it like this:

In [3]: sample_classes(test_array)
Out[3]:
array([[0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0],
       [0, 0, 0, 0, 0],
       [2, 0, 0, 4, 0],
       [4, 0, 0, 2, 0],
       [0, 0, 0, 0, 0]])

In [4]: sample_classes(test_array, n_keep=3)
Out[4]:
array([[0, 0, 0, 0, 0],
       [1, 0, 1, 1, 0],
       [0, 0, 0, 0, 0],
       [0, 2, 0, 4, 0],
       [4, 4, 0, 2, 2],
       [0, 0, 0, 0, 0]])

In [5]: sample_classes(test_array, random_state=88)
Out[5]:
array([[0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [4, 0, 4, 2, 2],
       [0, 0, 0, 0, 0]])

In [6]: sample_classes(test_array, random_state=88, n_keep=4)
Out[6]:
array([[0, 0, 0, 0, 0],
       [0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0],
       [2, 2, 0, 4, 4],
       [4, 4, 0, 2, 2],
       [0, 0, 0, 0, 0]])
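
The indexing trick mentioned above can be seen in isolation on a tiny array. This is just an illustrative sketch: the array and the fixed `keep` positions are made up here, and in practice `keep` would come from a random draw.

```python
import numpy as np

arr = np.array([[0, 1, 1],
                [1, 0, 1]])

# np.nonzero returns one index array per dimension: (rows, cols)
rows, cols = np.nonzero(arr == 1)
# rows -> [0 0 1 1], cols -> [1 2 0 2]

# positions *within the class* to keep (fixed here for illustration)
keep = np.array([0, 3])

# subset every per-dimension index array with the same positions,
# then pack them back into a tuple so numpy treats it as a 2D index
keep_idx = (rows[keep], cols[keep])

out = np.zeros_like(arr)
out[keep_idx] = 1
print(out)
# [[0 1 0]
#  [0 0 1]]
```

Because the tuple has one entry per dimension, the same pattern should carry over to arrays with more than two dimensions.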

The following should be O(n log n) in the array size

def keep_k_per_class(data,k,rng):
    out = np.zeros_like(data)
    unq,cnts = np.unique(data,return_counts=True)
    assert (cnts >= k).all()
    # calculate class boundaries from class sizes
    CNTS = cnts.cumsum()
    # indirectly group classes together by partial sorting
    idx = data.ravel().argpartition(CNTS[:-1])
    # the following lines implement simultaneous drawing without replacement
    # from all classes

    # lower boundaries of intervals to draw random numbers from
    # for each class they start with the lower class boundary 
    # and from there grow one by one - together with the
    # swapping out below this implements "without replacement"
    lb = np.add.outer(np.arange(k),CNTS-cnts)
    pick = rng.integers(lb,CNTS,lb.shape)
    for l,p in zip(lb,pick):
        # populate output array
        out.ravel()[idx[p]] = unq
        # swap out used indices so still available ones occupy a linear
        # range (per class)
        idx[p] = idx[l]
    return out
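
The "shrinking interval plus swap" step may be easier to follow for a single class. Below is a minimal sketch of that idea on one pool of items (essentially a partial Fisher-Yates shuffle); the helper name `draw_without_replacement` is made up for illustration and is not part of the function above.

```python
import numpy as np

def draw_without_replacement(pool, k, rng):
    """Hypothetical helper: draw k distinct items from pool."""
    pool = np.array(pool)        # work on a copy, the caller's data is untouched
    n = pool.size
    picked = np.empty(k, dtype=pool.dtype)
    for i in range(k):
        # the still-available items always occupy slots i .. n-1,
        # so draw a slot uniformly from [i, n)
        j = rng.integers(i, n)
        picked[i] = pool[j]
        # move the unused item from slot i into the slot just used,
        # so that slots i+1 .. n-1 again hold only available items
        pool[j] = pool[i]
    return picked

rng = np.random.default_rng(0)
print(draw_without_replacement(np.arange(10), 4, rng))
```

In `keep_k_per_class` the same thing happens for all classes at once: each row of `lb` holds the current lower bound for every class, and `idx[p] = idx[l]` plays the role of `pool[j] = pool[i]`.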

Example:

>>> rng = np.random.default_rng()
>>> keep_k_per_class(test_array,2,rng)
array([[0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [2, 0, 2, 0, 4],
       [0, 4, 0, 0, 0],
       [0, 0, 0, 0, 0]])
>>> keep_k_per_class(test_array,2,rng)
array([[0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 2, 0, 0, 0],
       [4, 0, 4, 0, 2],
       [0, 0, 0, 0, 0]])

And a big one:

>>> BIG = np.add.outer(np.tile(test_array,(100,100)),np.arange(0,500,5))
>>> BIG.size
30000000
>>> res = keep_k_per_class(BIG,30,rng)
### takes ~4 sec

### check
>>> np.unique(np.bincount(res.ravel()),return_counts=True)
(array([       0,       30, 29988030]), array([100, 399,   1]))
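
To read the check: `np.bincount` counts how often each value occurs, and `np.unique(..., return_counts=True)` on top of that counts how many values share each occurrence count. Here is the same check as a small sketch on the first small example output shown earlier (the variable names are my own):

```python
import numpy as np

# first example output of keep_k_per_class(test_array, 2, rng) from above
res = np.array([[0, 0, 0, 0, 0],
                [1, 1, 0, 0, 0],
                [0, 0, 0, 0, 0],
                [2, 0, 2, 0, 4],
                [0, 4, 0, 0, 0],
                [0, 0, 0, 0, 0]])

# occurrences of each value 0..4
per_value = np.bincount(res.ravel())
print(per_value)            # [24  2  2  0  2]

# how many values occur 0 times, 2 times, 24 times
occurrences, n_values = np.unique(per_value, return_counts=True)
print(occurrences)          # [ 0  2 24]
print(n_values)             # [1 3 1]
```

So one value (3) never occurs, three classes (1, 2 and 4) occur exactly twice, and one value (0) fills the rest. The big result reads the same way: 100 values absent, 399 classes with exactly 30 entries, and the remaining entries all 0.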