How to select numeric samples based on their distance relative to samples already selected (Python)

Question

I have some random test data in a 2D array of shape (500, 2):

import numpy as np

# randint draws integers, so the lower bound is given as an integer
xy = np.random.randint(low=0, high=1000, size=[500, 2])

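From this array, I first select 10 random samples. To select the 11th sample, I want to pick the sample that is furthest, by Euclidean distance, from the 10 already selected samples taken collectively. I need to keep doing this until a certain number of samples has been selected. Here is my attempt.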

# Function to get the distance between samples
def get_dist(a, b):
    return np.sqrt(np.sum(np.square(a - b)))


# Set up variables and empty lists for the selected sample and starting samples
n_xy_to_select = 120
selected_xy = []
starting = []


# This selects 10 random samples and appends them to selected_xy
for i in range(10):
    idx = np.random.randint(len(xy))
    starting_10 = xy[idx, :]
    selected_xy.append(starting_10)
    starting.append(starting_10)
    xy = np.delete(xy, idx, axis = 0)
starting = np.asarray(starting)


# This performs the selection based on the distances
for i in range(n_xy_to_select - 1):
    # Set up an empty array dists
    dists = np.zeros(len(xy))
    for selected_xy_ in selected_xy:
        # Get the distance between each already selected sample, and every other unselected sample
        dists_ = np.array([get_dist(selected_xy_, xy_) for xy_ in xy])
        # Apply some kind of penalty function - this is the key
        dists_[dists_ < 90] -= 25000
        # Sum dists_ onto dists
        dists += dists_
    # Select the largest one
    dist_max_idx = np.argmax(dists)
    selected_xy.append(xy[dist_max_idx])
    xy = np.delete(xy, dist_max_idx, axis = 0)

The key is this line, the penalty function:

dists_[dists_ < 90] -= 25000

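This penalty function is there to stop the code from simply picking a ring of samples at the edge of the space, by artificially shortening the distances of samples that are close together. However, this eventually breaks down and the selection starts to cluster, as shown in the image. You can clearly see that the code could produce much better selections before any kind of clustering becomes necessary. I feel that some kind of decaying exponential function would work best here, but I don't know how to implement it. So my question is: how would I change the current penalty function to get what I'm looking for?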

Answer 1

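From your question, I understand that what you are looking for is periodic boundary conditions (PBC). This means that a point at the left edge of the space sits right next to the right edge. As a result, the maximum distance you can get along one axis is half of the box (i.e. the distance between the edge and the center).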

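To take the PBC into account, you need to compute the distance along each axis and, whenever it exceeds half of the box, measure it across the boundary instead. For example, with one point at x1 = 100 and a second at x2 = 900 in a box of size 1000, the two points are 200 units apart under PBC: 1000 - |x1 - x2| = 200. In the general case, for two coordinates inside a box of size box_size, you end up with:

$\Delta x = \min\left(|x_1 - x_2|,\ box_\mathrm{size} - |x_1 - x_2|\right)$

In your case, this simplifies to:

delta_x[delta_x > 500] = 1000 - delta_x[delta_x > 500]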
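
The x1 = 100, x2 = 900 example can be checked with a few lines. This is only a minimal sketch, assuming coordinates lie in [0, box_size) and a box size of 1000; the helper name `pbc_delta` is made up for this illustration.

```python
import numpy as np

def pbc_delta(a, b, box_size=1000):
    """One-axis separation under periodic boundaries (hypothetical helper)."""
    delta = np.abs(np.asarray(a) - np.asarray(b))
    # Take the shorter of the direct path and the path across the boundary
    return np.minimum(delta, box_size - delta)

print(pbc_delta(100, 900))  # 200
print(pbc_delta(100, 400))  # 300
```

Points that are less than half a box apart keep their direct separation; only the larger separations are wrapped.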

To wrap up, I rewrote your code using the new distance function (note that I removed some unnecessary for loops):

import numpy as np

def distance(p, arr, box_size=1000):
    # Per-axis separation between point p and every row of arr
    delta_x = np.abs(p[0] - arr[:, 0])
    delta_y = np.abs(p[1] - arr[:, 1])
    # PBC: a separation larger than half the box is shorter when
    # measured across the periodic boundary (box_size - delta)
    half = box_size / 2
    delta_x[delta_x > half] = box_size - delta_x[delta_x > half]
    delta_y[delta_y > half] = box_size - delta_y[delta_y > half]
    return np.sqrt(delta_x**2 + delta_y**2)

xy = np.random.randint(low=0, high=1000, size=[500, 2])
# Draw 10 distinct starting indices (randint could repeat an index)
idx = np.random.choice(len(xy), size=10, replace=False)
selected_xy = list(xy[idx])
_initial_selected = xy[idx]
xy = np.delete(xy, idx, axis=0)
n_xy_to_select = 120


for i in range(n_xy_to_select - 1):
    # Set up an empty array dists
    dists = np.zeros(len(xy))
    for selected_xy_ in selected_xy:
        # Compute the distance taking into account the PBC
        dists_ = distance(selected_xy_, xy)
        dists += dists_
    # Select the largest one
    dist_max_idx = np.argmax(dists)
    selected_xy.append(xy[dist_max_idx])
    xy = np.delete(xy, dist_max_idx, axis=0)

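In practice this does create clusters, which is expected, since you tend to build clusters of points that are as far from each other as possible. On top of that, because of the boundary conditions, the maximum distance between two points along one axis is 500. The maximum separation between two clusters along an axis is therefore also 500! As you can see in the picture, that is indeed the case.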
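
The per-axis bound of 500 can be confirmed by brute force. The sketch below assumes a box size of 1000 and integer coordinates 0..999, matching the test data; the variable names are illustrative only. Note that the bound applies per axis, so the Euclidean distance in 2D can still be larger along the diagonal.

```python
import numpy as np

box_size = 1000  # assumed box size, matching the 0..999 test data
coords = np.arange(box_size)
# Direct separation between every pair of coordinates on one axis
delta = np.abs(coords[:, None] - coords[None, :])
# Periodic separation: shorter of the direct path and the wrapped path
pbc = np.minimum(delta, box_size - delta)
max_axis = pbc.max()
print(max_axis)  # 500
# Largest possible 2D distance when both axes are at their bound
print(np.hypot(max_axis, max_axis))
```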

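Moreover, selecting more points starts to draw lines that connect the different clusters, starting from the central one, as you can see here: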

Answer 2

What I was looking for is 'Furthest Point Sampling'. I looked into this solution further, and Python code for it can be found here: https://minibatchai.com/ai/2021/08/07/FPS.html
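
For readers who want to see the idea without following the link, here is a minimal NumPy sketch of greedy furthest point sampling. It is not the code from the linked page; the function name, its parameters and the seeded test data are assumptions made for this illustration. Unlike the summed-distance criterion in the question, each step picks the unselected sample whose distance to its nearest already selected sample is largest.

```python
import numpy as np

def furthest_point_sampling(points, start_idx, n_total):
    """Greedy furthest point sampling (illustrative sketch).

    points    : (N, D) array of candidate samples
    start_idx : indices of the samples that are already selected
    n_total   : total number of samples wanted, including the starting ones
    Returns the indices of the selected samples, in selection order.
    """
    points = np.asarray(points, dtype=float)
    if n_total > len(points):
        raise ValueError("n_total cannot exceed the number of points")
    selected = [int(i) for i in start_idx]
    # Distance from every point to its nearest already selected sample
    min_dist = np.full(len(points), np.inf)
    for idx in selected:
        d = np.linalg.norm(points - points[idx], axis=1)
        min_dist = np.minimum(min_dist, d)
    # Mark selected samples with -1 so they can never be picked again
    min_dist[selected] = -1.0
    while len(selected) < n_total:
        # Next sample: the one furthest from its nearest selected neighbour
        new_idx = int(np.argmax(min_dist))
        selected.append(new_idx)
        d = np.linalg.norm(points - points[new_idx], axis=1)
        min_dist = np.minimum(min_dist, d)
        min_dist[new_idx] = -1.0
    return np.array(selected)

rng = np.random.default_rng(0)
xy = rng.integers(0, 1000, size=(500, 2))
start = rng.choice(len(xy), size=10, replace=False)
chosen = furthest_point_sampling(xy, start, n_total=120)
print(chosen.shape)  # (120,)
```

Keeping a running minimum distance per point means each new selection costs one pass over the data, instead of recomputing the distances to every selected sample.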


Tags: python, arrays, numpy, selection, sampling