Pairing control group uniquely to test group using KNN in Python

Question

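I want to find unique pairs for a test group, meaning that each person in the control group can be selected only once. I have gender, age, and education to match on. I split both groups by gender and education, since those are binary categories. After that, I want to find the best age match for each test individual, hence a KNN approach with 1 nearest neighbour. The dummyData I am using is available here.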

Below is the initialisation and splitting part:

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

TestGroup = pd.read_csv('KNN_DummyData1.csv', names = ['Gender', 'Age', 'Education'])
ControlGroup = pd.read_csv('KNN_DummyData2.csv', names = ['Gender', 'Age', 'Education'])

#### Split TestGroup and ControlGroup into males and females, high and low education
# Chaining the calls returns new frames and avoids in-place edits on a filtered slice
Males_highEd = TestGroup.loc[(TestGroup['Gender'] == 1) & (TestGroup['Education'] == 1)].drop(columns=['Gender', 'Education']).reset_index(drop=True)

Males_Ctrl_highEd = ControlGroup.loc[(ControlGroup['Gender'] == 1) & (ControlGroup['Education'] == 1)].drop(columns=['Gender', 'Education']).reset_index(drop=True)

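This part is the actual pairing, where I search the control group and fill an empty DataFrame with the matched controls' values. After matching a control, I try to remove it from the original DataFrame (Males_Ctrl_highEd):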

Matched_Males_Ctrl_highEd = pd.DataFrame().reindex_like(Males_highEd)
nbrs = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(Males_Ctrl_highEd)

for i in range(len(Males_highEd)):
    distances, indices = nbrs.kneighbors(Males_highEd[i:i+1])
    Matched_Males_Ctrl_highEd.loc[0].iat[i] = Males_Ctrl_highEd.loc[indices[0]]
    print(f"{i} controls of {len(Males_highEd)} tests found")
    Males_Ctrl_highEd = Males_Ctrl_highEd.drop(labels=indices[0], axis=0)

Line 6 currently raises the following error:

ValueError: setting an array element with a sequence.

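I have tried several ways of assigning the control to the matched control group, but I cannot seem to copy an individual from the original DataFrame into the empty DataFrame.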

In case it helps, I have a working implementation in MATLAB (but I need it in Python as well):

ControlGroup = Data;
Idx = NaN(length(Data),1);
for i=1:length(Data)
   Idx(i,1) = knnsearch(Data2,Data(i,:),'distance','seuclidean');
   ControlGroup(i,:) = Data2(Idx(i),:);
   Data2(Idx(i),:) = [];
end
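For comparison, a rough Python rendering of the MATLAB loop might look like the sketch below. It is only an illustration under assumptions: the function name `match_with_refit` and the toy ages are made up, it uses plain Euclidean distance rather than `seuclidean`, it assumes the control index labels are unique, and it assumes there are at least as many controls as test individuals. Rows are copied with `iloc` and collected in a list instead of being written into a pre-allocated frame.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def match_with_refit(test: pd.DataFrame, control: pd.DataFrame) -> pd.DataFrame:
    """Greedy 1-NN matching without replacement; refits after every match."""
    remaining = control.copy()
    matched_rows = []
    for i in range(len(test)):
        # Refit on the controls that are still available (simple but slow).
        nbrs = NearestNeighbors(n_neighbors=1).fit(remaining.to_numpy())
        _, indices = nbrs.kneighbors(test.iloc[[i]].to_numpy())
        pos = int(indices[0, 0])  # position within `remaining`, not a label
        matched_rows.append(remaining.iloc[pos])
        # Remove the chosen control so it cannot be picked again.
        remaining = remaining.drop(index=remaining.index[pos])
    # One row per test individual; the index keeps the original control labels.
    return pd.DataFrame(matched_rows)


# Hypothetical toy data: ages only.
test = pd.DataFrame({"Age": [30, 31, 50]})
control = pd.DataFrame({"Age": [29, 30, 52, 70]})
matched = match_with_refit(test, control)
print(matched)
```

The key difference from the failing loop is that `kneighbors` returns positions, so the row is fetched with `iloc` and dropped by the label found at that position.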

If you have any ideas or comments on a different implementation that does the same thing, I am all ears.

Answer 1

I ended up using only age in the KNN matching (and matching the binary features manually), with the following solution:

# Ask for enough neighbours to cover the most common test age, plus one spare
neededNeighbors = TestGroup["Age"].value_counts().max() + 1
nn = NearestNeighbors(n_neighbors=neededNeighbors, algorithm="ball_tree", metric="euclidean").fit(ControlGroup["Age"].to_numpy().reshape(-1, 1))
TestGroup.sort_values(by="Age", inplace=True)

min_age = min(TestGroup["Age"])
max_age = max(TestGroup["Age"])
ages = list(range(min_age, max_age + 1))  # assumes integer ages
# Query once per age so that row x of idx holds the candidates for age x, nearest first
distances, indices = nn.kneighbors(np.array(ages).reshape(-1, 1))
idx = pd.DataFrame(indices, index=ages)
# cntr tracks how many controls have already been handed out for each age
cntr = pd.DataFrame(index=ages, columns=["cntrs"])
cntr["cntrs"] = 0

matchedControlGroup = pd.DataFrame().reindex_like(TestGroup)
matchedID = pd.DataFrame(np.full_like(np.arange(len(matchedControlGroup)), np.nan, dtype=np.double))

for i in range(len(TestGroup)):
    if TestGroup["Age"].loc[i] in cntr.index:
        x = TestGroup["Age"].loc[i]
        # kneighbors returns positions, so take the next unused candidate with iloc
        matchedControlGroup.loc[i] = ControlGroup.iloc[idx.loc[x, cntr.loc[x, "cntrs"]]]
        cntr.loc[x, "cntrs"] += 1  # advance the counter for this age, not for row i
        matchedID.loc[i] = TestGroup["ID"].loc[i]  # requires an "ID" column in TestGroup

matchedID["ID_Match"] = matchedID[0]

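This way I can look up how many people are needed for each age and step through each age group to give every individual the next-best match. It means that the first test individual in each age group gets the better match and, depending on the number of available controls, there may be overlap, i.e. the same control being selected more than once.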
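The order dependence described above comes from matching greedily. As a point of comparison only (a swapped-in technique, not what this answer implements), the pairing can be posed as an assignment problem that minimises the total age difference and uses each control at most once. The sketch below uses `scipy.optimize.linear_sum_assignment` on a hypothetical toy example; the ages are invented, and the full cost matrix has one entry per test/control pair, so this is only practical for moderately sized groups.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical toy ages; there must be at least as many controls as tests
# for every test individual to receive a control.
test_age = np.array([30, 33, 50])
control_age = np.array([29, 31, 52, 70])

# cost[i, j] = absolute age difference between test i and control j
cost = np.abs(test_age[:, None] - control_age[None, :])

# row_ind[k] is a test position, col_ind[k] the control assigned to it;
# no control position appears twice in col_ind.
row_ind, col_ind = linear_sum_assignment(cost)
total_difference = cost[row_ind, col_ind].sum()

for t, c in zip(row_ind, col_ind):
    print(f"test age {test_age[t]} -> control age {control_age[c]}")
```

Because the total is minimised over all pairs at once, the result does not depend on the order in which test individuals are listed.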

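I also wrote an implementation in which this does not happen; however, I could not find a way to avoid re-fitting the KNN every time a match is found, which makes that implementation very slow.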
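One possible way around the re-fitting, sketched here under assumptions rather than taken from the implementation mentioned above, is to fit once, ask for a ranked list of candidates per test individual, and skip candidates that are already taken. The function name `match_fit_once`, the parameter `k`, and the toy ages are hypothetical. With `k` left at its default, every control is ranked for every test individual, which costs memory proportional to the product of the two group sizes.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def match_fit_once(test_age, control_age, k=None):
    """Return, for each test age, the position of a distinct control (-1 if none)."""
    test_age = np.asarray(test_age, dtype=float).reshape(-1, 1)
    control_age = np.asarray(control_age, dtype=float).reshape(-1, 1)
    # By default rank every control; a smaller k saves memory but may leave gaps.
    k = len(control_age) if k is None else min(k, len(control_age))
    nn = NearestNeighbors(n_neighbors=k).fit(control_age)
    _, indices = nn.kneighbors(test_age)  # each row is ordered nearest first
    used = set()
    matches = np.full(len(test_age), -1, dtype=int)
    for i, candidates in enumerate(indices):
        for c in candidates:
            c = int(c)
            if c not in used:  # take the nearest control that is still free
                used.add(c)
                matches[i] = c
                break
    return matches


# Hypothetical toy ages.
matches = match_fit_once([30, 31, 50], [29, 30, 52, 70])
print(matches)
```

This is still greedy, so earlier test individuals get first pick, but the model is fitted only once and no control is returned twice.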

python

matching

knn

dataframe

pandas