Maximal matching between two pandas dataframes

Suppose we have two dataframes.

original_data

sequence_number fixed_criteria fuzzy_criteria
1 a 10.42
2 b 1.27
3 b 6.32
4 a 5.91

jumbled_data

sequence_number fixed_criteria fuzzy_criteria
11 b 6.43
12 b 1.26
13 a 9.98
14 a 15.84
15 a 6.01

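I would then like to match these rows one-to-one, where the matching maximizes the number of matched pairs and minimizes the differences in fuzzy_criteria. That is, the matching would be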

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
1 10.42 a 9.98 13 0.44
2 1.27 b 1.26 12 0.01
3 6.32 b 6.43 11 0.11
4 5.91 a 6.01 15 0.1

Edit:

To emphasize the need for a maximal matching, consider the following example:

original_data

sequence_number fixed_criteria fuzzy_criteria
1 a 1
2 a 2

jumbled_data

sequence_number fixed_criteria fuzzy_criteria
13 a 1.9
14 a 2.9

Joining on fixed_criteria would then give the following candidate pairs (sorted by smallest difference):

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
2 2 a 1.9 13 0.1
1 1 a 1.9 13 0.9
2 2 a 2.9 14 0.9
1 1 a 2.9 14 1.9

Dropping duplicates on sequence_number_original would then give the following

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
2 2 a 1.9 13 0.1
1 1 a 1.9 13 0.9

And then dropping duplicates on sequence_number_jumbled

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
2 2 a 1.9 13 0.1

Or vice versa. First on sequence_number_jumbled ...

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
2 2 a 1.9 13 0.1
2 2 a 2.9 14 0.9

Then on sequence_number_original ...

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
2 2 a 1.9 13 0.1

But this is not maximal, because the following matching exists:

sequence_number_original fuzzy_criteria_original fixed_criteria fuzzy_criteria_jumbled sequence_number_jumbled fuzz_diff
1 1 a 1.9 13 0.9
2 2 a 2.9 14 0.9
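
The steps above can be reproduced in pandas. This is only a minimal sketch of the greedy "sort, then drop duplicates" procedure described in this edit, built on the two-row example; the names candidates and greedy are introduced here for illustration.

```python
import pandas as pd

# The two-row example from the edit above.
original_data = pd.DataFrame({
    'sequence_number': [1, 2],
    'fixed_criteria': ['a', 'a'],
    'fuzzy_criteria': [1.0, 2.0],
})
jumbled_data = pd.DataFrame({
    'sequence_number': [13, 14],
    'fixed_criteria': ['a', 'a'],
    'fuzzy_criteria': [1.9, 2.9],
})

# All candidate pairs that agree on fixed_criteria, closest first.
candidates = pd.merge(
    original_data, jumbled_data,
    on='fixed_criteria', suffixes=('_original', '_jumbled'),
)
candidates['fuzz_diff'] = (
    candidates['fuzzy_criteria_original'] - candidates['fuzzy_criteria_jumbled']
).abs()
candidates = candidates.sort_values('fuzz_diff', kind='stable')

# Greedy de-duplication, one column after the other.
greedy = (
    candidates
    .drop_duplicates('sequence_number_original')
    .drop_duplicates('sequence_number_jumbled')
)
print(greedy[['sequence_number_original', 'sequence_number_jumbled']])
```

Only the pair (2, 13) survives, although a matching with two pairs exists, which is why a plain sort-and-deduplicate is not enough here.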

There are maximum matching algorithms in graph theory. I did just come across a question that is similar to mine.

If neither fuzzy_criteria column contains duplicate values, you can create a helper dataframe that determines the closest values between the two fuzzy_criteria columns.

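import pandas as pd
from itertools import product

# All (original, jumbled) value pairs, closest pairs first
df = pd.DataFrame(sorted(product(original_data['fuzzy_criteria'], jumbled_data['fuzzy_criteria']), key=lambda t: abs(t[0]-t[1])))
# Keep the closest pair for each original value, then for each jumbled value
df = df.drop_duplicates(0, keep='first')
df = df.drop_duplicates(1, keep='first')
print(df)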

       0     1
0   1.27  1.26
1   5.91  6.01
2   6.32  6.43
4  10.42  9.98
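
Because the helper dataframe is later merged back on the fuzzy_criteria values themselves, the no-duplicates assumption matters. A quick way to check it up front might look like the sketch below; has_unique_fuzzy is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

original_data = pd.DataFrame({'fuzzy_criteria': [10.42, 1.27, 6.32, 5.91]})
jumbled_data = pd.DataFrame({'fuzzy_criteria': [6.43, 1.26, 9.98, 15.84, 6.01]})

def has_unique_fuzzy(frame, column='fuzzy_criteria'):
    """Return True if no value in `column` occurs more than once."""
    return bool(frame[column].is_unique)

# Both must hold before merging on the fuzzy_criteria values.
print(has_unique_fuzzy(original_data), has_unique_fuzzy(jumbled_data))
```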

Then merge each of the two dataframes with this helper dataframe separately, and finally merge the two results on the helper dataframe's columns.

df_ = pd.merge(
    (pd.merge(original_data, df, left_on='fuzzy_criteria', right_on=0)),
    (pd.merge(df, jumbled_data, left_on=1, right_on='fuzzy_criteria')),
    on=[0,1],
    suffixes=('_original', '_jumbled')
).drop([0, 1], axis=1)
df_['fuzz_diff'] = (df_['fuzzy_criteria_original'] - df_['fuzzy_criteria_jumbled']).abs()
print(df_)

   sequence_number_original fixed_criteria_original  fuzzy_criteria_original  \
0                         1                       a                    10.42
1                         2                       b                     1.27
2                         3                       b                     6.32
3                         4                       a                     5.91

   sequence_number_jumbled fixed_criteria_jumbled  fuzzy_criteria_jumbled  \
0                       13                      a                    9.98
1                       12                      b                    1.26
2                       11                      b                    6.43
3                       15                      a                    6.01

   fuzz_diff
0       0.44
1       0.01
2       0.11
3       0.10

This is mostly copied from an answer by @SpghttCd.

The idea is to use networkx to find a maximum matching.

import pandas as pd
import networkx as nx

# Data input

original_data = pd.DataFrame({
    'sequence_number' : [1,2,3,4],
    'fixed_criteria' : ['a','b','b','a'],
    'fuzzy_criteria' : [10.42, 1.27, 6.32, 5.91]
})

jumbled_data = pd.DataFrame({
    'sequence_number' : [11,12,13,14,15],
    'fixed_criteria' : ['b','b','a','a','a'],
    'fuzzy_criteria' : [6.43, 1.26, 9.98, 15.84, 6.01]
})

# Merge along fixed criteria

joined_data = pd.merge(
    original_data,
    jumbled_data,
    how = 'inner',
    on = ['fixed_criteria'],
    suffixes=['_original','_jumbled']
)

# To use max weight, take the reciprocal of the difference (if a pair can have
# identical values, the difference is zero and this will have to be changed)

joined_data['weight'] = (1/abs(
    joined_data['fuzzy_criteria_original'] -
    joined_data['fuzzy_criteria_jumbled']
))

# Form graph

matching_graph = nx.from_pandas_edgelist(
    joined_data,
    source = 'sequence_number_original',
    target = 'sequence_number_jumbled',
    edge_attr = 'weight'
)

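# Find matching (maxcardinality=True asks for as many pairs as possible)

matching = nx.max_weight_matching(
    matching_graph,
    maxcardinality = True,
    weight = 'weight'
)

# Convert results back into dataframe and format.
# Each matched pair comes back in arbitrary order, so put the original
# sequence number first (assumes the two frames share no sequence numbers).

original_ids = set(original_data['sequence_number'])

results = pd.DataFrame(
    [(u, v) if u in original_ids else (v, u) for u, v in matching],
    columns=['sequence_number_original', 'sequence_number_jumbled']
)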

results = pd.merge(
    results,
    joined_data,
    how = 'inner',
    on = ['sequence_number_original', 'sequence_number_jumbled'],
)

results['fuzzy_difference'] = abs(
    results['fuzzy_criteria_original'] -
    results['fuzzy_criteria_jumbled']
)

print(results)
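
One caveat with the reciprocal weights: they reward very small individual differences rather than the smallest total difference. On the two-row example from the question's edit, 1/0.1 + 1/1.9 is larger than 1/0.9 + 1/0.9, so the pairs (2, 13) and (1, 14) would be preferred even though their total difference is 2.0 instead of 1.8. A possible variant, sketched below under my own assumptions, uses the weight offset - diff together with maxcardinality=True. Since every maximum-cardinality matching has the same number of edges, maximizing that sum is the same as minimizing the total difference, and a zero difference no longer causes a division by zero. The node labels ('original', n) and ('jumbled', n) and the name offset are illustrative choices, not from the answer above.

```python
import networkx as nx
import pandas as pd

# The two-row example from the question's edit.
original_data = pd.DataFrame({
    'sequence_number': [1, 2],
    'fixed_criteria': ['a', 'a'],
    'fuzzy_criteria': [1.0, 2.0],
})
jumbled_data = pd.DataFrame({
    'sequence_number': [13, 14],
    'fixed_criteria': ['a', 'a'],
    'fuzzy_criteria': [1.9, 2.9],
})

pairs = pd.merge(
    original_data, jumbled_data,
    on='fixed_criteria', suffixes=('_original', '_jumbled'),
)
pairs['fuzz_diff'] = (
    pairs['fuzzy_criteria_original'] - pairs['fuzzy_criteria_jumbled']
).abs()

# Larger weight means smaller difference; weights stay positive and finite.
offset = pairs['fuzz_diff'].max() + 1

graph = nx.Graph()
for row in pairs.itertuples(index=False):
    # Tagging each node with its side avoids clashes between sequence numbers.
    graph.add_edge(
        ('original', int(row.sequence_number_original)),
        ('jumbled', int(row.sequence_number_jumbled)),
        weight=offset - row.fuzz_diff,
    )

matching = nx.max_weight_matching(graph, maxcardinality=True, weight='weight')

# Pairs come back in arbitrary order; put the original node first.
matched = sorted(
    (u[1], v[1]) if u[0] == 'original' else (v[1], u[1])
    for u, v in matching
)
print(matched)
```

The matched list can be turned back into a dataframe and merged with pairs in the same way as results above.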