Maximal matching between two pandas dataframes
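Suppose we have two dataframes.
original_data
sequence_number | fixed_criteria | fuzzy_criteria |
---|---|---|
1 | a | 10.42 |
2 | b | 1.27 |
3 | b | 6.32 |
4 | a | 5.91 |
jumbled_data
sequence_number | fixed_criteria | fuzzy_criteria |
---|---|---|
11 | b | 6.43 |
12 | b | 1.26 |
13 | a | 9.98 |
14 | a | 15.84 |
15 | a | 6.01 |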
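I then want to match the rows of these two dataframes one-to-one, pairing rows that share the same fixed_criteria. The matching should be as large as possible and, subject to that, should minimise the differences in fuzzy_criteria. That is, the matching would be
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
1 | 10.42 | a | 9.98 | 13 | 0.44 |
2 | 1.27 | b | 1.26 | 12 | 0.01 |
3 | 6.32 | b | 6.43 | 11 | 0.11 |
4 | 5.91 | a | 6.01 | 15 | 0.1 |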
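Edit:
To emphasise why a maximal matching is needed, consider the following example:
original_data
sequence_number | fixed_criteria | fuzzy_criteria |
---|---|---|
1 | a | 1 |
2 | a | 2 |
jumbled_data
sequence_number | fixed_criteria | fuzzy_criteria |
---|---|---|
13 | a | 1.9 |
14 | a | 2.9 |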
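Listing every candidate pair, sorted by smallest difference, then gives:
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
1 | 1 | a | 1.9 | 13 | 0.9 |
2 | 2 | a | 2.9 | 14 | 0.9 |
1 | 1 | a | 2.9 | 14 | 1.9 |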
Dropping duplicates in sequence_number_original then gives the following:
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
1 | 1 | a | 1.9 | 13 | 0.9 |
And then dropping duplicates in sequence_number_jumbled leaves:
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
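The same happens in the opposite order. Dropping duplicates in sequence_number_jumbled first ...
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
2 | 2 | a | 2.9 | 14 | 0.9 |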
And then in sequence_number_original ...
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
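However, this is not maximal, because the following matching of two pairs exists:
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
1 | 1 | a | 1.9 | 13 | 0.9 |
2 | 2 | a | 2.9 | 14 | 0.9 |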
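The greedy de-duplication described above can be reproduced in a few lines, which makes the shortfall easy to check. This is only a sketch built on the two-row example; the variable names candidates and greedy are illustrative and do not come from the original post.

```python
import pandas as pd

# The small example from the edit above.
original_data = pd.DataFrame({
    "sequence_number": [1, 2],
    "fixed_criteria": ["a", "a"],
    "fuzzy_criteria": [1.0, 2.0],
})
jumbled_data = pd.DataFrame({
    "sequence_number": [13, 14],
    "fixed_criteria": ["a", "a"],
    "fuzzy_criteria": [1.9, 2.9],
})

# Every candidate pair that agrees on fixed_criteria, closest pairs first.
candidates = pd.merge(
    original_data, jumbled_data,
    on="fixed_criteria", suffixes=("_original", "_jumbled"),
)
candidates["fuzz_diff"] = (
    candidates["fuzzy_criteria_original"] - candidates["fuzzy_criteria_jumbled"]
).abs()
candidates = candidates.sort_values("fuzz_diff", kind="stable")

# Greedy de-duplication: keep the closest row per original row,
# then the closest remaining row per jumbled row.
greedy = (
    candidates
    .drop_duplicates("sequence_number_original", keep="first")
    .drop_duplicates("sequence_number_jumbled", keep="first")
)

print(greedy[["sequence_number_original", "sequence_number_jumbled", "fuzz_diff"]])
print(len(greedy), "pair(s) kept out of a possible 2")
```

Only the pair (2, 13) survives, because row 2 claims jumbled row 13 and row 1 has already lost its alternative by the time the second de-duplication runs.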
Graph theory has maximum matching algorithms. I have only just seen one that is similar to mine.
If neither fuzzy_criteria column contains duplicate values, you can create a helper dataframe that determines the closest values between the two fuzzy_criteria columns.
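import pandas as pd
from itertools import product

# every (original, jumbled) pair of fuzzy values, closest pairs first
df = pd.DataFrame(sorted(
    product(original_data['fuzzy_criteria'], jumbled_data['fuzzy_criteria']),
    key=lambda t: abs(t[0] - t[1])))
# keep the closest partner for each value, one column at a time
df = df.drop_duplicates(0, keep='first')
df = df.drop_duplicates(1, keep='first')
print(df)

       0     1
0   1.27  1.26
1   5.91  6.01
2   6.32  6.43
4  10.42  9.98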
Then merge each of the two dataframes with this helper dataframe, and finally merge the two results on the helper dataframe's columns.
df_ = pd.merge(
    (pd.merge(original_data, df, left_on='fuzzy_criteria', right_on=0)),
    (pd.merge(df, jumbled_data, left_on=1, right_on='fuzzy_criteria')),
    on=[0, 1],
    suffixes=('_original', '_jumbled')
).drop([0, 1], axis=1)
df_['fuzz_diff'] = (df_['fuzzy_criteria_original'] - df_['fuzzy_criteria_jumbled']).abs()
print(df_)
   sequence_number_original fixed_criteria_original  fuzzy_criteria_original  \
0                         1                       a                    10.42
1                         2                       b                     1.27
2                         3                       b                     6.32
3                         4                       a                     5.91

   sequence_number_jumbled fixed_criteria_jumbled  fuzzy_criteria_jumbled  \
0                       13                      a                    9.98
1                       12                      b                    1.26
2                       11                      b                    6.43
3                       15                      a                    6.01

   fuzz_diff
0       0.44
1       0.01
2       0.11
3       0.10
This is largely copied from an answer by @SpghttCd.
The idea is to use networkx to compute the maximal matching.
import pandas as pd
import networkx as nx

# Data input
original_data = pd.DataFrame({
    'sequence_number': [1, 2, 3, 4],
    'fixed_criteria': ['a', 'b', 'b', 'a'],
    'fuzzy_criteria': [10.42, 1.27, 6.32, 5.91]
})
jumbled_data = pd.DataFrame({
    'sequence_number': [11, 12, 13, 14, 15],
    'fixed_criteria': ['b', 'b', 'a', 'a', 'a'],
    'fuzzy_criteria': [6.43, 1.26, 9.98, 15.84, 6.01]
})

# Merge along fixed criteria
joined_data = pd.merge(
    original_data,
    jumbled_data,
    how='inner',
    on=['fixed_criteria'],
    suffixes=('_original', '_jumbled')
)

# To use max weight, take the reciprocal of the difference (this will have to
# be changed if a difference can be zero, i.e. if the values are not unique)
joined_data['weight'] = 1 / abs(
    joined_data['fuzzy_criteria_original'] -
    joined_data['fuzzy_criteria_jumbled']
)

# Form graph (assumes the two frames share no sequence_number values)
matching_graph = nx.from_pandas_edgelist(
    joined_data,
    source='sequence_number_original',
    target='sequence_number_jumbled',
    edge_attr='weight'
)

# Find matching; maxcardinality=True restricts the search to maximum-size matchings
matching = nx.max_weight_matching(
    matching_graph,
    maxcardinality=True,
    weight='weight'
)

# Each edge comes back in arbitrary order, so put the original node first
original_ids = set(original_data['sequence_number'])
pairs = [(u, v) if u in original_ids else (v, u) for u, v in matching]

# Convert results back into dataframe and format
results = pd.DataFrame(
    pairs,
    columns=['sequence_number_original', 'sequence_number_jumbled']
)
results = pd.merge(
    results,
    joined_data,
    how='inner',
    on=['sequence_number_original', 'sequence_number_jumbled'],
)
results['fuzzy_difference'] = abs(
    results['fuzzy_criteria_original'] -
    results['fuzzy_criteria_jumbled']
)
print(results)
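One possible variation on the code above, offered only as a sketch: the reciprocal weight breaks when a difference is zero, and maximising a sum of reciprocals is not the same as minimising the sum of differences. The helper below, match_frames, is a hypothetical name, and the weight offset - fuzz_diff is a substitution of mine rather than part of the original answer. Because every maximum-cardinality matching has the same number of edges, maximising that weight among them minimises the total difference. The sketch assumes sequence_number is unique within each frame, and it tags nodes with their side so the two frames may reuse the same numbers.

```python
import networkx as nx
import pandas as pd


def match_frames(original, jumbled):
    """Maximum-size matching with the smallest total fuzzy difference (sketch)."""
    pairs = pd.merge(
        original, jumbled,
        on="fixed_criteria", suffixes=("_original", "_jumbled"),
    )
    pairs["fuzz_diff"] = (
        pairs["fuzzy_criteria_original"] - pairs["fuzzy_criteria_jumbled"]
    ).abs()
    if pairs.empty:
        return pairs

    # Strictly positive weights: a smaller difference gives a larger weight.
    offset = pairs["fuzz_diff"].max() + 1.0
    graph = nx.Graph()
    for row in pairs.itertuples(index=False):
        graph.add_edge(
            ("original", row.sequence_number_original),
            ("jumbled", row.sequence_number_jumbled),
            weight=offset - row.fuzz_diff,
        )

    matching = nx.max_weight_matching(graph, maxcardinality=True, weight="weight")

    # Edges are returned in arbitrary orientation; normalise to (original, jumbled).
    chosen = set()
    for u, v in matching:
        if u[0] == "jumbled":
            u, v = v, u
        chosen.add((u[1], v[1]))

    keep = [
        (o, j) in chosen
        for o, j in zip(pairs["sequence_number_original"],
                        pairs["sequence_number_jumbled"])
    ]
    return pairs[keep].sort_values("sequence_number_original").reset_index(drop=True)


# The two-row example from the question's edit.
small_original = pd.DataFrame({
    "sequence_number": [1, 2],
    "fixed_criteria": ["a", "a"],
    "fuzzy_criteria": [1.0, 2.0],
})
small_jumbled = pd.DataFrame({
    "sequence_number": [13, 14],
    "fixed_criteria": ["a", "a"],
    "fuzzy_criteria": [1.9, 2.9],
})
small_result = match_frames(small_original, small_jumbled)
print(small_result[["sequence_number_original", "sequence_number_jumbled", "fuzz_diff"]])
```

On the two-row example this should pair 1 with 13 and 2 with 14, which is the maximal matching the question asks for.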