How to merge datasets using the pandas merge function when the merge keys are non-unique

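I have a large amount of data in two files that I want to merge, but the merge key 'ChannelPartnerID' does not have a unique value on every row in one of the datasets. I want each key to be merged with the first matching row in the second dataset, with NaN values assigned to the remaining rows. Here is a sample of the data: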

import pandas as pd 
t1 = {'ChannelPartnerID': [18263,50429,98593,44804,81015,15273,51484,87695,33906,70117],
 'response': [1, 0, 1, 0, 0, 1, 1, 0, 0, 0],
 'n_comp': [2, 1, 0, 4, 4, 2, 1, 3, 3, 5],
 'loyalty': [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
 'portal': [1, 1, 0, 1, 1, 1, 0, 1, 0, 1],
 'rewards': [0, 1, 0, 1, 1, 1, 0, 0, 1, 1],
 'nps': [7, 3, 9, 2, 2, 5, 6, 8, 2, 8],
 'n_yrs': [8, 3, 6, 5, 2, 7, 6, 3, 3, 6]}

t2 = {'ChannelPartnerID': [18263,10000,18263,33906,10000,81015,10001,81015,10001,33906],
 'Month': [5, 10, 8, 10, 1, 4, 3, 10, 12, 7],
 'Year': [2019, 2019, 2019, 2019, 2018, 2018, 2019, 2018, 2019, 2019],
 'Brand': ['B4', 'B5', 'B1', 'B3', 'B2', 'B2', 'B2', 'B5', 'B2', 'B1'],
 'Sales': [21793, 7155, 29630, 1530, 3965, 34608, 39256, 14612, 2902, 6122]}

t1 = pd.DataFrame(t1)
t1

Output:

ChannelPartnerID    response    n_comp  loyalty portal  rewards nps n_yrs
0   18263   1   2   1   1   0   7   8
1   50429   0   1   1   1   1   3   3
2   98593   1   0   1   0   0   9   6
3   44804   0   4   1   1   1   2   5
4   81015   0   4   1   1   1   2   2
5   15273   1   2   1   1   1   5   7
6   51484   1   1   1   0   0   6   6
7   87695   0   3   0   1   0   8   3
8   33906   0   3   1   0   1   2   3
9   70117   0   5   1   1   1   8   6


t2 = pd.DataFrame(t2)
t2

Output:

ChannelPartnerID    Month   Year    Brand   Sales
0   18263   5   2019    B4  21793
1   10000   10  2019    B5  7155
2   18263   8   2019    B1  29630
3   33906   10  2019    B3  1530
4   10000   1   2018    B2  3965
5   81015   4   2018    B2  34608
6   10001   3   2019    B2  39256
7   81015   10  2018    B5  14612
8   10001   12  2019    B2  2902
9   33906   7   2019    B1  6122

pd.merge(t1, t2, on=['ChannelPartnerID'], how='left')

Output:

ChannelPartnerID    response    n_comp  loyalty portal  rewards nps n_yrs   Month   Year    Brand   Sales
0   18263   1   2   1   1   0   7   8   5.0 2019.0  B4  21793.0
1   18263   1   2   1   1   0   7   8   8.0 2019.0  B1  29630.0
2   50429   0   1   1   1   1   3   3   NaN NaN NaN NaN
3   98593   1   0   1   0   0   9   6   NaN NaN NaN NaN
4   44804   0   4   1   1   1   2   5   NaN NaN NaN NaN
5   81015   0   4   1   1   1   2   2   4.0 2018.0  B2  34608.0
6   81015   0   4   1   1   1   2   2   10.0    2018.0  B5  14612.0
7   15273   1   2   1   1   1   5   7   NaN NaN NaN NaN
8   51484   1   1   1   0   0   6   6   NaN NaN NaN NaN
9   87695   0   3   0   1   0   8   3   NaN NaN NaN NaN
10  33906   0   3   1   0   1   2   3   10.0    2019.0  B3  1530.0
11  33906   0   3   1   0   1   2   3   7.0 2019.0  B1  6122.0
12  70117   0   5   1   1   1   8   6   NaN NaN NaN NaN

What I want:

ChannelPartnerID    response    n_comp  loyalty portal  rewards nps n_yrs   Month   Year    Brand   Sales
0   18263   1   2   1   1   0   7   8   5.0 2019.0  B4  21793.0
1   18263   NaN NaN NaN NaN NaN NaN NaN 8.0 2019.0  B1  29630.0
2   50429   0   1   1   1   1   3   3   NaN NaN NaN NaN
3   98593   1   0   1   0   0   9   6   NaN NaN NaN NaN
4   44804   0   4   1   1   1   2   5   NaN NaN NaN NaN
5   81015   0   4   1   1   1   2   2   4.0 2018.0  B2  34608.0
6   81015   NaN NaN NaN NaN NaN NaN NaN 10.0 2018.0 B5  14612.0
7   15273   1   2   1   1   1   5   7   NaN NaN NaN NaN
8   51484   1   1   1   0   0   6   6   NaN NaN NaN NaN
9   87695   0   3   0   1   0   8   3   NaN NaN NaN NaN
10  33906   0   3   1   0   1   2   3   10.0 2019.0 B3  1530.0
11  33906   NaN NaN NaN NaN NaN NaN NaN 7.0 2019.0  B1  6122.0
12  70117   0   5   1   1   1   8   6   NaN NaN NaN NaN


You can use the following for your use case.

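# First t2 row for each key, and the remaining t2 rows whose key also exists in t1
t2_first = t2.groupby('ChannelPartnerID', as_index=False).nth(0)
t2_rest = t2[~t2.index.isin(t2_first.index)]
t2_rest = t2_rest[t2_rest['ChannelPartnerID'].isin(t1['ChannelPartnerID'])]
merge_1 = t1.merge(t2_first, on='ChannelPartnerID', how='left')

# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
# A stable sort keeps each fully merged row ahead of its NaN-padded rows.
result = pd.concat([merge_1, t2_rest]).sort_values('ChannelPartnerID', kind='stable')
result = result.reset_index(drop=True)
result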
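
If the row order of t1 matters, as it does in the table in the question, a different approach may fit better. The following is only a sketch, not part of the approach above: it runs the ordinary left merge and then blanks the t1-only columns on every repeated key. It assumes 'ChannelPartnerID' is unique within t1, and it uses a trimmed copy of the sample data; the names `is_repeat` and `t1_cols` are made up for illustration.

```python
import pandas as pd

# Trimmed copy of the question's sample data (fewer rows and columns).
t1 = pd.DataFrame({
    'ChannelPartnerID': [18263, 50429, 81015, 33906],
    'response': [1, 0, 0, 0],
    'n_comp': [2, 1, 4, 3],
})
t2 = pd.DataFrame({
    'ChannelPartnerID': [18263, 10000, 18263, 33906, 81015, 81015, 33906],
    'Month': [5, 10, 8, 10, 4, 10, 7],
    'Sales': [21793, 7155, 29630, 1530, 34608, 14612, 6122],
})

# Ordinary left merge: each t1 row is repeated once per matching t2 row.
merged = t1.merge(t2, on='ChannelPartnerID', how='left')

# True for the second and later rows of each key
# (assumption: keys are unique within t1).
is_repeat = merged.duplicated(subset='ChannelPartnerID')

# Blank the t1-only columns on the repeated rows; the key and the
# t2 columns are left untouched.
t1_cols = [c for c in t1.columns if c != 'ChannelPartnerID']
merged[t1_cols] = merged[t1_cols].mask(is_repeat)

print(merged)
```

The masked columns become float, because NaN cannot be stored in a plain integer column.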

However, I'm not sure why you want this table structure. If what you need is a table with only the first match for each key, merge_1 is probably enough.
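
For the first-match-only case, one possible sketch is to deduplicate t2 before merging. It uses `drop_duplicates` in place of `groupby(...).nth(0)`, which is a swapped-in technique rather than the code above, and a trimmed copy of the sample data. The `validate` argument is optional and only asserts that the keys are unique on both sides.

```python
import pandas as pd

# Trimmed copy of the question's sample data.
t1 = pd.DataFrame({'ChannelPartnerID': [18263, 50429, 33906],
                   'response': [1, 0, 0]})
t2 = pd.DataFrame({'ChannelPartnerID': [18263, 10000, 18263, 33906, 33906],
                   'Sales': [21793, 7155, 29630, 1530, 6122]})

# Keep only the first t2 row per key, then merge.
t2_first = t2.drop_duplicates(subset='ChannelPartnerID', keep='first')

# validate raises MergeError if either side still has duplicate keys,
# so the result is guaranteed to have exactly one row per t1 row.
first_only = t1.merge(t2_first, on='ChannelPartnerID', how='left',
                      validate='one_to_one')

print(first_only)
```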