Show all matched pairs in a single dataframe - Python Record Linkage

Question

I have a pandas MultiIndex object:

In [0]: index
Out[0]: 
MultiIndex(levels=[[1, 2, 3, 8], [10, 11]],
       labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])

This MultiIndex object defines the following 8 pairs: (1,10), (1,11), (2,10), (2,11), (3,10), (3,11), (8,10), (8,11).
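For readers who want to reproduce the index, here is a minimal sketch. It assumes the index is the full Cartesian product of the two levels, which matches the eight pairs listed above; `from_product` is my choice of constructor, not necessarily how the original index was built.

```python
import pandas as pd

# Assumption: the index is the Cartesian product of the two levels.
index = pd.MultiIndex.from_product([[1, 2, 3, 8], [10, 11]])

# Iterating a MultiIndex yields one tuple per entry, i.e. one pair.
pairs = list(index)
print(pairs)
```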

The elements listed in the levels correspond to index labels of a DataFrame:

In [1]: df
Out[1]: 
     col_1   col_2
0        0       1
1        2       3
2        4       5
3        6       7
4        8       9
5       10      11
6       12      13
7       14      15
8       16      17
9       18      19
10      20      21
11      22      23
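The frame above can be rebuilt with a short sketch like the following. The construction via `np.arange` is an assumption inferred from the printed values, not something stated in the question.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction: 12 rows, default 0..11 index, values 0..23.
df = pd.DataFrame(np.arange(24).reshape(12, 2), columns=["col_1", "col_2"])

# Rows for the pair (1, 10), looked up by index label.
print(df.loc[[1, 10]])
```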

I want to create a new DataFrame that shows all the pairs defined above. It should look like this:

In [2]: result
Out[2]: 
    col_1   col_2     pair
        2       3        0
       20      21        0
        2       3        1
       22      23        1
        4       5        2
       20      21        2
        4       5        3
       22      23        3
        6       7        4
       20      21        4
        6       7        5
       22      23        5
       16      17        6
       20      21        6
       16      17        7
       22      23        7

Is there an efficient way to do this (without a for loop, if possible)?

Thanks in advance.

Answer 1

Setup

import numpy as np
import pandas as pd

# `codes` was called `labels` before pandas 0.24
m = pd.MultiIndex(levels=[[1, 2, 3, 8], [10, 11]],
                  codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])

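You can operate on the underlying NumPy arrays. Note that this indexes by position, so it works here because df has the default 0..11 index and the labels in the MultiIndex coincide with row positions.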

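a = np.stack(m.values)   # (8, 2) array: one row of row positions per pair
v = df.values
res = v[a]               # shape (8, 2, 2): (pairs, rows per pair, columns)
c = res.shape[1]         # rows per pair (not the number of columns)

u = pd.DataFrame(res.reshape(-1, df.shape[1]), columns=df.columns)
u['pair'] = np.repeat(np.arange(u.shape[0] // c), c)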

    col_1  col_2  pair
0       2      3     0
1      20     21     0
2       2      3     1
3      22     23     1
4       4      5     2
5      20     21     2
6       4      5     3
7      22     23     3
8       6      7     4
9      20     21     4
10      6      7     5
11     22     23     5
12     16     17     6
13     20     21     6
14     16     17     7
15     22     23     7

Explanation

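When we index the DataFrame's values with all the combinations in the MultiIndex, we not only get the correct mapping, we also get the rows grouped by pair along the first axis of the output. We can use this shape later to derive the pair column.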

print(v[a])

array([[[ 2,  3],
        [20, 21]],

       [[ 2,  3],
        [22, 23]],

       [[ 4,  5],
        [20, 21]],

       [[ 4,  5],
        [22, 23]],

       [[ 6,  7],
        [20, 21]],

       [[ 6,  7],
        [22, 23]],

       [[16, 17],
        [20, 21]],

       [[16, 17],
        [22, 23]]], dtype=int64)
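The indexing above is positional. If the DataFrame's labels did not coincide with row positions, one possible adaptation (a sketch, not part of the original answer) is to translate labels to positions first with `Index.get_indexer`. The names `df2` and `m2` and their values are hypothetical, chosen only to illustrate a non-default index.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose labels (100..111) differ from row positions.
df2 = pd.DataFrame(
    np.arange(24).reshape(12, 2),
    columns=["col_1", "col_2"],
    index=np.arange(100, 112),
)
# Hypothetical pairs expressed as labels of df2.
m2 = pd.MultiIndex.from_product([[101, 102], [110, 111]])

labels = m2.to_frame(index=False).to_numpy()  # shape (n_pairs, pair_size)
pos = df2.index.get_indexer(labels.ravel())   # label -> position, -1 if missing
assert (pos >= 0).all(), "some labels are not in df2.index"

# Same fancy-indexing trick as above, now on positions.
res = df2.to_numpy()[pos.reshape(labels.shape)]  # (n_pairs, pair_size, n_cols)
n_pairs, pair_size = labels.shape

out = pd.DataFrame(res.reshape(-1, df2.shape[1]), columns=df2.columns)
out["pair"] = np.repeat(np.arange(n_pairs), pair_size)
print(out)
```

Taking the pair size from `labels.shape` rather than from the result array keeps the pair column correct regardless of how many columns the frame has.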

Answer 2

`pd.concat`

Not necessarily the most efficient... but clever (-:

pd.concat(
    [df.loc[[*pair]].assign(pair=i) for i, pair in enumerate(index)]
).reset_index(drop=True)

    col_1  col_2  pair
0       2      3     0
1      20     21     0
2       2      3     1
3      22     23     1
4       4      5     2
5      20     21     2
6       4      5     3
7      22     23     3
8       6      7     4
9      20     21     4
10      6      7     5
11     22     23     5
12     16     17     6
13     20     21     6
14     16     17     7
15     22     23     7

`zip`

Same idea as above:

i_s, j_s = zip(*[(i, j) for j, p in enumerate(index) for i in p])
df.loc[[*i_s]].assign(pair=j_s).reset_index(drop=True)


    col_1  col_2  pair
0       2      3     0
1      20     21     0
2       2      3     1
3      22     23     1
4       4      5     2
5      20     21     2
6       4      5     3
7      22     23     3
8       6      7     4
9      20     21     4
10      6      7     5
11     22     23     5
12     16     17     6
13     20     21     6
14     16     17     7
15     22     23     7
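As a quick sanity check, the two variants can be compared directly. This is a sketch under the assumption that `df` and `index` are built as in the question; the intermediate name `flat` is mine, introduced only to make the comprehension easier to read.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data.
df = pd.DataFrame(np.arange(24).reshape(12, 2), columns=["col_1", "col_2"])
index = pd.MultiIndex.from_product([[1, 2, 3, 8], [10, 11]])

# concat variant: one two-row frame per pair, tagged with the pair number.
via_concat = pd.concat(
    [df.loc[[*pair]].assign(pair=i) for i, pair in enumerate(index)]
).reset_index(drop=True)

# zip variant: flatten to (row label, pair number) tuples, then split them
# into one sequence of labels and one sequence of pair numbers.
flat = [(i, j) for j, p in enumerate(index) for i in p]
i_s, j_s = zip(*flat)
via_zip = df.loc[[*i_s]].assign(pair=j_s).reset_index(drop=True)

# Raises AssertionError if the two frames differ.
pd.testing.assert_frame_equal(via_concat, via_zip)
```

Both variants still loop in Python over the pairs, so for a large index the array-based approach is likely to scale better.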

Answer 3

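Use stack with iloc or reindex. The iloc version is shown below; it selects by position, which matches the labels here because df has the default integer index.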

df.iloc[m.to_frame().stack()].assign(key=m.to_frame().reset_index(drop=True).stack().index.get_level_values(0))
Out[205]: 
    col_1  col_2  key
1       2      3    0
10     20     21    0
1       2      3    1
11     22     23    1
2       4      5    2
10     20     21    2
2       4      5    3
11     22     23    3
3       6      7    4
10     20     21    4
3       6      7    5
11     22     23    5
8      16     17    6
10     20     21    6
8      16     17    7
11     22     23    7
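The heading mentions reindex, but only the iloc form is shown. A possible label-based version is sketched below; it is my reading of what the reindex variant could look like, not the answerer's code, and it assumes `df` and `m` are built as in the question.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data.
df = pd.DataFrame(np.arange(24).reshape(12, 2), columns=["col_1", "col_2"])
m = pd.MultiIndex.from_product([[1, 2, 3, 8], [10, 11]])

labels = m.to_frame(index=False).to_numpy()  # one row of labels per pair

# reindex looks rows up by label; ravel() keeps the pairs adjacent.
out = df.reindex(labels.ravel())
out["key"] = np.repeat(np.arange(len(m)), labels.shape[1])
print(out)
```

One design point to keep in mind: `reindex` fills rows for missing labels with NaN, whereas `df.loc` with a list of labels raises a `KeyError`, so `loc` may be preferable when a missing label should be treated as an error.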
