Compare 'order' of rows within 'groups' of two separate dataframes and find the rows that are 'swapped' in order and extract the original indexes

Question

I have two large dataframes (each >10 GB) that look like this:

df1

     Identifier  Position Source  Location
1      AY1:2301        87    ch1        14
2     BC1U:4010       105    ch1        14
3     AC44:1230        90    ch1        15
4     AJC:93410        83    ch1        16
5     ABYY:0001       101    ch1        16
6        ABC:01        42    ch1        16
7       HH:A9CX       413    ch1        17
8       LK:9310         2    ch1        17
9     JFNE:3410       132    ch1        18
10    MKASDL:11        14    ch1        18
11   MKDFA:9401        18    ch1        18
12  MKASDL1:011       184    ch2        50
13   LKOC:AMC02        18    ch2        50
14     POI:1100       900    ch2        53
15    MCJE:09HA        11    ch2        53
16   ABYCI:1123        15    ch2        53
17     MNKA:410         1    ch2        53

df2

     Identifier  Position Source  Location
1      AY1:2301        87    ch1        14
2     BC1U:4010       105    ch1        14
3     AC44:1230        90    ch1        15
4        ABC:01        42    ch1        16
5     ABYY:0001       101    ch1        16
6     AJC:93410        83    ch1        16
7       HH:A9CX       413    ch1        17
8       LK:9310         2    ch1        17
9     MKASDL:11        14    ch1        18
10    JFNE:3410       132    ch1        18
11   MKDFA:9401        18    ch1        18
12  MKASDL1:011       184    ch2        50
13   LKOC:AMC02        18    ch2        50
14     MNKA:410         1    ch2        53
15     POI:1100       900    ch2        53
16   ABYCI:1123        15    ch2        53
17    MCJE:09HA        11    ch2        53

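I want to do something like a "diff", but at the 'group' level (df.groupby(['Source', 'Location'])).

When the order of rows within the same Source/Location group is 'swapped', I want to extract the original "row numbers".

The full row contents should of course match.

But I have no idea how to do this. I can only think of writing a for loop, which would be very inefficient when my original dataset has millions of rows.

Expected result: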

Group_Source:Location  df1.index  df2.index

ch1:16                         4          6
ch1:16                         6          4
ch1:18                         9         10
ch1:18                        10          9
ch2:53                        14         15
ch2:53                        15         17
ch2:53                        17         14

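Assumptions:

Both dataframes have the same number of rows
Both dataframes contain the same rows (only the row order is swapped, so if both were sorted by Source, then Location, then Position, then Identifier, they would be identical)
'Swapped' rows always match exactly in the contents of every column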
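
The second assumption can be checked directly before attempting any comparison. The following is a minimal sketch, assuming the column names shown above; the helper name `same_rows_ignoring_order` and the three-row sample (taken from the ch1:16 group) are illustrative only.

```python
import pandas as pd

def same_rows_ignoring_order(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Return True if a and b hold the same rows, possibly in a different order."""
    keys = ["Source", "Location", "Position", "Identifier"]
    # Sort both frames the same way and discard the original row numbers,
    # so that only the row contents are compared.
    a_sorted = a.sort_values(keys).reset_index(drop=True)
    b_sorted = b.sort_values(keys).reset_index(drop=True)
    return a_sorted.equals(b_sorted)

# Hypothetical sample: the ch1:16 group from df1 and df2 above.
cols = ["Identifier", "Position", "Source", "Location"]
df1 = pd.DataFrame(
    [["AJC:93410", 83, "ch1", 16],
     ["ABYY:0001", 101, "ch1", 16],
     ["ABC:01", 42, "ch1", 16]],
    columns=cols, index=[4, 5, 6],
)
df2 = pd.DataFrame(
    [["ABC:01", 42, "ch1", 16],
     ["ABYY:0001", 101, "ch1", 16],
     ["AJC:93410", 83, "ch1", 16]],
    columns=cols, index=[4, 5, 6],
)

print(same_rows_ignoring_order(df1, df2))
```

Sorting two >10 GB frames in memory may not be practical, so on the full data this check would likely have to run per chunk or per partition.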

Answer 1

Taking df1 and df2 from your example, this should do the job:

>>> (df1
...  .reset_index()
...  .merge(df2.reset_index(), on=['Source', 'Location', 'Identifier'])
...  .groupby(['Source', 'Location', 'index_x'], as_index=False)[['index_y']].first()
...  .query('index_x != index_y'))


  Source  Location  index_x  index_y
     ch1        16        4        6
     ch1        16        6        4
     ch1        18        9       10
     ch1        18       10        9
     ch2        53       14       15
     ch2        53       15       17
     ch2        53       17       14
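
For readers who want to try the idea end to end, here is a self-contained sketch on a subset of the sample rows. It is a variation on the merge above, not the exact code from this answer: it joins on all four columns (so the whole row has to match), skips the groupby step, and formats the result like the expected output in the question. It assumes each row is unique across those four columns; duplicated rows would be paired more than once by the merge. The function name `swapped_rows` is made up for illustration.

```python
import pandas as pd

cols = ["Identifier", "Position", "Source", "Location"]
idx = [3, 4, 5, 6, 14, 15, 16, 17]  # original row numbers from the question

df1 = pd.DataFrame(
    [["AC44:1230", 90, "ch1", 15],
     ["AJC:93410", 83, "ch1", 16],
     ["ABYY:0001", 101, "ch1", 16],
     ["ABC:01", 42, "ch1", 16],
     ["POI:1100", 900, "ch2", 53],
     ["MCJE:09HA", 11, "ch2", 53],
     ["ABYCI:1123", 15, "ch2", 53],
     ["MNKA:410", 1, "ch2", 53]],
    columns=cols, index=idx,
)
df2 = pd.DataFrame(
    [["AC44:1230", 90, "ch1", 15],
     ["ABC:01", 42, "ch1", 16],
     ["ABYY:0001", 101, "ch1", 16],
     ["AJC:93410", 83, "ch1", 16],
     ["MNKA:410", 1, "ch2", 53],
     ["POI:1100", 900, "ch2", 53],
     ["ABYCI:1123", 15, "ch2", 53],
     ["MCJE:09HA", 11, "ch2", 53]],
    columns=cols, index=idx,
)

def swapped_rows(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    """Pair identical rows of a and b and keep those whose row number differs."""
    # Turn each frame's index into a named column so it survives the merge.
    left = a.rename_axis("df1.index").reset_index()
    right = b.rename_axis("df2.index").reset_index()
    # Inner join on every data column: only fully identical rows are paired.
    merged = left.merge(right, on=["Source", "Location", "Position", "Identifier"])
    moved = merged[merged["df1.index"] != merged["df2.index"]]
    group = moved["Source"] + ":" + moved["Location"].astype(str)
    return pd.DataFrame(
        {
            "Group_Source:Location": group,
            "df1.index": moved["df1.index"],
            "df2.index": moved["df2.index"],
        }
    ).reset_index(drop=True)

result = swapped_rows(df1, df2)
print(result)
```

On this subset the result pairs df1 rows 4, 6, 14, 15 and 17 with df2 rows 6, 4, 15, 17 and 14, which matches the ch1:16 and ch2:53 lines of the expected output. Whether a single merge is feasible on >10 GB inputs depends on available memory; that is not tested here.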


python

dataframe

pandas

dask