Row-by-row comparison with groupby

I have a DataFrame like this:

ordre  id    date        origine  destination  horaire A  horaire B
1      1112  2021-03-11  Paris    Marseille    10:00      14:00
2      1114  2021-05-11  Paris    Bordeaux     09:00      13:00
3      1112  2021-03-11  Paris    Marseille    10:00      14:00
4      1114  2021-05-11  Paris    Bordeaux     10:20      14:00
5      1112  2021-03-11  Paris    Marseille    10:00      14:00
6      1112  2021-03-11  Paris    Marseille    10:00      14:00
7      1114  2021-05-11  Paris    Bordeaux     09:00      13:00
8      1114  2021-05-11  Paris    Bordeaux     10:00      14:00
9      1112  2021-03-11  Paris    Lyon         10:00      12:00

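I want to add a new column, note, that stores the result of a comparison within each group of rows sharing the same id and date: if any of 'origine / destination / horaire A / horaire B' changes from one row to the next, note should be True.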

Example of the expected output:

ordre  id    date        origine  destination  horaire A  horaire B  note
1      1112  2021-03-11  Paris    Marseille    10:00      14:00
2      1112  2021-03-11  Paris    Marseille    10:00      14:00
3      1112  2021-03-11  Paris    Marseille    10:00      14:00
4      1112  2021-03-11  Paris    Lyon         10:00      12:00      True
5      1112  2021-03-11  Paris    Marseille    10:00      14:00      True
6      1114  2021-05-11  Paris    Bordeaux     09:00      13:00
7      1114  2021-05-11  Paris    Bordeaux     09:00      13:00
8      1114  2021-05-11  Paris    Bordeaux     10:00      14:00      True
9      1114  2021-05-11  Paris    Bordeaux     10:20      14:00      True

I wrote this code:

df['Note'] = df.groupby(['Date','id']).apply(lambda x: (x['Origine'] != x['Origine'].shift(-1)) | (x['Destination'] != x['Destination'].shift(-1)) | (x['Horaire A'] != x['Horaire A'].shift(-1)) | (x['Horaire B'] != x['Horaire B'].shift(-1)))
df['Note'] = df['Note'].shift(1)

But it raises this error: incompatible index of inserted column with frame index

How can I fix this?

IIUC, you can compare the rows of each group with the same rows shifted by one. If any field does not match, the output is set to True.

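Here I rely on "ordre" as a unique key to merge the result back onto the original data. If it is not unique, you can use the index instead. In that case, "ordre" should be dropped from the data passed to the groupby, since it differs on every row.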

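# Compare each row with the previous row of the same (id, date) group.
# bfill() makes the first row of a multi-row group compare equal to itself.
df.merge(df.set_index('ordre')
           .groupby(['id', 'date'], group_keys=False)
           .apply(lambda d: d.ne(d.shift().bfill()).any(axis=1))  # axis must be a keyword in recent pandas
           .rename('diff_previous'),
         left_on='ordre', right_index=True
        )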

Output:

   ordre    id        date origine destination horaire A horaire B  diff_previous
0      1  1112  2021-03-11   Paris   Marseille     10:00     14:00          False
1      2  1114  2021-05-11   Paris    Bordeaux     09:00     13:00          False
2      3  1112  2021-03-11   Paris   Marseille     10:00     14:00          False
3      4  1114  2021-05-11   Paris    Bordeaux     10:20     14:00           True
4      5  1112  2021-03-11   Paris   Marseille     10:00     14:00          False
5      6  1112  2021-03-11   Paris   Marseille     10:00     14:00          False
6      7  1114  2021-05-11   Paris    Bordeaux     09:00     13:00           True
7      8  1114  2021-05-11   Paris    Bordeaux     10:00     14:00           True
8      9  1112  2021-03-11   Paris        Lyon     10:00     12:00           True

I used the following code to generate the DataFrame:

import pandas as pd

data = {
    "ordre": range(1, 10),
    "id": [1112, 1114, 1112, 1114, 1112, 1112, 1114, 1114, 1112],
    "date": [
        "2021-03-11",
        "2021-05-11",
        "2021-03-11",
        "2021-05-11",
        "2021-03-11",
        "2021-03-11",
        "2021-05-11",
        "2021-05-11",
        "2021-03-11",
    ],
    "origine": ["Paris", "Paris", "Paris", "Paris", "Paris", "Paris", "Paris", "Paris", "Paris"],
    "destination": [
        "Marseille",
        "Bordeaux",
        "Marseille",
        "Bordeaux",
        "Marseille",
        "Marseille",
        "Bordeaux",
        "Bordeaux",
        "Lyon",
    ],
    "horaire A": ["10:00", "09:00", "10:00", "10:20", "10:00", "10:00", "09:00", "10:00", "10:00"],
    "horaire B": ["14:00", "13:00", "14:00", "14:00", "14:00", "14:00", "13:00", "14:00", "12:00"],
}

df = pd.DataFrame(data)

The idea is then:

  1. Sort the data by ("date", "id", "ordre").
  2. Set "note" to True if:

    a. ("date", "id") is the same as in the previous row, and

    b. at least one of ("origine", "destination", "horaire A", "horaire B") differs from the previous row.

Which translates to:

index_cols = ["date", "id"]
compare_cols = ["origine", "destination", "horaire A", "horaire B"]

df = df.sort_values(by=index_cols+["ordre"])
shifted_compare = df[index_cols + compare_cols].shift(1).eq(df[index_cols + compare_cols])

df["note"] = shifted_compare[index_cols].all(axis=1) & ~shifted_compare[compare_cols].all(axis=1)

Output:

>>> df.sort_values(by="ordre")
   ordre    id        date origine destination horaire A horaire B   note
0      1  1112  2021-03-11   Paris   Marseille     10:00     14:00  False
1      2  1114  2021-05-11   Paris    Bordeaux     09:00     13:00  False
2      3  1112  2021-03-11   Paris   Marseille     10:00     14:00  False
3      4  1114  2021-05-11   Paris    Bordeaux     10:20     14:00   True
4      5  1112  2021-03-11   Paris   Marseille     10:00     14:00  False
5      6  1112  2021-03-11   Paris   Marseille     10:00     14:00  False
6      7  1114  2021-05-11   Paris    Bordeaux     09:00     13:00   True
7      8  1114  2021-05-11   Paris    Bordeaux     10:00     14:00   True
8      9  1112  2021-03-11   Paris        Lyon     10:00     12:00   True
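
If the original row order should be preserved without a final sort, the same logic can be wrapped in a small helper. This is only a sketch: add_note and its parameters are hypothetical names, not part of pandas. It relies on DataFrame.assign aligning the computed Series on the index, so the input frame is never reordered.

```python
import pandas as pd


def add_note(frame, index_cols, compare_cols, order_col):
    """Return a copy of frame with a boolean "note" column (hypothetical helper)."""
    # Sort so that rows of the same group are adjacent and in order.
    ordered = frame.sort_values(by=index_cols + [order_col])
    cols = index_cols + compare_cols
    # True where a cell equals the cell directly above it in the sorted frame.
    same = ordered[cols].shift(1).eq(ordered[cols])
    # Same group as the previous row, but at least one compared column changed.
    note = same[index_cols].all(axis=1) & ~same[compare_cols].all(axis=1)
    # assign() aligns on the index, so the original row order is kept.
    return frame.assign(note=note)


df = pd.DataFrame({
    "ordre": range(1, 10),
    "id": [1112, 1114, 1112, 1114, 1112, 1112, 1114, 1114, 1112],
    "date": ["2021-03-11", "2021-05-11", "2021-03-11", "2021-05-11", "2021-03-11",
             "2021-03-11", "2021-05-11", "2021-05-11", "2021-03-11"],
    "origine": ["Paris"] * 9,
    "destination": ["Marseille", "Bordeaux", "Marseille", "Bordeaux", "Marseille",
                    "Marseille", "Bordeaux", "Bordeaux", "Lyon"],
    "horaire A": ["10:00", "09:00", "10:00", "10:20", "10:00",
                  "10:00", "09:00", "10:00", "10:00"],
    "horaire B": ["14:00", "13:00", "14:00", "14:00", "14:00",
                  "14:00", "13:00", "14:00", "12:00"],
})

result = add_note(
    df,
    index_cols=["date", "id"],
    compare_cols=["origine", "destination", "horaire A", "horaire B"],
    order_col="ordre",
)
print(result)
```

The helper leaves df itself untouched and returns a new frame, which makes it easy to compare the result against other approaches.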