Editing the order and content of a schedule file in Pandas, Python based on a condition

Question

I am trying to edit a schedule file in pandas (Python 3), but I am stuck.

Basically, I have a schedule file like this:

id    trip_id    origin    destination    courier_status     package_origin   package_destination
1        1         A           B              False                nan               nan
1        2         B           C               True                 X                 Y
2        1         F           G              False                nan               nan
2        2         G           H               True                 Q                 R
2        3         H           I              False                nan               nan

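If courier_status is True, I want them (the people identified by id) to detour via package_origin and package_destination before continuing to destination, which changes their schedule. Ideally, the new schedule, newSchedule, should look like this: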

id    trip_id    origin    destination   status
1        1         A           B         normal
1        2         B           X         courier
1        3         X           Y         courier
1        4         Y           C         normal
2        1         F           G         normal
2        2         G           Q         courier
2        3         Q           R         courier
2        4         R           H         normal
2        5         H           I         normal

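My idea is to build a new df containing only the extra trips, append it to the existing schedule, drop duplicates with keep='last', and then apply sort_values on id. However, I cannot manage to build the newSchedule DataFrame. Can anyone help me or point me to an algorithm I should use? I am considering a loop or np.where.

The real data has many more columns and rows; I just want to know how to approach it. I am new to Python, so I am quite lost right now.

Please help!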

Answer 1

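Here is one option. First, split the DataFrame on the courier_status column. There are many ways to do this; here I use groupby. Iterating over a groupby yields the groups in sorted key order, so False (normal) comes first and True (courier) second. The unpacking assumes both values occur in the data: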

import pandas as pd  # df is the schedule DataFrame from the question
(_, df_n), (_, df_c) = df.groupby('courier_status')
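As one alternative among the "many ways" mentioned above, a boolean mask should give the same two frames and also copes with data where one of the two groups is empty. This is only a sketch: the sample data is rebuilt by hand from the question's table, and the name is_courier is made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample schedule from the question, rebuilt by hand for illustration
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'trip_id': [1, 2, 1, 2, 3],
    'origin': list('ABFGH'),
    'destination': list('BCGHI'),
    'courier_status': [False, True, False, True, False],
    'package_origin': [np.nan, 'X', np.nan, 'Q', np.nan],
    'package_destination': [np.nan, 'Y', np.nan, 'R', np.nan],
})

# is_courier is a hypothetical helper name, not part of the original answer
is_courier = df['courier_status'].astype(bool)

# .copy() makes both pieces independent of df, so later edits do not touch it
df_c = df[is_courier].copy()   # courier trips, original index labels 1 and 3
df_n = df[~is_courier].copy()  # normal trips, original index labels 0, 2 and 4
```

Both pieces keep the original index labels, which the later sort_index step relies on.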

The normal DataFrame is easy to handle: just drop a few columns and assign the status:

# assign returns a new DataFrame instead of modifying the groupby slice in place
df_n = (df_n.assign(status='normal')
            .drop(columns=['courier_status', 'package_origin', 'package_destination', 'trip_id']))

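The courier DataFrame takes more work. Here we need to form the chain from ['origin', 'package_origin', 'package_destination', 'destination'], which can be done by selecting the columns in that order, stacking them, and joining the result with a shifted version of itself. There is a bit of cleanup for what I put into the index but need to keep. Finally, every leg gets the status 'courier' except the last one ('package_destination' -> 'destination'), which is set back to 'normal'.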

s = (df_c.set_index(['id'], append=True)
        [['origin', 'package_origin', 'package_destination', 'destination']].stack()
    )

df_c = (pd.concat([s.rename('origin'), s.groupby(level=0).shift(-1).rename('destination')], axis=1)
          .dropna()
          .reset_index(['id'])
          .reset_index(-1, drop=True)
          .assign(status='courier'))

df_c.loc[~df_c.index.duplicated(keep='last'), 'status'] = 'normal'
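To see what the stack-and-shift step produces, here is a minimal sketch on a single courier row taken from the question. The names row, stops, legs and leg_pairs are made up for this illustration and do not appear in the answer's code.

```python
import pandas as pd

# One courier row from the question, keeping its original index label 1
row = pd.DataFrame(
    {'id': [1], 'origin': ['B'], 'package_origin': ['X'],
     'package_destination': ['Y'], 'destination': ['C']},
    index=[1],
)

# Stack the four stops in visiting order: B, X, Y, C
stops = row.set_index('id', append=True)[
    ['origin', 'package_origin', 'package_destination', 'destination']
].stack()

# Pair each stop with the next stop of the same original row
legs = pd.concat(
    [stops.rename('origin'),
     stops.groupby(level=0).shift(-1).rename('destination')],
    axis=1,
).dropna()  # the final stop has no successor, so that row is dropped

leg_pairs = list(zip(legs['origin'], legs['destination']))
print(leg_pairs)  # [('B', 'X'), ('X', 'Y'), ('Y', 'C')]
```

Grouping the shift by the first index level keeps the chains of different trips from leaking into each other.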

Finally, because we kept the original index all along, we can concat the two together, sort_index to put the rows in the order they should appear, and define 'trip_id' with groupby + cumcount:

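# mergesort is stable, so the legs that share an index label keep their chain order
result = pd.concat([df_n, df_c]).sort_index(kind='mergesort')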
result['trip_id'] = result.groupby('id').cumcount()+1

#   id origin destination   status  trip_id
#0   1      A           B   normal        1
#1   1      B           X  courier        2
#1   1      X           Y  courier        3
#1   1      Y           C   normal        4
#2   2      F           G   normal        1
#3   2      G           Q  courier        2
#3   2      Q           R  courier        3
#3   2      R           H   normal        4
#4   2      H           I   normal        5
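For reference, the steps above can be collected into one self-contained sketch. The function name expand_schedule and the constant STOPS are hypothetical, the sample data is rebuilt by hand from the question, and a boolean mask stands in for the groupby split; treat it as one possible arrangement under those assumptions, not a tested general solution.

```python
import numpy as np
import pandas as pd

STOPS = ['origin', 'package_origin', 'package_destination', 'destination']

def expand_schedule(df):
    """Hypothetical helper: insert courier detours into a schedule."""
    is_courier = df['courier_status'].astype(bool)

    # Normal trips keep their single leg
    normal = df.loc[~is_courier, ['id', 'origin', 'destination']].assign(status='normal')

    # Courier trips become a chain of stops; stack() yields one row per stop
    stops = df.loc[is_courier].set_index('id', append=True)[STOPS].stack()
    legs = (pd.concat([stops.rename('origin'),
                       stops.groupby(level=0).shift(-1).rename('destination')],
                      axis=1)
              .dropna()
              .reset_index('id')
              .reset_index(-1, drop=True)
              .assign(status='courier'))
    # The last leg of each chain heads to the original destination
    legs.loc[~legs.index.duplicated(keep='last'), 'status'] = 'normal'

    # A stable sort keeps the legs of one trip in chain order
    out = pd.concat([normal, legs]).sort_index(kind='mergesort')
    out['trip_id'] = out.groupby('id').cumcount() + 1
    return out.reset_index(drop=True)[['id', 'trip_id', 'origin', 'destination', 'status']]

# Sample schedule from the question, rebuilt by hand for illustration
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'trip_id': [1, 2, 1, 2, 3],
    'origin': list('ABFGH'),
    'destination': list('BCGHI'),
    'courier_status': [False, True, False, True, False],
    'package_origin': [np.nan, 'X', np.nan, 'Q', np.nan],
    'package_destination': [np.nan, 'Y', np.nan, 'R', np.nan],
})

new_schedule = expand_schedule(df)
print(new_schedule)
```

On the sample data this should reproduce the newSchedule table from the question. The sketch assumes the input contains at least one courier row and that courier rows have no missing package_origin or package_destination.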
