Python Pandas Add DataRow Revision Number

I have a data frame like the one below. I need to compute a revision number for each combination of ID and Serial.

ID Serial Time Revision
48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 12/31/2020 8:37:13 AM
78faedd8-a250-4e52-ac81-a29d46715a51 PQR 12/03/2019 1:30:00 AM
48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 12/31/2020 8:37:13 AM
48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 01/23/2021 5:18:44 PM
78faedd8-a250-4e52-ac81-a29d46715a51 ABC 10/23/2020 8:01:08 AM
48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 01/20/2021 8:10:27 PM
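
To make the question reproducible, the table above can be rebuilt roughly as follows. This is only a sketch: it assumes Time is still stored as plain strings, which the question does not state, and it leaves out the empty Revision column.

```python
import pandas as pd

# Rows copied from the table above. Time is kept as strings here (an assumption
# about how the data was loaded); Revision is the column to be computed.
df = pd.DataFrame(
    {
        "ID": [
            "48ff35eb-70ad-4dcd-a441-8c7c9966a236",
            "78faedd8-a250-4e52-ac81-a29d46715a51",
            "48ff35eb-70ad-4dcd-a441-8c7c9966a236",
            "48ff35eb-70ad-4dcd-a441-8c7c9966a236",
            "78faedd8-a250-4e52-ac81-a29d46715a51",
            "48ff35eb-70ad-4dcd-a441-8c7c9966a236",
        ],
        "Serial": ["ABC", "PQR", "ABC", "ABC", "ABC", "ABC"],
        "Time": [
            "12/31/2020 8:37:13 AM",
            "12/03/2019 1:30:00 AM",
            "12/31/2020 8:37:13 AM",
            "01/23/2021 5:18:44 PM",
            "10/23/2020 8:01:08 AM",
            "01/20/2021 8:10:27 PM",
        ],
    }
)

print(df)
```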

Expected result:

ID Serial Time Revision
48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 12/31/2020 8:37:13 AM 1
78faedd8-a250-4e52-ac81-a29d46715a51 PQR 12/03/2019 1:30:00 AM 1
48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 12/31/2020 8:37:13 AM 1
48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 01/23/2021 5:18:44 PM 3
78faedd8-a250-4e52-ac81-a29d46715a51 ABC 10/23/2020 8:01:08 AM 1
48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 01/20/2021 8:10:27 PM 2

I tried the following:

columns_of_interest = ["ID", "Serial"]
df["revision"] = df.groupby(columns_of_interest).cumcount() + 1

However, this only gives me a running row count within each group. How do I get the correct revision number?

I think you are looking for a dense rank:

import pandas as pd

# `rank` needs values that sort chronologically, so parse the Time strings into datetimes first
df['Time'] = pd.to_datetime(df['Time'])

df['Revision'] = df.groupby(columns_of_interest)['Time'].rank(method='dense')

Output:

                                     ID Serial                Time  Revision
0  48ff35eb-70ad-4dcd-a441-8c7c9966a236    ABC 2020-12-31 08:37:13       1.0
1  78faedd8-a250-4e52-ac81-a29d46715a51    PQR 2019-12-03 01:30:00       1.0
2  48ff35eb-70ad-4dcd-a441-8c7c9966a236    ABC 2020-12-31 08:37:13       1.0
3  48ff35eb-70ad-4dcd-a441-8c7c9966a236    ABC 2021-01-23 17:18:44       3.0
4  78faedd8-a250-4e52-ac81-a29d46715a51    ABC 2020-10-23 08:01:08       1.0
5  48ff35eb-70ad-4dcd-a441-8c7c9966a236    ABC 2021-01-20 20:10:27       2.0
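
For comparison, here is a minimal self-contained sketch that runs both `cumcount` and the dense rank on the same rows. The short IDs (`id-1`, `id-2`) are placeholders for the UUIDs in the question, the explicit `format` string is my assumption about how the Time strings are laid out, and the final `astype(int)` is an optional extra that turns the float ranks into the integers shown in the expected result.

```python
import pandas as pd

# Placeholder IDs ("id-1", "id-2") stand in for the UUIDs in the question.
df = pd.DataFrame(
    {
        "ID": ["id-1", "id-2", "id-1", "id-1", "id-2", "id-1"],
        "Serial": ["ABC", "PQR", "ABC", "ABC", "ABC", "ABC"],
        "Time": [
            "12/31/2020 8:37:13 AM",
            "12/03/2019 1:30:00 AM",
            "12/31/2020 8:37:13 AM",
            "01/23/2021 5:18:44 PM",
            "10/23/2020 8:01:08 AM",
            "01/20/2021 8:10:27 PM",
        ],
    }
)

keys = ["ID", "Serial"]

# Assumed format: month/day/year with a 12-hour clock, as in the sample rows.
df["Time"] = pd.to_datetime(df["Time"], format="%m/%d/%Y %I:%M:%S %p")

# cumcount numbers rows in order of appearance, so duplicate rows get separate numbers.
row_counter = df.groupby(keys).cumcount() + 1

# Dense rank numbers the distinct timestamps: equal times share a rank, no rank is skipped.
df["Revision"] = df.groupby(keys)["Time"].rank(method="dense").astype(int)

print(row_counter.tolist())     # [1, 1, 2, 3, 1, 4]
print(df["Revision"].tolist())  # [1, 1, 1, 3, 1, 2]
```

The integer cast is safe here only because every row has a Time; with missing timestamps the rank would be NaN and the cast would fail.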

Two steps. First, drop the duplicate rows, sort the values by Time, and build a counter within each ID and Serial group.

Then we can forward-fill the revision for the duplicate rows, which have no value yet:

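# Step 1: number the distinct timestamps per (ID, Serial); duplicate rows get NaN for now
df['Revision'] = df.index.map(df.drop_duplicates(subset=['ID','Serial','Time'], keep='first')
                   .sort_values('Time').groupby(['ID','Serial']).cumcount() + 1)

# Step 2: fill each duplicate from the preceding row of the same group
df['Revision'] = df.groupby(['ID','Serial'])['Revision'].ffill().astype(int)

Output:
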
                                     ID Serial                Time  Revision
0  48ff35eb-70ad-4dcd-a441-8c7c9966a236    ABC 2020-12-31 08:37:13         1
1  78faedd8-a250-4e52-ac81-a29d46715a51    PQR 2019-12-03 01:30:00         1
2  48ff35eb-70ad-4dcd-a441-8c7c9966a236    ABC 2020-12-31 08:37:13         1
3  48ff35eb-70ad-4dcd-a441-8c7c9966a236    ABC 2021-01-23 17:18:44         3
4  78faedd8-a250-4e52-ac81-a29d46715a51    ABC 2020-10-23 08:01:08         1
5  48ff35eb-70ad-4dcd-a441-8c7c9966a236    ABC 2021-01-20 20:10:27         2
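
One caveat, as far as I can tell: the forward fill copies the value from the previous row of the same ID and Serial group, so it gives the right answer only when that previous row carries the same Time as the duplicate. A variant that does not depend on row order is to number the distinct timestamps in a small lookup table and merge it back. The sketch below uses made-up rows with placeholder IDs, in which a duplicate appears after a later timestamp; the names `unique_times` and `result` and the `format` string are my own assumptions, not part of the answers above.

```python
import pandas as pd

# Made-up rows: the duplicate of the first timestamp (row 2) comes after a later one (row 1).
df = pd.DataFrame(
    {
        "ID": ["id-1", "id-1", "id-1", "id-2"],
        "Serial": ["ABC", "ABC", "ABC", "ABC"],
        "Time": [
            "12/31/2020 8:37:13 AM",
            "01/20/2021 8:10:27 PM",
            "12/31/2020 8:37:13 AM",
            "10/23/2020 8:01:08 AM",
        ],
    }
)
df["Time"] = pd.to_datetime(df["Time"], format="%m/%d/%Y %I:%M:%S %p")

keys = ["ID", "Serial"]

# One row per distinct (ID, Serial, Time), numbered in time order within each group.
unique_times = df[keys + ["Time"]].drop_duplicates().sort_values("Time")
unique_times["Revision"] = unique_times.groupby(keys).cumcount() + 1

# Attach the revision to every original row, duplicates included.
# A left merge keeps the rows of df in their original order.
result = df.merge(unique_times, on=keys + ["Time"], how="left")

print(result["Revision"].tolist())  # [1, 2, 1, 1]
```

On the six rows from the question this should produce the same numbers as the dense rank, so which one to use is mostly a matter of taste.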