Python Pandas: Add a Revision Number to Data Rows
I have a dataframe like the one below. I need to group rows by the combination of ID and Serial and compute a revision number within each group.
ID | Serial | Time | Revision |
---|---|---|---|
48ff35eb-70ad-4dcd-a441-8c7c9966a236 | ABC | 12/31/2020 8:37:13 AM | |
78faedd8-a250-4e52-ac81-a29d46715a51 | PQR | 12/03/2019 1:30:00 AM | |
48ff35eb-70ad-4dcd-a441-8c7c9966a236 | ABC | 12/31/2020 8:37:13 AM | |
48ff35eb-70ad-4dcd-a441-8c7c9966a236 | ABC | 01/23/2021 5:18:44 PM | |
78faedd8-a250-4e52-ac81-a29d46715a51 | ABC | 10/23/2020 8:01:08 AM | |
48ff35eb-70ad-4dcd-a441-8c7c9966a236 | ABC | 01/20/2021 8:10:27 PM | |
Expected result:
ID | Serial | Time | Revision |
---|---|---|---|
48ff35eb-70ad-4dcd-a441-8c7c9966a236 | ABC | 12/31/2020 8:37:13 AM | 1 |
78faedd8-a250-4e52-ac81-a29d46715a51 | PQR | 12/03/2019 1:30:00 AM | 1 |
48ff35eb-70ad-4dcd-a441-8c7c9966a236 | ABC | 12/31/2020 8:37:13 AM | 1 |
48ff35eb-70ad-4dcd-a441-8c7c9966a236 | ABC | 01/23/2021 5:18:44 PM | 3 |
78faedd8-a250-4e52-ac81-a29d46715a51 | ABC | 10/23/2020 8:01:08 AM | 1 |
48ff35eb-70ad-4dcd-a441-8c7c9966a236 | ABC | 01/20/2021 8:10:27 PM | 2 |
I tried the following:
columns_of_interest = ["ID", "Serial"]
df["revision"] = df.groupby(columns_of_interest).cumcount() + 1
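This only gives me a running row count within each group. How can I get the correct revision number, so that rows with the same Time share a revision and later times get higher ones?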
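To make the gap concrete, here is a minimal sketch of what `cumcount` produces on the sample data. The short IDs `"a"` and `"b"` are hypothetical stand-ins for the two UUIDs, used only to keep the example readable.

```python
import pandas as pd

# "a" and "b" are hypothetical stand-ins for the two UUIDs in the question.
df = pd.DataFrame({
    "ID": ["a", "b", "a", "a", "b", "a"],
    "Serial": ["ABC", "PQR", "ABC", "ABC", "ABC", "ABC"],
    "Time": pd.to_datetime([
        "2020-12-31 08:37:13",
        "2019-12-03 01:30:00",
        "2020-12-31 08:37:13",
        "2021-01-23 17:18:44",
        "2020-10-23 08:01:08",
        "2021-01-20 20:10:27",
    ]),
})

# cumcount numbers rows by their position within each (ID, Serial) group.
# It never looks at Time, so duplicates and out-of-order rows are miscounted.
df["revision"] = df.groupby(["ID", "Serial"]).cumcount() + 1
print(df["revision"].tolist())  # [1, 1, 2, 3, 1, 4], expected [1, 1, 1, 3, 1, 2]
```

The duplicate row at index 2 gets revision 2 instead of 1, and the last two rows of the `("a", "ABC")` group are numbered by position rather than by time.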
I think you are looking for a dense rank:
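import pandas as pd
# Convert Time to datetime first so that rank orders the rows chronologically
df['Time'] = pd.to_datetime(df['Time'])
# Dense rank: equal times share a rank and the ranks have no gaps
df['Revision'] = df.groupby(columns_of_interest)['Time'].rank(method='dense')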
Output:
ID Serial Time Revision
0 48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 2020-12-31 08:37:13 1.0
1 78faedd8-a250-4e52-ac81-a29d46715a51 PQR 2019-12-03 01:30:00 1.0
2 48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 2020-12-31 08:37:13 1.0
3 48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 2021-01-23 17:18:44 3.0
4 78faedd8-a250-4e52-ac81-a29d46715a51 ABC 2020-10-23 08:01:08 1.0
5 48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 2021-01-20 20:10:27 2.0
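Note that `rank` returns floats, which is why the output shows `1.0` rather than `1`. A minimal self-contained sketch of the same dense-rank approach on the question's data follows. The explicit `format` string (assuming US-style month/day/year timestamps) and the cast to `int` are additions for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [
        "48ff35eb-70ad-4dcd-a441-8c7c9966a236",
        "78faedd8-a250-4e52-ac81-a29d46715a51",
        "48ff35eb-70ad-4dcd-a441-8c7c9966a236",
        "48ff35eb-70ad-4dcd-a441-8c7c9966a236",
        "78faedd8-a250-4e52-ac81-a29d46715a51",
        "48ff35eb-70ad-4dcd-a441-8c7c9966a236",
    ],
    "Serial": ["ABC", "PQR", "ABC", "ABC", "ABC", "ABC"],
    "Time": [
        "12/31/2020 8:37:13 AM",
        "12/03/2019 1:30:00 AM",
        "12/31/2020 8:37:13 AM",
        "01/23/2021 5:18:44 PM",
        "10/23/2020 8:01:08 AM",
        "01/20/2021 8:10:27 PM",
    ],
})

# Assumption: timestamps are month/day/year with a 12-hour clock.
df["Time"] = pd.to_datetime(df["Time"], format="%m/%d/%Y %I:%M:%S %p")

# Dense rank within each (ID, Serial) group, cast to int because
# rank returns floats; the cast is safe here since no Time is missing.
df["Revision"] = (
    df.groupby(["ID", "Serial"])["Time"].rank(method="dense").astype(int)
)
print(df["Revision"].tolist())  # [1, 1, 1, 3, 1, 2]
```

If `Time` can contain missing values, `rank` leaves those rows as NaN and the plain `astype(int)` would fail, so the cast would need to be handled differently in that case.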
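This can be done in two steps. First, drop the duplicate rows, sort the values by Time, and create a counter per ID and Serial.
Then forward-fill the values that are missing on the duplicate rows: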
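# Number the distinct (ID, Serial, Time) rows chronologically; duplicate rows map to NaN
df['Revision'] = df.index.map(df.drop_duplicates(subset=['ID','Serial','Time'],keep='first')\
.sort_values('Time').groupby(['ID','Serial']).cumcount() + 1)
# Forward-fill the NaN left on duplicate rows within each group
df['Revision'] = df.groupby(['ID','Serial'])['Revision'].ffill().astype(int)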
ID Serial Time Revision
0 48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 2020-12-31 08:37:13 1
1 78faedd8-a250-4e52-ac81-a29d46715a51 PQR 2019-12-03 01:30:00 1
2 48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 2020-12-31 08:37:13 1
3 48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 2021-01-23 17:18:44 3
4 78faedd8-a250-4e52-ac81-a29d46715a51 ABC 2020-10-23 08:01:08 1
5 48ff35eb-70ad-4dcd-a441-8c7c9966a236 ABC 2021-01-20 20:10:27 2
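One caveat with the forward-fill step: it copies the value from the preceding row of the same group, which is only correct when a duplicate directly follows its first occurrence within that group, as it does in the sample data. The sketch below uses hypothetical reduced data where that does not hold, and compares forward fill with a variant that merges the counter back on `ID`, `Serial`, and `Time`. The merge variant is a suggestion for illustration, not part of the answer above.

```python
import pandas as pd

# Hypothetical data: the duplicate of row 0 appears after a later revision.
df = pd.DataFrame({
    "ID": ["a", "a", "a"],
    "Serial": ["ABC", "ABC", "ABC"],
    "Time": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01"]),
})
group_cols = ["ID", "Serial"]
keys = group_cols + ["Time"]

# Counter over the distinct (ID, Serial, Time) rows, in chronological order.
unique_rows = df[keys].drop_duplicates().sort_values("Time")
unique_rows["Revision"] = unique_rows.groupby(group_cols).cumcount() + 1

# Forward-fill variant: the duplicate inherits the preceding row's value.
ffilled = df.copy()
ffilled["Revision"] = ffilled.index.map(unique_rows["Revision"])
ffilled["Revision"] = ffilled.groupby(group_cols)["Revision"].ffill().astype(int)

# Merge variant: every row looks up the counter by its own keys.
merged = df.merge(unique_rows, on=keys, how="left")

print(ffilled["Revision"].tolist())  # [1, 2, 2]
print(merged["Revision"].tolist())   # [1, 2, 1]
```

The merge variant gives the duplicate row the same revision as its first occurrence regardless of row order, which matches what the dense rank produces.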