How to get a cumulative count for each ID at each point in time, based on entries in another column of a pandas dataframe
I have a dataframe that looks like this:
CLIENT_ID | ENCOUNTER_DATE | CONDITION |
---|---|---|
8222 | 2020-01-01 | Positive |
8222 | 2020-03-02 | Treated |
8222 | 2020-04-18 | Treated |
8222 | 2020-07-31 | Negative |
8300 | 2017-06-10 | Negative |
8300 | 2017-09-11 | Treated |
8300 | 2018-02-01 | Future Treatment |
8300 | 2018-04-01 | Treated |
8300 | 2018-05-31 | Negative |
8400 | 2020-12-31 | Future Treatment |
8401 | 2017-08-29 | Negative |
8401 | 2017-09-15 | Positive |
8500 | 2018-10-10 | Positive |
Here is the code to create the df:
import pandas as pd

df = pd.DataFrame({"CLIENT_ID": [8222, 8222, 8222, 8222, 8300, 8300, 8300, 8300, 8300, 8400, 8401, 8401, 8500],
                   "ENCOUNTER_DATE": ['2020-01-01', '2020-03-02', '2020-04-18', '2020-07-31', '2017-06-10', '2017-09-11', '2018-02-01', '2018-04-01', '2018-05-31', '2020-12-31', '2017-08-29', '2017-09-15', '2018-10-10'],
                   "CONDITION": ["positive", "treated", "treated", "negative", "negative", "treated", "future treatment", "treated", "negative", "future treatment", "negative", "positive", "positive"]})
manage_condition_list = ['positive','treated','future treatment']
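The table is sorted by `CLIENT_ID` and `ENCOUNTER_DATE`.
For each row, I want the cumulative count of how many times that client (`CLIENT_ID`) has had a `CONDITION` that appears in `manage_condition_list`, up to and including that point in time. The final dataframe or output would look like this: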
CLIENT_ID | ENCOUNTER_DATE | CONDITION | CONDITION_COUNTS |
---|---|---|---|
8222 | 2020-01-01 | Positive | 1 |
8222 | 2020-03-02 | Treated | 2 |
8222 | 2020-04-18 | Treated | 3 |
8222 | 2020-07-31 | Negative | 3 |
8300 | 2017-06-10 | Negative | 0 |
8300 | 2017-09-11 | Treated | 1 |
8300 | 2018-02-01 | Future Treatment | 2 |
8300 | 2018-04-01 | Treated | 3 |
8300 | 2018-05-31 | Negative | 3 |
8400 | 2020-12-31 | Future Treatment | 1 |
8401 | 2017-08-29 | Negative | 0 |
8401 | 2017-09-15 | Positive | 1 |
8500 | 2018-10-10 | Positive | 1 |
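Note that the real data contains many more entries that are not in `manage_condition_list`. I was thinking of some combination of `df.where` and `cumcount() + 1`, but I'm not sure how to put it together.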
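One way the `cumcount() + 1` idea might be completed is sketched below. It is only an illustration on a small made-up subset of the data, and it uses boolean indexing rather than `df.where`: the matching rows are numbered per client, the result is aligned back to the full index, and the gaps are forward-filled within each client. The variable names `mask`, `running` and `counts` are assumptions introduced for this sketch.

```python
import pandas as pd

# Small hypothetical subset of the question's data.
df = pd.DataFrame({
    "CLIENT_ID": [8222, 8222, 8222, 8300, 8300],
    "CONDITION": ["positive", "treated", "negative", "negative", "treated"],
})
manage_condition_list = ["positive", "treated", "future treatment"]

mask = df["CONDITION"].isin(manage_condition_list)

# Number only the matching rows within each client, starting at 1.
running = df[mask].groupby("CLIENT_ID").cumcount() + 1

# Align back to the full index (non-matching rows become NaN), carry the last
# count forward within each client, and use 0 where nothing has matched yet.
counts = (
    running.reindex(df.index)
    .groupby(df["CLIENT_ID"]).ffill()
    .fillna(0)
    .astype(int)
)
df["CONDITION_COUNTS"] = counts
print(df)
```

This takes more steps than a cumulative sum over the boolean mask, so it is mainly useful for seeing why `cumcount()` alone is not enough.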
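Use `isin` to get True wherever the value in column CONDITION is in the list `manage_condition_list`, then take `groupby.cumsum` of that boolean column, grouped by CLIENT_ID. Each True counts as 1, so the cumulative sum is the running number of matches per client: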
df['CONDITION_COUNTS'] = (
    df['CONDITION'].isin(manage_condition_list)
    .groupby(df['CLIENT_ID']).cumsum()
)
print(df)
    CLIENT_ID  ENCOUNTER_DATE         CONDITION  CONDITION_COUNTS
0        8222      2020-01-01          positive                 1
1        8222      2020-03-02           treated                 2
2        8222      2020-04-18           treated                 3
3        8222      2020-07-31          negative                 3
4        8300      2017-06-10          negative                 0
5        8300      2017-09-11           treated                 1
6        8300      2018-02-01  future treatment                 2
7        8300      2018-04-01           treated                 3
8        8300      2018-05-31          negative                 3
9        8400      2020-12-31  future treatment                 1
10       8401      2017-08-29          negative                 0
11       8401      2017-09-15          positive                 1
12       8500      2018-10-10          positive                 1
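The result above relies on two properties of the sample data: the rows are already in date order within each client, and the CONDITION values are lower case like the entries in `manage_condition_list`. If the real data does not guarantee that (the tables in the question show capitalised values such as "Positive"), a more defensive variant might look like the sketch below. The `raw` frame and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical raw data: rows out of order and CONDITION in mixed case.
raw = pd.DataFrame({
    "CLIENT_ID": [8300, 8222, 8300, 8222],
    "ENCOUNTER_DATE": ["2017-09-11", "2020-03-02", "2017-06-10", "2020-01-01"],
    "CONDITION": ["Treated", "Treated", "Negative", "Positive"],
})
manage_condition_list = ["positive", "treated", "future treatment"]

# Parse the dates and sort so the running total follows time within each client.
raw["ENCOUNTER_DATE"] = pd.to_datetime(raw["ENCOUNTER_DATE"])
out = raw.sort_values(["CLIENT_ID", "ENCOUNTER_DATE"]).reset_index(drop=True)

# isin is case-sensitive, so compare a lower-cased copy of the column.
mask = out["CONDITION"].str.lower().isin(manage_condition_list)
out["CONDITION_COUNTS"] = mask.groupby(out["CLIENT_ID"]).cumsum()
print(out)
```

The original CONDITION column is left untouched; only the comparison uses the lower-cased values.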
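I'm not sure I understood the logic behind cum_counts, but I hope this helps:
df['Cum_Count'] = df.groupby('CLIENT_ID').cumcount()  # cumcount() does not take a column name
df
or
df['Cum_Count'] = df.groupby('CLIENT_ID')['CONDITION'].cumcount()
or
df['CONDITION_COUNTS'] = (df['CONDITION'].isin(manage_condition_list).groupby(df['CLIENT_ID']).cumcount())
Note that all three number every row within its CLIENT_ID starting from 0, whether or not CONDITION is in manage_condition_list, so they give a per-client row index rather than a count of matching conditions.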
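To make the difference between the two approaches concrete, here is a small side-by-side sketch on a made-up subset of the data. The column names `row_number` and `matches_so_far` are chosen for this illustration only.

```python
import pandas as pd

# Small hypothetical subset of the question's data.
df = pd.DataFrame({
    "CLIENT_ID": [8300, 8300, 8300, 8401, 8401],
    "CONDITION": ["negative", "treated", "negative", "negative", "positive"],
})
manage_condition_list = ["positive", "treated", "future treatment"]

# cumcount() numbers every row within its group from 0, whatever CONDITION holds.
row_number = df.groupby("CLIENT_ID").cumcount()

# cumsum() over the boolean mask only increases on rows whose CONDITION is listed.
mask = df["CONDITION"].isin(manage_condition_list)
matches_so_far = mask.groupby(df["CLIENT_ID"]).cumsum()

comparison = pd.DataFrame({"row_number": row_number, "matches_so_far": matches_so_far})
print(comparison)
```

Here `row_number` keeps climbing on the "negative" rows, while `matches_so_far` stays flat on them, which is the behaviour the question asks for.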