How to get the cumulative count for each ID at each point in time, based on entries in another column of a pandas dataframe

I have a dataframe that looks like this:

CLIENT_ID ENCOUNTER_DATE CONDITION
8222 2020-01-01 Positive
8222 2020-03-02 Treated
8222 2020-04-18 Treated
8222 2020-07-31 Negative
8300 2017-06-10 Negative
8300 2017-09-11 Treated
8300 2018-02-01 Future Treatment
8300 2018-04-01 Treated
8300 2018-05-31 Negative
8400 2020-12-31 Future Treatment
8401 2017-08-29 Negative
8401 2017-09-15 Positive
8500 2018-10-10 Positive

Here is the code to create the df:

import pandas as pd

df = pd.DataFrame({"CLIENT_ID": [8222, 8222, 8222, 8222, 8300, 8300, 8300, 8300, 8300, 8400, 8401, 8401, 8500],
                   "ENCOUNTER_DATE": ['2020-01-01', '2020-03-02', '2020-04-18', '2020-07-31', '2017-06-10', '2017-09-11', '2018-02-01', '2018-04-01', '2018-05-31', '2020-12-31', '2017-08-29', '2017-09-15', '2018-10-10'],
                   "CONDITION": ["positive", "treated", "treated", "negative", "negative", "treated", "future treatment", "treated", "negative", "future treatment", "negative", "positive", "positive"]})

manage_condition_list = ['positive','treated','future treatment']

The table is sorted by CLIENT_ID and ENCOUNTER_DATE.
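
If the rows are not already sorted by CLIENT_ID and ENCOUNTER_DATE, a minimal sketch for putting them in that order might look like the following. It assumes ENCOUNTER_DATE holds ISO-formatted strings, as in the sample above; the small frame `demo` is made up for illustration and is not the real data.

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample (not the data from the question)
demo = pd.DataFrame({
    "CLIENT_ID": [8300, 8222, 8300, 8222],
    "ENCOUNTER_DATE": ["2017-09-11", "2020-03-02", "2017-06-10", "2020-01-01"],
    "CONDITION": ["treated", "treated", "negative", "positive"],
})

# Parse the strings into real datetimes so the ordering is chronological
demo["ENCOUNTER_DATE"] = pd.to_datetime(demo["ENCOUNTER_DATE"])

# Sort by client, then by date; reset the index so it matches the new row order
demo = demo.sort_values(["CLIENT_ID", "ENCOUNTER_DATE"]).reset_index(drop=True)
print(demo)
```

Any running count per client only reflects "that point in time" if the rows are in chronological order within each CLIENT_ID, which is why the sort comes first.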

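For each CLIENT_ID, I would like to get the cumulative count (number of times), up to that point in time, that the client has had a CONDITION that is in the list manage_condition_list. The final dataframe or output would look like this: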

CLIENT_ID ENCOUNTER_DATE CONDITION CONDITION_COUNTS
8222 2020-01-01 Positive 1
8222 2020-03-02 Treated 2
8222 2020-04-18 Treated 3
8222 2020-07-31 Negative 3
8300 2017-06-10 Negative 0
8300 2017-09-11 Treated 1
8300 2018-02-01 Future Treatment 2
8300 2018-04-01 Treated 3
8300 2018-05-31 Negative 3
8400 2020-12-31 Future Treatment 1
8401 2017-08-29 Negative 0
8401 2017-09-15 Positive 1
8500 2018-10-10 Positive 1

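Please note that the real data contains many more entries that are not in manage_condition_list. I was thinking of some combination of df.where and cumcount() + 1, but I am not sure.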

Use isin to get True wherever the value in the CONDITION column is in manage_condition_list, then group by the CLIENT_ID column and take groupby.cumsum:

df['CONDITION_COUNTS'] = (
    df['CONDITION'].isin(manage_condition_list)
      .groupby(df['CLIENT_ID']).cumsum()
)
print(df)
    CLIENT_ID ENCOUNTER_DATE         CONDITION  CONDITION_COUNTS
0        8222     2020-01-01          positive                 1
1        8222     2020-03-02           treated                 2
2        8222     2020-04-18           treated                 3
3        8222     2020-07-31          negative                 3
4        8300     2017-06-10          negative                 0
5        8300     2017-09-11           treated                 1
6        8300     2018-02-01  future treatment                 2
7        8300     2018-04-01           treated                 3
8        8300     2018-05-31          negative                 3
9        8400     2020-12-31  future treatment                 1
10       8401     2017-08-29          negative                 0
11       8401     2017-09-15          positive                 1
12       8500     2018-10-10          positive                 1
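
One thing to watch: isin compares strings exactly, so 'Positive' would not match 'positive'. If the real data has mixed capitalization or stray spaces (the display table in the question hints at this, but that is an assumption), a sketch like the following normalizes the values first. The frame `demo` and its values are made up for illustration.

```python
import pandas as pd

manage_condition_list = ['positive', 'treated', 'future treatment']

# Hypothetical sample with inconsistent case and whitespace
demo = pd.DataFrame({
    "CLIENT_ID": [1, 1, 1, 2, 2],
    "CONDITION": ["Positive", "Negative", " treated", "Future Treatment", "negative"],
})

# Strip whitespace and lowercase so the comparison with the lowercase list is reliable
normalized = demo["CONDITION"].str.strip().str.lower()

# True where the condition is managed; cumsum per client gives the running count
is_managed = normalized.isin(manage_condition_list)
demo["CONDITION_COUNTS"] = is_managed.groupby(demo["CLIENT_ID"]).cumsum()
print(demo)
```

The original CONDITION column is left untouched; only the temporary `normalized` Series is used for matching.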

I am not sure I understood the logic behind cum_counts, but I hope this helps:

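# cumcount() takes no column argument; it numbers the rows within each CLIENT_ID group, starting at 0
df['Cum_Count'] = df.groupby('CLIENT_ID').cumcount()
df

# Same result, selecting the CONDITION column first
df['Cum_Count'] = df.groupby('CLIENT_ID')['CONDITION'].cumcount()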

# cumsum() adds up the True values per client; cumcount() would only number the rows
df['CONDITION_COUNTS'] = (
    df['CONDITION'].isin(manage_condition_list)
      .groupby(df['CLIENT_ID']).cumsum()
)
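
To see why cumsum rather than cumcount produces the requested numbers, a small side-by-side comparison may help. This is only a sketch on a made-up frame `demo`; the column name ROW_NUMBER is introduced here for illustration.

```python
import pandas as pd

manage_condition_list = ['positive', 'treated', 'future treatment']

# Hypothetical single-client sample
demo = pd.DataFrame({
    "CLIENT_ID": [8300, 8300, 8300, 8300],
    "CONDITION": ["negative", "treated", "future treatment", "negative"],
})

# Boolean mask: True where the condition is in the managed list
mask = demo["CONDITION"].isin(manage_condition_list)
grouped = mask.groupby(demo["CLIENT_ID"])

# cumcount: position of each row within its group, regardless of True/False
demo["ROW_NUMBER"] = grouped.cumcount()
# cumsum: running total of True values, i.e. managed conditions seen so far
demo["CONDITION_COUNTS"] = grouped.cumsum()
print(demo)
```

ROW_NUMBER keeps increasing on every row, while CONDITION_COUNTS only increases on rows whose condition is in the list.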