How to create an indicator column to indicate a specific change from a previous entry in a dataframe?
Situation:
I currently have a dataframe of clients, sorted by CLIENT_ID and CURRENT_DATE_STATUS, as shown below:
CLIENT_ID | CURRENT_DATE_STATUS | STATUS |
---|---|---|
10002 | 2017-07-21 | STARTED |
10002 | 2017-07-21 | STARTED |
10002 | 2018-07-01 | CHURNED |
10002 | 2018-07-01 | CHURNED |
10002 | 2019-01-01 | RESTARTED |
11811 | 2019-08-15 | STARTED |
11811 | 2019-08-15 | STARTED |
11811 | 2019-12-31 | RESTARTED |
22101 | 2020-03-11 | STARTED |
22101 | 2020-03-11 | STARTED |
22101 | 2020-03-11 | STARTED |
22101 | 2020-11-01 | CHURNED |
22300 | 2018-05-06 | STARTED |
22300 | 2018-05-06 | STARTED |
Question:
How can I create a Boolean (1 or 0) indicator column that indicates:
- whether, for each CLIENT_ID, the STATUS entry has changed from the previous entry to CHURNED or RESTARTED.
Objective:
The resulting dataframe should look like this:
CLIENT_ID | CURRENT_DATE_STATUS | STATUS | STOPPED |
---|---|---|---|
10002 | 2017-07-21 | STARTED | 0 |
10002 | 2017-07-21 | STARTED | 0 |
10002 | 2018-07-01 | CHURNED | 1 |
10002 | 2018-07-01 | CHURNED | 0 |
10002 | 2019-01-01 | RESTARTED | 1 |
11811 | 2019-08-15 | STARTED | 0 |
11811 | 2019-08-15 | STARTED | 0 |
11811 | 2019-12-31 | RESTARTED | 1 |
22101 | 2020-03-11 | STARTED | 0 |
22101 | 2020-03-11 | STARTED | 0 |
22101 | 2020-03-11 | STARTED | 0 |
22101 | 2020-11-01 | CHURNED | 1 |
22300 | 2018-05-06 | STARTED | 0 |
22300 | 2018-05-06 | STARTED | 0 |
Code to generate the dataframe above:
import pandas as pd
data = {'CLIENT_ID':[10002,10002,10002,10002,10002,11811,11811,11811,22101,22101,22101,22101,22300,22300],
'CURRENT_DATE_STATUS':['2017-07-21','2017-07-21','2018-07-01','2018-07-01','2019-07-01','2019-08-15','2019-08-15','2019-12-31','2020-03-11','2020-03-11','2020-03-11','2020-11-01','2018-05-06','2018-05-06'],
'STATUS':['STARTED','STARTED','CHURNED','CHURNED','RESTARTED','STARTED','STARTED','RESTARTED','STARTED','STARTED','STARTED','CHURNED','STARTED','STARTED']}
df = pd.DataFrame(data)
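Any comparison with the previous entry depends on the row order described above. The following is a minimal sketch of enforcing that order, assuming the frame might arrive unsorted and that the dates are ISO-formatted strings. The three-row frame is made up for illustration and is not the data from the question.

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample (not the question's data).
df = pd.DataFrame({
    "CLIENT_ID": [11811, 10002, 10002],
    "CURRENT_DATE_STATUS": ["2019-08-15", "2018-07-01", "2017-07-21"],
    "STATUS": ["STARTED", "CHURNED", "STARTED"],
})

# Parse the date strings so they sort chronologically as real datetimes.
df["CURRENT_DATE_STATUS"] = pd.to_datetime(df["CURRENT_DATE_STATUS"])

# Order by client first, then by date, and rebuild a clean 0..n-1 index.
df = df.sort_values(["CLIENT_ID", "CURRENT_DATE_STATUS"]).reset_index(drop=True)
print(df)
```

After sorting, each client's rows are contiguous and chronological, which is what a "previous entry" comparison needs.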
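You can compare the current values for equality with Series.eq and the values shifted per group (via DataFrameGroupBy.shift) for inequality with Series.ne, chain the conditions with & for bitwise AND and | for bitwise OR, and finally convert the result to integers: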
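# Previous STATUS within each CLIENT_ID (NaN for a client's first row)
s = df.groupby('CLIENT_ID')['STATUS'].shift()
# Row is RESTARTED and the previous row of the same client was not
m1 = df['STATUS'].eq('RESTARTED') & s.ne('RESTARTED')
# Row is CHURNED and the previous row of the same client was not
m2 = df['STATUS'].eq('CHURNED') & s.ne('CHURNED')
df['STOPPED'] = (m1 | m2).astype(int)
print(df)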
    CLIENT_ID  CURRENT_DATE_STATUS     STATUS  STOPPED
0       10002           2017-07-21    STARTED        0
1       10002           2017-07-21    STARTED        0
2       10002           2018-07-01    CHURNED        1
3       10002           2018-07-01    CHURNED        0
4       10002           2019-07-01  RESTARTED        1
5       11811           2019-08-15    STARTED        0
6       11811           2019-08-15    STARTED        0
7       11811           2019-12-31  RESTARTED        1
8       22101           2020-03-11    STARTED        0
9       22101           2020-03-11    STARTED        0
10      22101           2020-03-11    STARTED        0
11      22101           2020-11-01    CHURNED        1
12      22300           2018-05-06    STARTED        0
13      22300           2018-05-06    STARTED        0
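To see why the shift is done per group rather than on the whole column, here is a small sketch on a made-up four-row frame (not the question's data). The names `prev_grouped`, `prev_plain`, `flag_grouped` and `flag_plain` are illustrative only. It assumes that a client whose very first row is already CHURNED should be flagged, which is how the masks above behave because a missing previous value compares as "not equal".

```python
import pandas as pd

# Hypothetical sample: client 11811 starts directly with CHURNED.
df = pd.DataFrame({
    "CLIENT_ID": [10002, 10002, 11811, 11811],
    "STATUS": ["STARTED", "CHURNED", "CHURNED", "CHURNED"],
})

# Previous status within each client; a client's first row gets NaN.
prev_grouped = df.groupby("CLIENT_ID")["STATUS"].shift()
# Previous status ignoring client boundaries (for comparison only).
prev_plain = df["STATUS"].shift()

is_churned = df["STATUS"].eq("CHURNED")
flag_grouped = (is_churned & prev_grouped.ne("CHURNED")).astype(int)
flag_plain = (is_churned & prev_plain.ne("CHURNED")).astype(int)

print(df.assign(PREV=prev_grouped, GROUPED=flag_grouped, PLAIN=flag_plain))
```

With the plain shift, the first row of client 11811 is compared against the last row of client 10002, so its change goes unflagged. The grouped shift keeps each client's history separate.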
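Another solution is to compare each value with the previous (shifted) value of the same group, test whether the current value is in a list with Series.isin, and finally chain both masks with & for bitwise AND: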
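# True where STATUS differs from the previous row of the same client
m3 = df.groupby('CLIENT_ID')['STATUS'].shift().ne(df['STATUS'])
# True where STATUS is one of the statuses of interest
m4 = df['STATUS'].isin(["CHURNED", "RESTARTED"])
df['STOPPED'] = (m3 & m4).astype(int)
print(df)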
    CLIENT_ID  CURRENT_DATE_STATUS     STATUS  STOPPED
0       10002           2017-07-21    STARTED        0
1       10002           2017-07-21    STARTED        0
2       10002           2018-07-01    CHURNED        1
3       10002           2018-07-01    CHURNED        0
4       10002           2019-07-01  RESTARTED        1
5       11811           2019-08-15    STARTED        0
6       11811           2019-08-15    STARTED        0
7       11811           2019-12-31  RESTARTED        1
8       22101           2020-03-11    STARTED        0
9       22101           2020-03-11    STARTED        0
10      22101           2020-03-11    STARTED        0
11      22101           2020-11-01    CHURNED        1
12      22300           2018-05-06    STARTED        0
13      22300           2018-05-06    STARTED        0
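The second approach generalizes easily to any set of statuses. Below is a sketch that wraps it in a helper and checks that it matches the first approach on the question's sample data. The function name `flag_transitions` and its parameters are hypothetical, introduced here only for illustration.

```python
import pandas as pd

data = {
    'CLIENT_ID': [10002, 10002, 10002, 10002, 10002, 11811, 11811, 11811,
                  22101, 22101, 22101, 22101, 22300, 22300],
    'STATUS': ['STARTED', 'STARTED', 'CHURNED', 'CHURNED', 'RESTARTED',
               'STARTED', 'STARTED', 'RESTARTED', 'STARTED', 'STARTED',
               'STARTED', 'CHURNED', 'STARTED', 'STARTED'],
}
df = pd.DataFrame(data)


def flag_transitions(frame, targets, group_col='CLIENT_ID', status_col='STATUS'):
    """Return 1 where status_col differs from the previous row of the
    same group and the new value is one of `targets`, else 0."""
    previous = frame.groupby(group_col)[status_col].shift()
    changed = previous.ne(frame[status_col])
    return (changed & frame[status_col].isin(targets)).astype(int)


second = flag_transitions(df, ['CHURNED', 'RESTARTED'])

# First approach, written out per status, for comparison.
s = df.groupby('CLIENT_ID')['STATUS'].shift()
m1 = df['STATUS'].eq('RESTARTED') & s.ne('RESTARTED')
m2 = df['STATUS'].eq('CHURNED') & s.ne('CHURNED')
first = (m1 | m2).astype(int)

print(second.tolist())
print(first.tolist() == second.tolist())
```

Both masks express the same condition (the current status is a target and differs from the previous one), so the helper mainly saves writing one mask per status when the list of statuses grows.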