How to implement the following logic in pandas?
My initial dataframe looks like this:
import pandas as pd
data = {'document':['abc','abc','abc','abc','xyz','xyz','xyz','test','test','test','test','test','test','test','test','test','stackover','stackover','stackover','stackover','stackover'],
'version':[1,2,3,4,1,2,3,1,2,3,4,5,6,7,8,9,3,4,5,6,7],
'status': [100,100,100,16,200,200,11,11,11,11,15,15,11,15,15,15,10,10,100,15,10]}
df = pd.DataFrame(data)
df
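Now I want to add a 'traffic light' column. The conditional formatting of the cells is only there for better visualization.
The traffic light colours are determined as follows:
'status' 100 or 200: the document has been released
All other 'status' values (e.g. 16 or 10): not released
green: the highest version of a document must always be 'green'
red: a higher version exists that has been released (status 100 or 200)
yellow: a higher version exists, but it has not been released (status other than 100 or 200)
Can this be done directly with pandas functions, or do I need numpy? Perhaps the best approach is to build the yellow and red logic first and then set the highest version to green?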
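To make the rules concrete, here is a minimal, deliberately slow reference sketch for a single document. It assumes one reading of the rules: green is checked first, and red wins over yellow whenever any higher version is released. The names `RELEASED` and `traffic_light` are made up for this illustration.

```python
import pandas as pd

RELEASED = [100, 200]  # statuses that count as "released", per the question


def traffic_light(group: pd.DataFrame) -> pd.Series:
    """Colour every row of ONE document according to the three rules."""
    top = group["version"].max()
    colours = []
    for version in group["version"]:
        # all rows of this document with a higher version than the current row
        higher = group[group["version"] > version]
        if version == top:
            colours.append("green")
        elif higher["status"].isin(RELEASED).any():
            colours.append("red")
        else:
            colours.append("yellow")
    return pd.Series(colours, index=group.index)


sample = pd.DataFrame({
    "document": ["abc"] * 4,
    "version": [1, 2, 3, 4],
    "status": [100, 100, 100, 16],
})
sample["traffic light"] = traffic_light(sample)
print(sample)
```

A row-by-row loop like this is only useful as a specification to check a vectorized solution against; it would be too slow for large frames.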
Try numpy.select:
import numpy as np
# green: the row holds the highest version of its document
green = df["version"].eq(df.groupby("document")["version"].transform("max"))
# red: the row's version is below the highest released version of its document
max_released = df[df["status"].isin([100, 200])].groupby("document")["version"].max()
red = df["version"].lt(df["document"].map(max_released))
df["traffic light"] = np.select([green, red], ["green", "red"], "yellow")
>>> df
     document  version  status  traffic light
0         abc        1     100            red
1         abc        2     100            red
2         abc        3     100         yellow
3         abc        4      16          green
4         xyz        1     200            red
5         xyz        2     200         yellow
6         xyz        3      11          green
7        test        1      11         yellow
8        test        2      11         yellow
9        test        3      11         yellow
10       test        4      15         yellow
11       test        5      15         yellow
12       test        6      11         yellow
13       test        7      15         yellow
14       test        8      15         yellow
15       test        9      15          green
16  stackover        3      10            red
17  stackover        4      10            red
18  stackover        5     100         yellow
19  stackover        6      15         yellow
20  stackover        7      10          green
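It may help to see why every non-top `test` row ends up yellow even though no rule mentions that case explicitly. The sketch below uses a reduced frame (an assumption made for brevity) and spells out the intermediate steps: a document without any released version is missing from the lookup, `map` leaves NaN for it, and a comparison against NaN is False.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "document": ["xyz", "xyz", "xyz", "test", "test"],
    "version": [1, 2, 3, 1, 2],
    "status": [200, 200, 11, 11, 15],
})

# Highest released version per document; "test" has none, so it is absent here.
max_released = df[df["status"].isin([100, 200])].groupby("document")["version"].max()

# map() leaves NaN for documents that are missing from max_released ...
mapped = df["document"].map(max_released)
# ... and any comparison against NaN is False, so those rows can never be red.
red = df["version"].lt(mapped)
green = df["version"].eq(df.groupby("document")["version"].transform("max"))

# np.select uses the first condition that is True for each row,
# so green takes priority over red and "yellow" is the fallback.
df["traffic light"] = np.select([green, red], ["green", "red"], default="yellow")
print(df)
```

Because the conditions are evaluated in list order, swapping `green` and `red` in the call would only matter for rows where both are True, which cannot happen here: the top version is never below the highest released version.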
IIUC, you can use:
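import numpy as np
# group by document, with a boolean flag marking the released rows
g = df.assign(released=df['status'].isin([100, 200])).groupby('document')
# green: the row holds the highest version of its document
green = df['version'].eq(g['version'].transform('max'))
# red: a later row of the same document is released (rows must be sorted by version);
# transform returns the result in the original row order, which np.select relies on
next_released = g['released'].transform(lambda s: s[::-1].cummax().shift(1, fill_value=False)[::-1])
# select values: the first matching condition wins, 'yellow' is the default
df['traffic light'] = np.select([green, next_released], ['green', 'red'], 'yellow')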
Output:
     document  version  status  traffic light
0         abc        1     100            red
1         abc        2     100            red
2         abc        3     100         yellow
3         abc        4      16          green
4         xyz        1     200            red
5         xyz        2     200         yellow
6         xyz        3      11          green
7        test        1      11         yellow
8        test        2      11         yellow
9        test        3      11         yellow
10       test        4      15         yellow
11       test        5      15         yellow
12       test        6      11         yellow
13       test        7      15         yellow
14       test        8      15         yellow
15       test        9      15          green
16  stackover        3      10            red
17  stackover        4      10            red
18  stackover        5     100         yellow
19  stackover        6      15         yellow
20  stackover        7      10          green
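The reversed `cummax` / `shift` trick is easier to follow on a single group. The sketch below walks through it step by step on a made-up flag series (the values and variable names are only for illustration). Note that it relies on the rows of each document already being sorted by ascending version.

```python
import pandas as pd

# One document, ordered by version 1..5:
# released, released, not released, released, not released
released = pd.Series([True, True, False, True, False], index=[1, 2, 3, 4, 5])

reversed_flags = released[::-1]                           # walk from the highest version down
seen_release = reversed_flags.cummax()                    # True once any release has been seen
later_release = seen_release.shift(1, fill_value=False)   # exclude the row itself
has_later_release = later_release[::-1]                   # back to ascending version order

print(has_later_release.tolist())
```

Version 4 is released itself, but nothing after it is, so it is not flagged; versions 1 to 3 all have the release at version 4 ahead of them. If the frame is not sorted, calling `df.sort_values(['document', 'version'])` first should make the trick safe to apply.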
Here is a step-by-step solution that adds 2 helper columns:
import pandas as pd
import numpy
data = {'document':['abc','abc','abc','abc','xyz','xyz','xyz','test','test','test','test','test','test','test','test','test','stackover','stackover','stackover','stackover','stackover'],
'version':[1,2,3,4,1,2,3,1,2,3,4,5,6,7,8,9,3,4,5,6,7],
'status': [100,100,100,16,200,200,11,11,11,11,15,15,11,15,15,15,10,10,100,15,10]}
df = pd.DataFrame(data)
df['max version'] = df.groupby('document')['version'].transform('max')
# only status 100 or 200 counts as released (a plain ">= 100" would also match e.g. 150)
df['max release version'] = df.loc[df['status'].isin([100, 200])].groupby('document')['version'].transform('max')
df['max release version'] = df.groupby('document')['max release version'].transform('max')
# green: highest version; red: below the highest released version; yellow: everything else
# (comparisons against NaN are False, so documents without a release never turn red)
df['traffic light'] = numpy.where(df['version'] == df['max version'], 'green',
                      numpy.where(df['version'] < df['max release version'], 'red', 'yellow'))
df
Afterwards you can drop the helper columns:
df.drop(['max version','max release version'], axis='columns', inplace=True)
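Since the question mentions conditional formatting for visualization, here is a rough sketch of how the finished column could be turned into cell colours. The CSS values and the helper name `light_css` are arbitrary choices for this example, not something taken from the question.

```python
import pandas as pd

# Hypothetical colour choices; any valid CSS would do.
CSS = {
    "green": "background-color: lightgreen",
    "yellow": "background-color: khaki",
    "red": "background-color: salmon",
}


def light_css(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a same-shaped frame of CSS strings, colouring only 'traffic light'."""
    styles = pd.DataFrame("", index=frame.index, columns=frame.columns)
    styles["traffic light"] = frame["traffic light"].map(CSS).fillna("")
    return styles


df = pd.DataFrame({
    "document": ["abc", "abc", "abc", "abc"],
    "version": [1, 2, 3, 4],
    "status": [100, 100, 100, 16],
    "traffic light": ["red", "red", "yellow", "green"],
})

css = light_css(df)
print(css["traffic light"].tolist())
```

In a notebook, passing the function to the Styler with `df.style.apply(light_css, axis=None)` should render the coloured cells; the Styler needs the optional jinja2 dependency to be installed.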