How can I implement the following logic in pandas?

My initial DataFrame looks like this:

import pandas as pd

data = {'document':['abc','abc','abc','abc','xyz','xyz','xyz','test','test','test','test','test','test','test','test','test','stackover','stackover','stackover','stackover','stackover'],
        'version':[1,2,3,4,1,2,3,1,2,3,4,5,6,7,8,9,3,4,5,6,7],
        'status': [100,100,100,16,200,200,11,11,11,11,15,15,11,15,15,15,10,10,100,15,10]}

df = pd.DataFrame(data)

df

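Now I want to add a column 'traffic light'. The conditional formatting of the cells is only for better visualization.

The traffic light colours are determined as follows:

'status' 100 or 200: the document has been released

All other 'status' values (e.g. 16 or 10): not released

green: the highest version of a document is always 'green'

red: a higher version exists that has been released (status 100 or 200)

yellow: a higher version exists, but no higher version has been released (status other than 100 or 200)

Can this be done directly with pandas functions, or do I need numpy? Would the best approach be to build the yellow and red logic first and then set the highest version to green?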

Try numpy.select:

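import numpy as np

# green: rows holding the highest version of their document
green = df["version"].eq(df.groupby("document")["version"].transform("max"))

# red: rows whose version is below the highest released version (status 100 or 200)
# of their document; documents without any release map to NaN, so the comparison is False
max_released = df[df["status"].isin([100, 200])].groupby("document")["version"].max()
red = df["version"].lt(df["document"].map(max_released))

# the first matching condition wins; everything else is yellow
df["traffic light"] = np.select([green, red], ["green", "red"], "yellow")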

>>> df
     document  version  status traffic light
0         abc        1     100           red
1         abc        2     100           red
2         abc        3     100        yellow
3         abc        4      16         green
4         xyz        1     200           red
5         xyz        2     200        yellow
6         xyz        3      11         green
7        test        1      11        yellow
8        test        2      11        yellow
9        test        3      11        yellow
10       test        4      15        yellow
11       test        5      15        yellow
12       test        6      11        yellow
13       test        7      15        yellow
14       test        8      15        yellow
15       test        9      15         green
16  stackover        3      10           red
17  stackover        4      10           red
18  stackover        5     100        yellow
19  stackover        6      15        yellow
20  stackover        7      10         green
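As for whether numpy is needed at all: the same selection can be sketched with pandas alone by starting from "yellow" and overwriting with `Series.mask`. This is only a sketch on a shortened, made-up subset of the question's data; the names `max_version`, `max_released` and `light` are illustrative.

```python
import pandas as pd

# Shortened sample in the same shape as the question's data (illustrative only).
df = pd.DataFrame({
    "document": ["abc", "abc", "abc", "abc", "stackover", "stackover", "stackover"],
    "version": [1, 2, 3, 4, 5, 6, 7],
    "status": [100, 100, 100, 16, 100, 15, 10],
})

# Highest version per document, and highest *released* version per document.
max_version = df.groupby("document")["version"].transform("max")
max_released = df["document"].map(
    df[df["status"].isin([100, 200])].groupby("document")["version"].max()
)

# Start from "yellow", then overwrite: red first, green last so that green wins.
light = pd.Series("yellow", index=df.index)
light = light.mask(df["version"].lt(max_released), "red")
light = light.mask(df["version"].eq(max_version), "green")
df["traffic light"] = light

print(df)
```

The order of the two `mask` calls plays the role of the condition order in `np.select`: whatever is applied last takes precedence.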

IIUC, you can use:

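import numpy as np

# group by document, flagging released rows (status 100 or 200)
g = df.assign(released=df['status'].isin([100, 200])).groupby('document')

# green: the highest version of each document
green = df['version'].eq(g['version'].transform('max'))

# red: a later row of the same document is released
# (assumes rows are sorted by ascending version within each document);
# transform keeps the original index, so the mask lines up with `green`
next_released = g['released'].transform(lambda s: s[::-1].cummax().shift(1, fill_value=False)[::-1])

# select values; anything that is neither green nor red is yellow
df['traffic light'] = np.select([green, next_released], ['green', 'red'], 'yellow')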

Output:

     document  version  status traffic light
0         abc        1     100           red
1         abc        2     100           red
2         abc        3     100        yellow
3         abc        4      16         green
4         xyz        1     200           red
5         xyz        2     200        yellow
6         xyz        3      11         green
7        test        1      11        yellow
8        test        2      11        yellow
9        test        3      11        yellow
10       test        4      15        yellow
11       test        5      15        yellow
12       test        6      11        yellow
13       test        7      15        yellow
14       test        8      15        yellow
15       test        9      15         green
16  stackover        3      10           red
17  stackover        4      10           red
18  stackover        5     100        yellow
19  stackover        6      15        yellow
20  stackover        7      10         green
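The reverse / cummax / shift chain is easier to follow on a single group. The sketch below walks through it step by step for one hypothetical document whose rows are already sorted by ascending version; the intermediate variable names are made up for illustration.

```python
import pandas as pd

# Released flags for one hypothetical document, ordered by ascending version.
released = pd.Series([False, True, False, True, False])

reversed_flags = released[::-1]             # walk from the highest version down
seen_release = reversed_flags.cummax()      # True once any release has been seen
strictly_higher = seen_release.shift(1, fill_value=False)  # exclude the row itself
has_higher_release = strictly_higher[::-1]  # back to the original order

print(has_higher_release.tolist())  # [True, True, True, False, False]
```

The `shift` matters: without it, a released row would count as its own "higher release" and turn red instead of yellow.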

Here is a step-by-step solution that adds 2 helper columns:

import pandas as pd
import numpy

data = {'document':['abc','abc','abc','abc','xyz','xyz','xyz','test','test','test','test','test','test','test','test','test','stackover','stackover','stackover','stackover','stackover'],
        'version':[1,2,3,4,1,2,3,1,2,3,4,5,6,7,8,9,3,4,5,6,7],
        'status': [100,100,100,16,200,200,11,11,11,11,15,15,11,15,15,15,10,10,100,15,10]}

df = pd.DataFrame(data)

df['max version'] = df.groupby('document')['version'].transform('max')
# highest released version (status 100 or 200) per document; NaN if nothing is released
df['max release version'] = df.loc[df['status'].isin([100, 200])].groupby('document')['version'].transform('max')
df['max release version'] = df.groupby('document')['max release version'].transform('max')

# green: highest version; red: a higher released version exists; yellow: everything else
df['traffic light'] = numpy.where(df['version'] == df['max version'], 'green',
                      numpy.where(df['version'] < df['max release version'], 'red',
                                  'yellow'))
df

You can drop the helper columns afterwards:

df.drop(['max version','max release version'], axis='columns', inplace=True)
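To sanity-check any of the approaches above against the rules in the question, a slow but explicit row-by-row version can serve as a reference. This is only a sketch on a small made-up sample; `traffic_light` and `RELEASED` are hypothetical names, and the loop is not meant for large frames.

```python
import pandas as pd

RELEASED = {100, 200}  # statuses that count as released, per the question

def traffic_light(group: pd.DataFrame) -> list:
    """Hypothetical helper: label every row of a single document."""
    top = group["version"].max()
    labels = []
    for version in group["version"]:
        higher = group[group["version"] > version]
        if version == top:
            labels.append("green")    # highest version of the document
        elif higher["status"].isin(RELEASED).any():
            labels.append("red")      # some higher version is released
        else:
            labels.append("yellow")   # higher versions exist, none released
    return labels

# Small sample in the same shape as the question's data (illustrative only).
df = pd.DataFrame({
    "document": ["xyz", "xyz", "xyz", "test", "test"],
    "version": [1, 2, 3, 1, 2],
    "status": [200, 200, 11, 11, 15],
})

expected = pd.Series(index=df.index, dtype=object)
for _, group in df.groupby("document"):
    expected.loc[group.index] = traffic_light(group)

print(expected.tolist())  # ['red', 'yellow', 'green', 'yellow', 'green']
```

Comparing `expected` with the 'traffic light' column produced by a vectorised answer is a quick way to catch edge cases, such as an unreleased version sitting above the last release.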