Flagging a row according to a varying number of neighbouring values in a pandas DataFrame

I have a transactional operation that produces a feed like the following:


import pandas as pd

df = pd.DataFrame({'action':['transacted','transacted','transacted','transacted','undo','transacted','transacted','transacted','transacted','transacted','undo','undo','undo','transacted'],
                   'transaction_count':[10,20,35,60,60,60,80,90,100,10,10,100,90,90]})
        action  transaction_count
0   transacted                 10
1   transacted                 20
2   transacted                 35
3   transacted                 60
4         undo                 60
5   transacted                 60
6   transacted                 80
7   transacted                 90
8   transacted                100
9   transacted                 10
10        undo                 10
11        undo                100
12        undo                 90
13  transacted                 90

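The counts follow a regular pattern, but it is not linear (10-20-35-60-80-90-100-10-20...).

An undo row states which transaction count was cancelled.

Several cancellations produce several undo rows, possibly one right after another.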

# This is an initial apply, to set it up
df['is_undone']=df.apply(lambda x: 1 if x['action']=='undo' else 0, axis=1).shift(-1)
df=df.fillna(0)  # For shift

df=df.loc[df['is_undone']==0]
df=df.fillna(0)
df=df.loc[df['action']!='undo']
df.reset_index(drop=True,inplace=True)

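Unfortunately, this only works for a single undo, not for several consecutive undos. apply does not give access to neighbouring row values, and I can't think of any other solution. It also has to handle about 300k rows, so performance is a concern as well.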

The expected result is:

       action  transaction_count
0  transacted                 10
1  transacted                 20
2  transacted                 35
3  transacted                 60
4  transacted                 80
5  transacted                 90

Thanks in advance!

A slight modification of an excellent earlier answer may give you what you want:

Solution

def undo(frame):
    d = {"transacted": 0, "undo": 1}
    condition = frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1]).mask(lambda x: x==0).bfill(limit=1).fillna(0)
    return frame[condition==0].reset_index(drop=True)
               
result = df.groupby("transaction_count").apply(undo).reset_index(drop=True)
>>> result
       action  transaction_count
0  transacted                 10
1  transacted                 20
2  transacted                 35
3  transacted                 60
4  transacted                 80
5  transacted                 90

Explanation

groupby is used to process each transaction_count separately. For example, consider the case where transaction_count is 10.

frame = df[df["transaction_count"]==10]
>>> frame
        action  transaction_count
0   transacted                 10
9   transacted                 10
10        undo                 10

In the undo function, we first map the action column to a number:

>>> frame["action"].map(d)
0     0
9     0
10    1

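Notice that we want to drop the rows where a 0 (transacted) is immediately followed by a 1 (undo). Above, this corresponds to the rows with index 9 and 10.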

To do this, we use rolling with a lambda to process the frame above 2 rows at a time:

>>> frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1])
0     NaN
9     0.0
10    1.0

Now, mask the 0s to np.nan, bfill (backfill) exactly once, and fillna with 0:

>>> frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1]).mask(lambda x: x==0).bfill(limit=1).fillna(0)
0     0.0
9     1.0
10    1.0
Name: action, dtype: float64

From the above, we want all the rows that are not equal to 1. This is what the undo function returns.
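Since the question mentions about 300k rows, here is a rough sketch of the same "transacted immediately followed by undo" test written with grouped shifts instead of a rolling lambda. It is not benchmarked, and the names code, by_count, hit and next_is_hit are only illustrative. It assumes action contains only 'transacted' and 'undo'.

```python
import pandas as pd

df = pd.DataFrame({
    "action": ["transacted", "transacted", "transacted", "transacted", "undo",
               "transacted", "transacted", "transacted", "transacted", "transacted",
               "undo", "undo", "undo", "transacted"],
    "transaction_count": [10, 20, 35, 60, 60, 60, 80, 90, 100, 10, 10, 100, 90, 90],
})

# Same 0/1 encoding as the dictionary d in the undo function
code = df["action"].map({"transacted": 0, "undo": 1})
by_count = df["transaction_count"]

# Previous row within the same transaction_count (NaN for the first row of a group)
prev_code = code.groupby(by_count).shift(1)

# An undo whose previous row in the group is a transaction: the [0, 1] window
hit = code.eq(1) & prev_code.eq(0)

# The transaction just before such an undo (the role of bfill(limit=1))
next_is_hit = hit.astype(int).groupby(by_count).shift(-1).eq(1)

result = df[~(hit | next_is_hit)].reset_index(drop=True)
print(result)
```

Unlike the groupby/apply version, this keeps the rows in their original order rather than sorted by transaction_count. For the sample data both orders happen to be the same.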

If the transaction_count is unique within a block, groups can be created with:

df['group'] = (df['action'].eq('transacted') &
               df['action'].shift().eq('undo')).cumsum()
        action  transaction_count  group
0   transacted                 10      0
1   transacted                 20      0
2   transacted                 35      0
3   transacted                 60      0
4         undo                 60      0
5   transacted                 60      1
6   transacted                 80      1
7   transacted                 90      1
8   transacted                100      1
9   transacted                 10      1
10        undo                 10      1
11        undo                100      1
12        undo                 90      1
13  transacted                 90      2

Then drop_duplicates can be used to remove the duplicated transaction_count values within each group:

df = (df.drop_duplicates(['transaction_count', 'group'], keep=False)
      .drop('group', axis=1)
      .reset_index(drop=True))

df:

       action  transaction_count
0  transacted                 10
1  transacted                 20
2  transacted                 35
3  transacted                 60
4  transacted                 80
5  transacted                 90
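
To see why the uniqueness condition matters, here is a small made-up frame (not from the question) where the count 10 is transacted twice in the same block and only the second one is undone. The name demo is hypothetical.

```python
import pandas as pd

# Hypothetical data: transaction_count 10 appears twice before a single undo
demo = pd.DataFrame({
    "action": ["transacted", "transacted", "transacted", "undo"],
    "transaction_count": [10, 20, 10, 10],
})

demo["group"] = (demo["action"].eq("transacted") &
                 demo["action"].shift().eq("undo")).cumsum()

out = (demo.drop_duplicates(["transaction_count", "group"], keep=False)
           .drop("group", axis=1)
           .reset_index(drop=True))

# keep=False removes all three rows with count 10, including the first
# transaction that was never undone, so only the count-20 row is left
print(out)
```

The first transaction with count 10 should have survived, so this approach is only safe when a count cannot repeat inside one block.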

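If the counts are not unique within a block, groups can be created with a reverse pass, which produces matching IDs that link each transaction to its undo: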

s = df['action'].shift()
m = df['action'].ne(s).cumsum()
df['group'] = (df['action'].eq('transacted') & s.eq('undo')).cumsum()
df['new'] = (
    df.groupby(['action', m]).cumcount()
        .mask(df['action'].eq('transacted'),
              df.loc[::-1].groupby(['action', m]).cumcount())
)
        action  transaction_count  group  new
0   transacted                 10      0    3
1   transacted                 20      0    2
2   transacted                 35      0    1
3   transacted                 60      0    0
4         undo                 60      0    0
5   transacted                 60      1    4
6   transacted                 80      1    3
7   transacted                 90      1    2  # Matches Undo 2
8   transacted                100      1    1  # Matches Undo 1
9   transacted                 10      1    0  # Matches Undo 0
10        undo                 10      1    0  # Undo 0
11        undo                100      1    1  # Undo 1
12        undo                 90      1    2  # Undo 2
13  transacted                 90      2    0

Then duplicates can be dropped across group and new:

df = (df
      .drop_duplicates(['group', 'new'], keep=False)
      .drop(['group', 'new'], axis=1)
      .reset_index(drop=True))
       action  transaction_count
0  transacted                 10
1  transacted                 20
2  transacted                 35
3  transacted                 60
4  transacted                 80
5  transacted                 90
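
For reuse, the reverse-pass steps above could be wrapped in a function along these lines. This is only a sketch: the name drop_undone and the helper names are made up, it assumes a unique index, and it builds the group and new keys without adding columns to the input frame.

```python
import pandas as pd

def drop_undone(df):
    """Drop each undo row and the transaction it pairs with (illustrative helper)."""
    action = df["action"]
    prev = action.shift()
    # Id of each run of consecutive identical actions
    block = action.ne(prev).cumsum().rename("block")
    # A new group starts at every transaction that directly follows an undo
    group = (action.eq("transacted") & prev.eq("undo")).cumsum()
    # Undo rows count forwards in their run, transactions count backwards
    forward = df.groupby(["action", block]).cumcount()
    backward = df.loc[::-1].groupby(["action", block]).cumcount()
    new = forward.mask(action.eq("transacted"), backward)
    # A (group, new) pair that occurs twice is a transaction plus its undo
    paired = pd.DataFrame({"group": group, "new": new}).duplicated(keep=False)
    return df[~paired].reset_index(drop=True)

df = pd.DataFrame({
    "action": ["transacted", "transacted", "transacted", "transacted", "undo",
               "transacted", "transacted", "transacted", "transacted", "transacted",
               "undo", "undo", "undo", "transacted"],
    "transaction_count": [10, 20, 35, 60, 60, 60, 80, 90, 100, 10, 10, 100, 90, 90],
})
print(drop_undone(df))

# Hypothetical case with a count repeated inside one block
repeated = pd.DataFrame({
    "action": ["transacted", "transacted", "transacted", "undo"],
    "transaction_count": [10, 20, 10, 10],
})
print(drop_undone(repeated))
```

Because the pairing is positional, the second call keeps the first transaction with count 10 and drops only the last one together with its undo.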
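
df['is_undone']=0
# Label each run of consecutive identical actions with its own block id
for k, v in df.groupby((df['action'].shift() != df['action']).cumsum()):
    # An undo block of n rows cancels itself and the n rows right before it
    if v['action'].max()=='undo':
        df.loc[v['action'].index[0]-v['action'].count():v['action'].index[v['action'].count()-1],'is_undone']=1
# Keep only the rows that were never undone
df=df.loc[df['is_undone']==0]
df.drop('is_undone',axis=1,inplace=True)
df.reset_index(drop=True,inplace=True)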

I found this article on grouping consecutive blocks. The code runs in under a second, which is good enough for me.
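
For anyone unfamiliar with the consecutive-block idiom used in the loop, here is a small sketch of what the grouping key looks like for the sample action column. The names block and summary are only illustrative.

```python
import pandas as pd

action = pd.Series(
    ["transacted", "transacted", "transacted", "transacted", "undo",
     "transacted", "transacted", "transacted", "transacted", "transacted",
     "undo", "undo", "undo", "transacted"],
    name="action",
)

# A row starts a new block whenever it differs from the previous row;
# the running sum of those change points gives one id per block
block = (action.shift() != action).cumsum()

# One line per block: which action it holds and how many rows it spans
summary = action.groupby(block).agg(["first", "size"])
print(summary)
```

Here the undo blocks have sizes 1 and 3, which is how many preceding rows the loop marks as undone for each of them.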