Flagging a row according to a varying number of neighbour values in a pandas dataframe
I have a transactional process that produces a feed like this:
import pandas as pd
df = pd.DataFrame({'action': ['transacted','transacted','transacted','transacted','undo','transacted','transacted','transacted','transacted','transacted','undo','undo','undo','transacted'],
                   'transaction_count': [10,20,35,60,60,60,80,90,100,10,10,100,90,90]})
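| | action | transaction_count |
|---|---|---|
| 0 | transacted | 10 |
| 1 | transacted | 20 |
| 2 | transacted | 35 |
| 3 | transacted | 60 |
| 4 | undo | 60 |
| 5 | transacted | 60 |
| 6 | transacted | 80 |
| 7 | transacted | 90 |
| 8 | transacted | 100 |
| 9 | transacted | 10 |
| 10 | undo | 10 |
| 11 | undo | 100 |
| 12 | undo | 90 |
| 13 | transacted | 90 |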
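The counts follow a regular pattern, but not a linear one (10-20-35-60-80-90-100-10-20...).
An undo row states which transaction count was cancelled.
Several cancellations in a row produce several undo rows.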
# This is an initial apply, to set it up
df['is_undone']=df.apply(lambda x: 1 if x['action']=='undo' else 0, axis=1).shift(-1)
df=df.fillna(0) # For shift
df=df.loc[df['is_undone']==0]
df=df.fillna(0)
df=df.loc[df['action']!='undo']
df.reset_index(drop=True,inplace=True)
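Unfortunately, this only works for a single undo, not for several consecutive undos. apply does not give access to neighbouring row values, and I can't think of any other solution. It also needs to handle about 300k rows, so performance is a concern as well.
The expected result is: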
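| | action | transaction_count |
|---|---|---|
| 0 | transacted | 10 |
| 1 | transacted | 20 |
| 2 | transacted | 35 |
| 3 | transacted | 60 |
| 4 | transacted | 80 |
| 5 | transacted | 90 |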
Thanks in advance!
A slight modification of this excellent answer might give you what you want:
Solution
def undo(frame):
    d = {"transacted": 0, "undo": 1}
    condition = (frame["action"].map(d)
                 .rolling(2, 2).apply(lambda x: x.to_list() == [0, 1])
                 .mask(lambda x: x == 0)
                 .bfill(limit=1)
                 .fillna(0))
    return frame[condition == 0].reset_index(drop=True)
result = df.groupby("transaction_count").apply(undo).reset_index(drop=True)
>>> result
action transaction_count
0 transacted 10
1 transacted 20
2 transacted 35
3 transacted 60
4 transacted 80
5 transacted 90
Explanation
groupby is used to process each transaction_count separately.
For example, consider the case where transaction_count is 10.
frame = df[df["transaction_count"]==10]
>>> frame
action transaction_count
0 transacted 10
9 transacted 10
10 undo 10
In the undo function, we first map the action column to a number:
>>> frame["action"].map(d)
0 0
9 0
10 1
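Notice that we want to remove the rows where a 0 (transacted) is immediately followed by a 1 (undo). In the frame above, this corresponds to the rows with index 9 and 10.
To do this, we use Series.rolling with a lambda, processing 2 rows of the frame above at a time: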
>>> frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1])
0 NaN
9 0.0
10 1.0
Now, mask the 0s to np.nan, bfill (backfill) exactly once, and fillna with 0.
>>> frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1]).mask(lambda x: x==0).bfill(limit=1).fillna(0)
0 0.0
9 1.0
10 1.0
Name: action, dtype: float64
From the above, we want all rows that are not equal to 1.
This is what the undo function returns.
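The same "0 immediately followed by 1" flag can also be sketched with shift instead of rolling and a lambda. This is only an illustration under the same assumptions as the walkthrough above, not the code from the answer; the names `codes`, `pair` and `flag` are made up for this sketch.

```python
import pandas as pd

# Same three rows as the transaction_count == 10 frame above.
frame = pd.DataFrame(
    {"action": ["transacted", "transacted", "undo"],
     "transaction_count": [10, 10, 10]},
    index=[0, 9, 10],
)

codes = frame["action"].map({"transacted": 0, "undo": 1})

# True on an undo row whose previous row is a transacted row.
pair = codes.eq(1) & codes.shift().eq(0)

# Also flag the transacted row directly before each such undo.
flag = pair | pair.shift(-1, fill_value=False)

# Keep everything that is not part of a (transacted, undo) pair.
kept = frame[~flag]
print(kept)
```

Because it avoids calling a Python lambda for every window, this form may be faster on a large frame, though that has not been measured here.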
If transaction_count is unique within a block, groups can be created like this:
df['group'] = (df['action'].eq('transacted') &
df['action'].shift().eq('undo')).cumsum()
action transaction_count group
0 transacted 10 0
1 transacted 20 0
2 transacted 35 0
3 transacted 60 0
4 undo 60 0
5 transacted 60 1
6 transacted 80 1
7 transacted 90 1
8 transacted 100 1
9 transacted 10 1
10 undo 10 1
11 undo 100 1
12 undo 90 1
13 transacted 90 2
Then drop_duplicates can be used to remove the duplicated transaction_count values within each group:
df = (df.drop_duplicates(['transaction_count', 'group'], keep=False)
.drop('group', axis=1)
.reset_index(drop=True))
df:
action transaction_count
0 transacted 10
1 transacted 20
2 transacted 35
3 transacted 60
4 transacted 80
5 transacted 90
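To see why the uniqueness condition matters, here is a small sketch on a made-up feed (not from the question) in which count 10 appears twice in the same block before its undo. With keep=False, every row sharing that count and group is dropped, including the earlier transaction that was never cancelled.

```python
import pandas as pd

# Hypothetical feed: count 10 is transacted twice, then one of them is undone.
df = pd.DataFrame({
    "action": ["transacted", "transacted", "transacted", "undo"],
    "transaction_count": [10, 20, 10, 10],
})

# Same grouping as above: a new group starts at a transacted row after an undo.
df["group"] = (df["action"].eq("transacted")
               & df["action"].shift().eq("undo")).cumsum()

out = (df.drop_duplicates(["transaction_count", "group"], keep=False)
         .drop("group", axis=1)
         .reset_index(drop=True))
print(out)
```

Only the row with count 20 survives here, although the first transaction with count 10 was presumably meant to stay.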
If not, groups can be created by counting in reverse, so that each transaction and the undo that cancels it get matching IDs:
s = df['action'].shift()
m = df['action'].ne(s).cumsum()
df['group'] = (df['action'].eq('transacted') & s.eq('undo')).cumsum()
df['new'] = (
df.groupby(['action', m]).cumcount()
.mask(df['action'].eq('transacted'),
df.loc[::-1].groupby(['action', m]).cumcount())
)
action transaction_count group new
0 transacted 10 0 3
1 transacted 20 0 2
2 transacted 35 0 1
3 transacted 60 0 0
4 undo 60 0 0
5 transacted 60 1 4
6 transacted 80 1 3
7 transacted 90 1 2 # Matches Undo 2
8 transacted 100 1 1 # Matches Undo 1
9 transacted 10 1 0 # Matches Undo 0
10 undo 10 1 0 # Undo 0
11 undo 100 1 1 # Undo 1
12 undo 90 1 2 # Undo 2
13 transacted 90 2 0
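The counter in the new column can be reproduced in a slightly different way, which may make the idea easier to follow: number the rows of each run of identical actions, counting forwards for undo runs and backwards for transacted runs. This is a sketch rather than the answer's own code; `run_id`, `forward` and `backward` are names chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "action": ["transacted", "transacted", "transacted", "transacted", "undo",
               "transacted", "transacted", "transacted", "transacted",
               "transacted", "undo", "undo", "undo", "transacted"],
    "transaction_count": [10, 20, 35, 60, 60, 60, 80, 90, 100, 10,
                          10, 100, 90, 90],
})

# A new run starts whenever the action differs from the previous row.
run_id = df["action"].ne(df["action"].shift()).cumsum()

# Position inside each run, counted from the start and from the end.
forward = df.groupby(run_id).cumcount()
backward = df.groupby(run_id).cumcount(ascending=False)

# Undo rows count forwards, transacted rows count backwards, so the k-th
# undo of a run lines up with the k-th transaction from the end of the
# preceding run.
new = forward.where(df["action"].eq("undo"), backward)
print(new.tolist())
```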
Then duplicates can be dropped on the combination of group and new:
df = (df
.drop_duplicates(['group', 'new'], keep=False)
.drop(['group', 'new'], axis=1)
.reset_index(drop=True))
action transaction_count
0 transacted 10
1 transacted 20
2 transacted 35
3 transacted 60
4 transacted 80
5 transacted 90
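df['is_undone'] = 0
# Group runs of consecutive identical actions
for k, v in df.groupby((df['action'].shift() != df['action']).cumsum()):
    if v['action'].max() == 'undo':
        # For an undo run of length n, flag the run and the n rows before it
        # (relies on the default integer index)
        df.loc[v['action'].index[0]-v['action'].count():v['action'].index[v['action'].count()-1], 'is_undone'] = 1
df = df.loc[df['is_undone'] == 0]
df.drop('is_undone', axis=1, inplace=True)
df.reset_index(drop=True, inplace=True)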
I found this article on grouping consecutive blocks. The code runs in under a second, which is good enough for me.
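For cross-checking any of the approaches above on a small sample, a plain stack-based pass may be useful. It is a sketch that assumes each undo cancels the most recent transaction that has not been cancelled yet, which matches the example but is not stated as a rule in the question; `drop_undone` is a made-up helper name.

```python
import pandas as pd

def drop_undone(df):
    """Keep only transacted rows that are not cancelled by a later undo."""
    kept = []  # positions of transacted rows not cancelled so far
    for pos, action in enumerate(df["action"].to_numpy()):
        if action == "undo":
            if kept:          # ignore an undo with nothing left to cancel
                kept.pop()
        else:
            kept.append(pos)
    return df.iloc[kept].reset_index(drop=True)

df = pd.DataFrame({
    "action": ["transacted", "transacted", "transacted", "transacted", "undo",
               "transacted", "transacted", "transacted", "transacted",
               "transacted", "undo", "undo", "undo", "transacted"],
    "transaction_count": [10, 20, 35, 60, 60, 60, 80, 90, 100, 10,
                          10, 100, 90, 90],
})

result = drop_undone(df)
print(result)
```

The loop runs in pure Python, so it is meant as a reference for correctness rather than as the fastest option.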