Condition Based Custom Flag

Question

I have a dataset:

id	ref	name	conditionCol
1	123	a	no_error
1	456	b	error
1	789	c	no_error
2	231	d	no_error
2	312	e	no_error
2	546	f	no_error
3	645	g	no_error
3	879	h	no_error
4	789	i	error
4	978	j	error

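I am trying to create a custom error_flag column under these conditions:

For each unique value in the id column,
if any row in conditionCol contains the keyword error, then
every row for that id should have error_flag set to

yes

If, for a given value in the id column,
not a single row in conditionCol contains the keyword error, then
every row for that id should have error_flag set to

no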

For example, for id 1 every value of error_flag is yes, because the 2nd row of conditionCol for id 1 contains error:

id	ref	name	conditionCol	error_flag
1	123	a	no_error	yes
1	456	b	error	yes
1	789	c	no_error	yes

However, for id 2 every value of error_flag is no, because none of the conditionCol rows for id 2 contain error:

id	ref	name	conditionCol	error_flag
2	231	d	no_error	no
2	312	e	no_error	no
2	546	f	no_error	no

Similarly for id values 3 and 4:

id	ref	name	conditionCol	error_flag
3	645	g	no_error	no
3	879	h	no_error	no
4	789	i	error	yes
4	978	j	error	yes

The final output is:

id	ref	name	conditionCol	error_flag
1	123	a	no_error	yes
1	456	b	error	yes
1	789	c	no_error	yes
2	231	d	no_error	no
2	312	e	no_error	no
2	546	f	no_error	no
3	645	g	no_error	no
3	879	h	no_error	no
4	789	i	error	yes
4	978	j	error	yes

Update:

If you want to experiment with the dataset:

import pandas as pd
import numpy as np

id_col = [1,1,1,2,2,2,3,3,4,4]
ref_col = [123,456, 789, 231, 312, 546, 645, 879, 789, 978]
name_col = ['a','b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
conditionCol = ['no_error', 'error', 'no_error', 'no_error', 'no_error', 'no_error', 'no_error', 'no_error', 'error', 'error']
df = pd.DataFrame(zip(id_col, ref_col, name_col, conditionCol), columns=['id','ref','name','conditionCol'])
df

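Update 2: Is there a way to handle a threshold, i.e.:

Current problem: for each unique id, if the keyword error appears at least once in conditionCol, then error_flag is yes for all rows of that id.
Threshold variant: for each unique id, error_flag is yes for all rows of that id only if the keyword error appears in conditionCol at least 4 times (or at least 5 times).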

Answer 1

Use numpy.where with a mask that tests whether each id has at least one error value:

m = df['id'].isin(df.loc[df['conditionCol'].eq('error'), 'id'])
#alternative
#m = df['conditionCol'].eq('error').groupby(df['id']).transform('any')
df['error_flag'] = np.where(m, 'yes', 'no')

print (df)
   id  ref name conditionCol error_flag
0   1  123    a     no_error        yes
1   1  456    b        error        yes
2   1  789    c     no_error        yes
3   2  231    d     no_error         no
4   2  312    e     no_error         no
5   2  546    f     no_error         no
6   3  645    g     no_error         no
7   3  879    h     no_error         no
8   4  789    i        error        yes
9   4  978    j        error        yes
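
The answer above does not cover the threshold variant from Update 2. One possible sketch, assuming the same column names as in the question, is to count the error rows per id with groupby plus transform('sum') and compare that count against a threshold. The helper name add_error_flag and its threshold parameter are hypothetical and only used for illustration; with threshold=1 it should behave like the transform('any') alternative shown earlier.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'ref': [123, 456, 789, 231, 312, 546, 645, 879, 789, 978],
    'name': list('abcdefghij'),
    'conditionCol': ['no_error', 'error', 'no_error', 'no_error', 'no_error',
                     'no_error', 'no_error', 'no_error', 'error', 'error'],
})


def add_error_flag(frame, threshold=1):
    """Hypothetical helper: mark every row of an id that has at least
    `threshold` rows equal to 'error' in conditionCol."""
    # 1 for rows that are exactly 'error', 0 otherwise.
    is_error = frame['conditionCol'].eq('error').astype(int)
    # Count the error rows per id and broadcast the count back to each row.
    error_count = is_error.groupby(frame['id']).transform('sum')
    out = frame.copy()
    out['error_flag'] = np.where(error_count >= threshold, 'yes', 'no')
    return out


flag_1 = add_error_flag(df, threshold=1)  # at least one error per id
flag_2 = add_error_flag(df, threshold=2)  # at least two errors per id
print(flag_2)
```

In the sample data no id has 4 or more error rows, so a threshold of 4 or 5 would flag nothing; threshold=2 is used here only so that the effect is visible (id 4 is the only id with two error rows).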

Tags: eda, python-3.x, pandas, data-wrangling