Python pandas if statement based on two conditions
# Code to re-create my example data
import pandas as pd

df = pd.DataFrame({'customer_id': ['abc','abc','xyz','xyz','xyz','xyz','thr','thr','abc','abc','urt','urt'],
                   'transaction_id': ['A123','A123','B345','B345','C567','C567','D678','D678','E789','E789','D903','F865'],
                   'product_id': [255472, 251235, 253764,257344,221577,209809,223551,290678,908354,909238,436758,346577],
                   'product_category': ['X','X','Y','Y','X','Y','Y','X','Y','Z','X','X']})
# Example data
customer_id | transaction_id | product_id | product_category
abc         | A123           | 255472     | X
abc         | A123           | 251235     | X
xyz         | B345           | 253764     | Y
xyz         | B345           | 257344     | Y
xyz         | C567           | 221577     | X
xyz         | C567           | 209809     | Y
thr         | D678           | 223551     | Y
thr         | D678           | 290678     | X
abc         | E789           | 908354     | Y
abc         | E789           | 909238     | Z
urt         | D903           | 436758     | X
urt         | F865           | 346577     | X
I want to flag every customer_id that has X and Y in different transactions (not within the same transaction).
# Expected output
customer_id | transaction_id | product_id | product_category | flag
abc         | A123           | 255472     | X                | 1
abc         | A123           | 251235     | X                | 1
xyz         | B345           | 253764     | Y                | 0
xyz         | B345           | 257344     | Y                | 0
xyz         | C567           | 221577     | X                | 0
xyz         | C567           | 209809     | Y                | 0
thr         | D678           | 223551     | Y                | 0
thr         | D678           | 290678     | X                | 0
abc         | E789           | 908354     | Y                | 1
abc         | E789           | 909238     | Z                | 1
urt         | D903           | 436758     | X                | 0
urt         | F865           | 346577     | X                | 0
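I can't think of a clean solution. In the example above, customer abc has one transaction containing only product category X and another containing product categories Y and Z. This is the kind of customer I want to flag: they have both X and Y, but under different transaction_ids.
One approach I considered is to reuse the code from my previous answer: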
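# sorted() keeps the joined label deterministic (e.g. always 'X & Y', never 'Y & X')
df['pre_flag'] = df.groupby('transaction_id')['product_category'].transform(lambda x: x + ' only' if len(set(x)) < 2 else ' & '.join(sorted(set(x))))
Then split the dataset in two:
# The label must match the ' & ' separator used in the join above
df_1 = df.loc[df['pre_flag'] == 'X & Y'].copy()
df_2 = df.loc[df['pre_flag'] != 'X & Y'].copy()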
... and then use an isin statement, but this is messy; there must be a better way. Thanks!
Here is one way, using groupby and pd.Series.apply.
import pandas as pd

df = pd.DataFrame({'customer_id': ['abc','abc','xyz','xyz','xyz','xyz','thr','thr','abc','abc','urt','urt'],
                   'transaction_id': ['A123','A123','B345','B345','C567','C567','D678','D678','E789','E789','D903','F865'],
                   'product_id': [255472, 251235, 253764,257344,221577,209809,223551,290678,908354,909238,436758,346577],
                   'product_category': ['X','X','Y','Y','X','Y','Y','X','Y','Z','X','X']})
# For each (customer, transaction), collect the set of categories, keeping only X and Y
g = df.groupby(['customer_id', 'transaction_id'])['product_category']\
      .apply(lambda x: {i for i in x if i in ('X', 'Y')}).reset_index()
# A customer qualifies if one transaction reduces to exactly {'X'} and another to exactly {'Y'}
g2 = g.groupby('customer_id')['product_category']\
      .apply(list).apply(lambda x: ({'X'} in x) and ({'Y'} in x))
print(g2)
# customer_id
# abc True
# thr False
# urt False
# xyz False
# Name: product_category, dtype: bool
df['flag'] = df['customer_id'].isin(g2[g2].index)
print(df)
# customer_id product_category product_id transaction_id flag
# 0 abc X 255472 A123 True
# 1 abc X 251235 A123 True
# 2 xyz Y 253764 B345 False
# 3 xyz Y 257344 B345 False
# 4 xyz X 221577 C567 False
# 5 xyz Y 209809 C567 False
# 6 thr Y 223551 D678 False
# 7 thr X 290678 D678 False
# 8 abc Y 908354 E789 True
# 9 abc Z 909238 E789 True
# 10 urt X 436758 D903 False
# 11 urt X 346577 F865 False
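The same rule can also be sketched with boolean columns instead of Python sets, which avoids the row-wise lambdas. This is only an alternative under the same reading of the requirement as the answer above (a customer qualifies when one transaction contains X but not Y and another contains Y but not X, ignoring other categories). The names has_x, has_y, per_txn, x_only, y_only and qualifies are made up for this illustration, and product_id is left out because it plays no part in the rule. The flag is cast to int so it matches the 1/0 values in the expected output.

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': ['abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz',
                    'thr', 'thr', 'abc', 'abc', 'urt', 'urt'],
    'transaction_id': ['A123', 'A123', 'B345', 'B345', 'C567', 'C567',
                       'D678', 'D678', 'E789', 'E789', 'D903', 'F865'],
    'product_category': ['X', 'X', 'Y', 'Y', 'X', 'Y',
                         'Y', 'X', 'Y', 'Z', 'X', 'X'],
})

# One row per (customer, transaction): does it contain any X? any Y?
per_txn = (
    df.assign(has_x=df['product_category'].eq('X'),
              has_y=df['product_category'].eq('Y'))
      .groupby(['customer_id', 'transaction_id'])[['has_x', 'has_y']]
      .any()
)

# Transactions that contain X without Y, and Y without X
x_only = per_txn['has_x'] & ~per_txn['has_y']
y_only = per_txn['has_y'] & ~per_txn['has_x']

# A customer qualifies with at least one transaction of each kind
qualifies = (x_only.groupby(level='customer_id').any()
             & y_only.groupby(level='customer_id').any())

# Broadcast the per-customer result back to the rows as 1/0
df['flag'] = df['customer_id'].map(qualifies).astype(int)
print(df)
```

If True/False is preferred, as in the answer's output, drop the astype(int) call.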