Condition between duplicated values in a column
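Customers appear more than once in my data because each customer can have several plans. I want to set a status per customer:
If canceled_at is filled in for every product, the customer's status is 'canceled'. If canceled_at is filled in for at least one product but not for all of them, the status is 'downgrade', because the customer lost a product.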
customer|canceled_at|status
x |3/27/2018 |
x | |
y |2/2/2018 |
y |2/2/2018 |
z |1/1/2018 |
a | |
I already have the canceled status; now I only need downgrade:
df['status']=(df.groupby('customer')['canceled_at'].
transform(lambda x: x.notna().all()).map({True:'canceled'})).fillna(df.status)
Desired output:
customer|canceled_at|status
x |3/27/2018 |downgrade
x | |downgrade
y |2/2/2018 |canceled
y |2/2/2018 |canceled
z |1/1/2018 |canceled
a | |
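You can test the column for non-missing values with Series.notna, group the resulting boolean Series by the customer column, and use GroupBy.transform with GroupBy.all and GroupBy.any to test whether all values are True (all non-missing) or at least one value is True (any non-missing). Then pass both masks to numpy.select: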
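import numpy as np

g = df['canceled_at'].notna().groupby(df['customer'])
m1 = g.transform('all')  # True where every row of the customer has canceled_at
m2 = g.transform('any')  # True where at least one row of the customer has canceled_at
# Conditions are checked in order, so m1 takes precedence over m2.
# With string choices, older NumPy turns the np.nan default into the string 'nan' (see output);
# newer NumPy versions may raise a TypeError for the mixed dtypes instead.
df['status'] = np.select([m1, m2],['canceled','downgrade'], np.nan)
print(df)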
customer canceled_at status
0 x 3/27/2018 downgrade
1 x NaN downgrade
2 y 2/2/2018 canceled
3 y 2/2/2018 canceled
4 z 1/1/2018 canceled
5 a NaN nan
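Note that the last row above holds the string 'nan', not a real missing value. If a true NaN is wanted there, one possible variant (a sketch, not part of the original answer) builds the column with Series.where and Series.mask instead of numpy.select. The sample DataFrame is reconstructed from the question's table.

```python
import pandas as pd

# Sample data mirroring the question's table
df = pd.DataFrame({
    'customer': ['x', 'x', 'y', 'y', 'z', 'a'],
    'canceled_at': ['3/27/2018', None, '2/2/2018', '2/2/2018', '1/1/2018', None],
})

g = df['canceled_at'].notna().groupby(df['customer'])
m1 = g.transform('all')  # every row of the customer has a date
m2 = g.transform('any')  # at least one row of the customer has a date

# Start from 'downgrade' where any date exists (NaN elsewhere),
# then overwrite with 'canceled' where all dates exist.
status = pd.Series('downgrade', index=df.index).where(m2).mask(m1, 'canceled')
df['status'] = status
print(df)
```

With this variant, df['status'].isna() is True only for customer a, so the column can be filled or filtered later with the usual missing-value tools.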
Or:
df['status'] = np.select([m1, m2],['canceled','downgrade'], '')
print (df)
customer canceled_at status
0 x 3/27/2018 downgrade
1 x NaN downgrade
2 y 2/2/2018 canceled
3 y 2/2/2018 canceled
4 z 1/1/2018 canceled
5 a NaN
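For reuse, the empty-string variant can be wrapped in a small function. This is only a sketch: the function name add_status and its empty_label parameter are hypothetical, and the sample DataFrame is reconstructed from the question's table.

```python
import numpy as np
import pandas as pd


def add_status(frame, empty_label=''):
    """Return a copy of frame with a per-customer 'status' column.

    add_status and empty_label are illustrative names, not from the answer.
    """
    g = frame['canceled_at'].notna().groupby(frame['customer'])
    all_filled = g.transform('all')  # every row of the customer has a date
    any_filled = g.transform('any')  # at least one row has a date
    out = frame.copy()
    # The first matching condition wins, so 'canceled' is checked before 'downgrade'.
    out['status'] = np.select([all_filled, any_filled],
                              ['canceled', 'downgrade'],
                              default=empty_label)
    return out


df = pd.DataFrame({
    'customer': ['x', 'x', 'y', 'y', 'z', 'a'],
    'canceled_at': ['3/27/2018', None, '2/2/2018', '2/2/2018', '1/1/2018', None],
})
result = add_status(df)
print(result)
```

Because the default is a string as well, the choices and the default share one dtype, which avoids the mixed-dtype issue of the np.nan default.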
If the groups containing only NaNs should also become downgrade:
mask = df['canceled_at'].notna().groupby(df['customer']).transform('all')
df['status'] = np.where(mask,'canceled','downgrade')
print (df)
customer canceled_at status
0 x 3/27/2018 downgrade
1 x NaN downgrade
2 y 2/2/2018 canceled
3 y 2/2/2018 canceled
4 z 1/1/2018 canceled
5 a NaN downgrade
Here is one approach:
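import pandas as pd

def select_status(canceled):
    # count() ignores missing values, so c is the number of filled-in dates
    c = canceled.count()
    if c == 0:
        status = ''           # nothing canceled
    elif c == len(canceled):
        status = 'canceled'   # every product canceled
    else:
        status = 'downgrade'  # some, but not all, products canceled
    # Broadcast the group's status to every row of the group
    return pd.Series(status, index=canceled.index)

df = pd.DataFrame({'customer': ['x', 'x', 'y', 'y', 'z', 'a'],
                   'canceled_at': ['3/27/2018', None, '2/2/2018', '2/2/2018', '1/1/2018', None]})
# group_keys=False keeps the original row index so the result aligns with df
# (recent pandas versions otherwise prepend the group key to the index)
df['status'] = df.groupby('customer', group_keys=False)['canceled_at'].apply(select_status)
print(df)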
Output:
customer canceled_at status
0 x 3/27/2018 downgrade
1 x None downgrade
2 y 2/2/2018 canceled
3 y 2/2/2018 canceled
4 z 1/1/2018 canceled
5 a None
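The same count-based logic can also be written without a Python-level function per group. The following is a sketch under the assumption that comparing the per-group count of non-missing values with the per-group size is equivalent to the branches in select_status; the names n_filled and n_rows are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customer': ['x', 'x', 'y', 'y', 'z', 'a'],
    'canceled_at': ['3/27/2018', None, '2/2/2018', '2/2/2018', '1/1/2018', None],
})

g = df.groupby('customer')['canceled_at']
n_filled = g.transform('count')  # non-missing canceled_at values per customer
n_rows = g.transform('size')     # total rows per customer

# Same three branches as select_status: none filled, all filled, otherwise.
df['status'] = np.select([n_filled == 0, n_filled == n_rows],
                         ['', 'canceled'],
                         default='downgrade')
print(df)
```

On large frames this tends to be faster than apply, since the grouping work is done by built-in aggregations rather than by calling a Python function once per customer.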