Filter out rows based on dates and flags from another column
Given a Python pandas DataFrame df:
Customer | date_transaction_id | first_flag | dollars
ABC 2015-10-11-123 Y 100
BCD 2015-03-05-872 N 150
BCD 2015-01-01-923 N -300
ABC 2015-04-04-910 N -100
ABC 2015-12-12-765 N -100
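Customer ABC above returned an item in April and then bought something in October. In my analysis, I need to treat their first positive transaction as the start of their relationship with the company. How do I exclude customer ABC's first transaction? Note that customer BCD is not a new customer, so none of its rows should be excluded.
So how do I exclude transactions dated before the one where first_flag is Y?
First, I take the date from date_transaction_id and format it as a date field.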
df['date'] = df['date_transaction_id'].astype(str).str[:10]
df['date']= pd.to_datetime(df['date'], format='%Y-%m-%d')
Then I sort by customer and date:
df = df.sort_values(['Customer', 'date'], ascending=[True, False])
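But now I'm stuck: how do I remove a customer's rows that are dated before the row where first_flag is Y? Note that a customer can have one, none, or several transactions before the transaction flagged Y.
This is the output I'm looking for: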
Customer | date | first_flag | dollars
ABC 2015-10-11 Y 100
ABC 2015-12-12 N -100
BCD 2015-01-01 N -300
BCD 2015-03-05 N 150
Here is a function that operates on each group of a groupby object.
def drop_before(obj):
    # Get the date(s) of the rows where first_flag == 'Y'
    y_date = obj.loc[obj.first_flag == 'Y', 'date']
    if y_date.empty:
        # No 'Y' row for this customer: return the group unchanged
        return obj
    else:
        # Drop rows that are dated before the 'Y' row and flagged 'N'
        cond1 = obj.date < y_date.iloc[0]
        cond2 = obj.first_flag == 'N'
        to_drop = obj[cond1 & cond2].index
        return obj.drop(to_drop)

res = df.groupby('Customer').apply(drop_before)
# Drop the 'Customer' level that apply adds and keep the original row index
res = res.set_index(res.index.get_level_values(1))
print(res)
Customer date_transaction_id first_flag dollars date
4 ABC 2015-12-12-765 N -100 2015-12-12
0 ABC 2015-10-11-123 Y 100 2015-10-11
1 BCD 2015-03-05-872 N 150 2015-03-05
2 BCD 2015-01-01-923 N -300 2015-01-01
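So groupby builds a separate DataFrame for each customer (one for ABC, one for BCD). When you then call apply, drop_before is run on each sub-frame and the results are stitched back together.
This assumes each customer has at most one row with first_flag == 'Y', which appears to be the case in your question.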
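If that assumption does not hold, or you would rather avoid apply, the same filter can be sketched with a grouped transform. This is only an illustration on the sample data from the question; the names y_dates, first_y and keep are mine, and it assumes the earliest 'Y' date is the cutoff when a customer has more than one.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer': ['ABC', 'BCD', 'BCD', 'ABC', 'ABC'],
    'date_transaction_id': ['2015-10-11-123', '2015-03-05-872', '2015-01-01-923',
                            '2015-04-04-910', '2015-12-12-765'],
    'first_flag': ['Y', 'N', 'N', 'N', 'N'],
    'dollars': [100, 150, -300, -100, -100],
})
df['date'] = pd.to_datetime(df['date_transaction_id'].str[:10], format='%Y-%m-%d')

# Keep the date only on 'Y' rows (NaT elsewhere), then broadcast each
# customer's earliest 'Y' date to all of that customer's rows.
y_dates = df['date'].where(df['first_flag'] == 'Y')
first_y = y_dates.groupby(df['Customer']).transform('min')

# Customers without any 'Y' row get NaT and keep every row.
keep = first_y.isna() | (df['date'] >= first_y)
result = df[keep].sort_values(['Customer', 'date'])
print(result[['Customer', 'date', 'first_flag', 'dollars']])
```

On the sample data this keeps rows 0 and 4 for ABC and both BCD rows.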
# Convert `date_transaction_id` to date timestamp.
df = df.assign(transaction_date=pd.to_datetime(df['date_transaction_id'].str[:10]))
# Find first transaction dates by customer.
first_transactions = (
df[df['first_flag'] == 'Y']
.groupby(['Customer'], as_index=False)['transaction_date']
.min())
# Merge first transaction date to dataframe.
df = df.merge(first_transactions, how='left', on='Customer', suffixes=['', '_first'])
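# Filter data and select relevant columns. Customers without a 'Y' row get
# NaT from the left merge, and comparisons with NaT are False, so keep them explicitly.
>>> (df[(df['transaction_date'] >= df['transaction_date_first'])
        | df['transaction_date_first'].isna()]
     .sort_values(['Customer', 'transaction_date'])
     [['Customer', 'transaction_date', 'first_flag', 'dollars']])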
Customer transaction_date first_flag dollars
0 ABC 2015-10-11 Y 100
4 ABC 2015-12-12 N -100
2 BCD 2015-01-01 N -300
1 BCD 2015-03-05 N 150
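One thing to watch with the merge approach is how customers without a 'Y' row behave: the left merge fills their first-transaction date with NaT, and any comparison against NaT evaluates to False. A small check on a cut-down copy of the sample data (the variable names on_or_after and no_flag are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Customer': ['ABC', 'BCD', 'BCD', 'ABC', 'ABC'],
    'transaction_date': pd.to_datetime(['2015-10-11', '2015-03-05', '2015-01-01',
                                        '2015-04-04', '2015-12-12']),
    'first_flag': ['Y', 'N', 'N', 'N', 'N'],
})

# Earliest 'Y' date per customer; BCD has none, so it is absent here.
first = (df[df['first_flag'] == 'Y']
         .groupby('Customer', as_index=False)['transaction_date'].min())
merged = df.merge(first, how='left', on='Customer', suffixes=('', '_first'))

on_or_after = merged['transaction_date'] >= merged['transaction_date_first']
no_flag = merged['transaction_date_first'].isna()

print(on_or_after.tolist())              # [True, False, False, False, True]
print((on_or_after | no_flag).tolist())  # [True, True, True, False, True]
```

The first mask alone would silently drop both BCD rows, which is why the NaT case needs its own condition.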
To answer your first question, about excluding the first transaction of customers that have a 'Y' flag:
import pandas as pd
df = pd.DataFrame([['ABC', '2015-10-11', 'Y', 100],
                   ['BCD', '2015-03-05', 'N', 150],
                   ['BCD', '2015-01-01', 'N', -300],
                   ['ABC', '2015-04-04', 'N', -100],
                   ['ABC', '2015-12-12', 'N', -100]],
                  columns=['Customer', 'date', 'first_flag', 'dollars'])
# Extract the original columns
cols = df.columns
# Create a label column of whether the customer has a 'Y' flag
df['is_new'] = df.groupby('Customer')['first_flag'].transform('max')
# Apply the desired function, ie. dropping the first transaction
# to the matching records, drop index columns in the end
new_customers = (df[df['is_new'] == 'Y']
                 .sort_values(by=['Customer', 'date'])
                 .groupby('Customer', as_index=False)
                 .apply(lambda x: x.iloc[1:]).reset_index()
                 [cols])
# Extract the rest
old_customers = df[df['is_new'] != 'Y']
# Concat the transformed and untouched records together
pd.concat([new_customers, old_customers])[cols]
Output:
Customer | date | first_flag | dollars
ABC 2015-10-11 Y 100
ABC 2015-12-12 N -100
BCD 2015-01-01 N -300
BCD 2015-03-05 N 150
df
Customer date_transaction_id first_flag dollars
0 ABC 2015-10-11-123 Y 100
1 BCD 2015-03-05-872 N 150
2 BCD 2015-01-01-923 N -300
3 ABC 2015-04-04-910 N -100
4 ABC 2015-12-12-765 N -100
df['date'] = pd.to_datetime(df['date_transaction_id']\
                  .astype(str).str[:10], format='%Y-%m-%d')
# Pass axis by keyword; the positional form drop(label, 1) is rejected by recent pandas
df = df.sort_values(['Customer', 'date'])\
       .drop('date_transaction_id', axis=1)
df
Customer first_flag dollars date
3 ABC N -100 2015-04-04
0 ABC Y 100 2015-10-11
4 ABC N -100 2015-12-12
2 BCD N -300 2015-01-01
1 BCD N 150 2015-03-05
First, replace first_flag with integer values.
df.first_flag = df.first_flag.replace({'N' : 0, 'Y' : 1})
Now groupby on Customer and compare the cumsum of first_flag with its max.
df = df.groupby('Customer')[['date', 'first_flag', 'dollars']]\
       .apply(lambda x: x[x.first_flag.cumsum() == x.first_flag.max()])\
       .reset_index(level=0)
df
Customer date first_flag dollars
0 ABC 2015-10-11 1 100
4 ABC 2015-12-12 0 -100
2 BCD 2015-01-01 0 -300
1 BCD 2015-03-05 0 150
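To see why the cumsum comparison works, it can help to look at the intermediate columns. The sketch below is my own restatement on already sorted sample data, using grouped cumsum and transform instead of apply; the names is_first, running and group_max are made up for the illustration.

```python
import pandas as pd

# Sample rows, already sorted by Customer and date
df = pd.DataFrame({
    'Customer': ['ABC', 'ABC', 'ABC', 'BCD', 'BCD'],
    'date': pd.to_datetime(['2015-04-04', '2015-10-11', '2015-12-12',
                            '2015-01-01', '2015-03-05']),
    'first_flag': ['N', 'Y', 'N', 'N', 'N'],
    'dollars': [-100, 100, -100, -300, 150],
})

is_first = df['first_flag'].eq('Y').astype(int)
# Running count of 'Y' rows seen so far within each customer
running = is_first.groupby(df['Customer']).cumsum()
# 1 if the customer has a 'Y' row at all, otherwise 0
group_max = is_first.groupby(df['Customer']).transform('max')

print(pd.DataFrame({'Customer': df['Customer'],
                    'running': running, 'group_max': group_max}))
result = df[running == group_max]
```

For ABC the running count is 0, 1, 1 against a max of 1, so the row before the 'Y' is dropped; for BCD both are 0 everywhere, so nothing is dropped. Comparing with >= instead of == would also keep rows after a second 'Y', should a customer ever have more than one.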
Optional: convert the integer values back to the old Y/N with
df.first_flag = df.first_flag.replace({0 : 'N', 1 : 'Y'})
df
Customer date first_flag dollars
0 ABC 2015-10-11 Y 100
4 ABC 2015-12-12 N -100
2 BCD 2015-01-01 N -300
1 BCD 2015-03-05 N 150
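The setup is the same as in cᴏʟᴅsᴘᴇᴇᴅ's answer; in my answer I use idxmax.
Setup
import functools
df['date'] = pd.to_datetime(df['date_transaction_id']\
                  .astype(str).str[:10], format='%Y-%m-%d')
df = df.sort_values(['Customer', 'date']).replace({'N': 0, 'Y': 1}).reset_index(drop=True)
# Per customer: True for rows at or after the first 'Y' (idxmax returns the first row if there is no 'Y')
L = df.groupby('Customer')['first_flag'].apply(lambda x: x.index >= x.idxmax()).apply(list).values.tolist()
# Flatten the per-customer lists into one boolean mask
L = functools.reduce(lambda x, y: x + y, L)
df[L]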
Out[278]:
Customer date_transaction_id first_flag dollars date
1 ABC 2015-10-11-123 1 100 2015-10-11
2 ABC 2015-12-12-765 0 -100 2015-12-12
3 BCD 2015-01-01-923 0 -300 2015-01-01
4 BCD 2015-03-05-872 0 150 2015-03-05
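The list flattening with functools.reduce can probably be avoided by broadcasting each customer's idxmax back to its rows. A rough sketch, assuming the frame is sorted by Customer and date and re-indexed 0..n-1 as in the setup above (first_idx and keep are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'Customer': ['ABC', 'BCD', 'BCD', 'ABC', 'ABC'],
    'date': pd.to_datetime(['2015-10-11', '2015-03-05', '2015-01-01',
                            '2015-04-04', '2015-12-12']),
    'first_flag': ['Y', 'N', 'N', 'N', 'N'],
    'dollars': [100, 150, -300, -100, -100],
})
# The positional comparison below relies on this sort and the fresh 0..n-1 index
df = df.sort_values(['Customer', 'date']).reset_index(drop=True)

flag = df['first_flag'].eq('Y').astype(int)
# Index label of each customer's first 'Y' row, repeated on every row of that
# customer; with no 'Y' at all, idxmax falls back to the customer's first row.
first_idx = flag.groupby(df['Customer']).transform(lambda s: s.idxmax())

keep = df.index.to_numpy() >= first_idx.to_numpy()
result = df[keep]
print(result)
```

As with the original, a customer with no 'Y' keeps all rows only because idxmax returns the first label when every value is 0.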