Filter out rows based on dates and flags from another column

Question

Using a Python pandas DataFrame df:

Customer | date_transaction_id    | first_flag | dollars
ABC        2015-10-11-123              Y         100
BCD        2015-03-05-872              N         150
BCD        2015-01-01-923              N         -300
ABC        2015-04-04-910              N         -100
ABC        2015-12-12-765              N         -100

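Customer ABC above made a return in April and then bought something in October (2015-10-11). For my analysis, I need to count their first positive transaction as their first transaction with the company. How do I exclude customer ABC's first transaction? Note that customer BCD is not a new customer, so none of its rows should be excluded.

In other words, how do I exclude transactions dated before the one where first_flag is Y?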

First, I extract the date from date_transaction_id and format it as a date field.

df['date'] = df['date_transaction_id'].astype(str).str[:10]
df['date']= pd.to_datetime(df['date'], format='%Y-%m-%d')

Then I sort by customer and date:

df = df.sort_values(['Customer', 'date'], ascending=[True, False])

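But now I'm stuck: how do I remove, for each customer, the rows dated before the row where first_flag is Y? Note that a customer can have one, none, or multiple transactions before the transaction flagged Y.

This is the output I'm looking for: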

Customer | date       | first_flag | dollars
ABC        2015-10-11      Y         100
ABC        2015-12-12      N         -100
BCD        2015-01-01      N         -300
BCD        2015-03-05      N         150

Answer 1

Here is a function that operates on each group of a groupby object.

def drop_before(obj):
    # Get the date(s) of the rows where first_flag == 'Y'
    y_date = obj[obj.first_flag == 'Y'].date
    if y_date.empty:
        # No flagged row for this customer: return the group unchanged
        return obj
    else:
        # Find where your two drop conditions are satisfied
        # (.iloc[0] picks the first flagged date regardless of its index label)
        cond1 = obj.date < y_date.iloc[0]
        cond2 = obj.first_flag == 'N'
        to_drop = obj[cond1 & cond2].index
        return obj.drop(to_drop)

res = df.groupby('Customer').apply(drop_before)
res = res.set_index(res.index.get_level_values(1))
print(res)
  Customer date_transaction_id first_flag  dollars       date
4      ABC      2015-12-12-765          N     -100 2015-12-12
0      ABC      2015-10-11-123          Y      100 2015-10-11
1      BCD      2015-03-05-872          N      150 2015-03-05
2      BCD      2015-01-01-923          N     -300 2015-01-01

So groupby builds a separate DataFrame for each customer (one for ABC, one for BCD). When you then use apply, drop_before is applied to each sub-frame, and the results are recombined.
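To make the splitting step concrete, here is a minimal sketch that iterates over the groups directly. The sample frame is copied from the question; the group_sizes dictionary is only an illustrative name.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Customer': ['ABC', 'BCD', 'BCD', 'ABC', 'ABC'],
    'first_flag': ['Y', 'N', 'N', 'N', 'N'],
    'dollars': [100, 150, -300, -100, -100],
})

# Iterating over a groupby yields (key, sub-frame) pairs, one per customer
group_sizes = {}
for customer, sub_frame in df.groupby('Customer'):
    group_sizes[customer] = len(sub_frame)
    print(customer, sub_frame.index.tolist())
```

Each sub_frame is what a function passed to apply receives as its argument.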

This assumes that each customer has at most one first_flag == 'Y', which seems to be the case in your question.
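If that assumption might not hold, one possible variation is to compare every row against the earliest flagged date of its customer. The following is only a sketch: drop_before_first_flag is a hypothetical helper name, and it assumes the frame already has a datetime date column.

```python
import pandas as pd

def drop_before_first_flag(frame):
    """Keep rows dated on or after each customer's earliest 'Y' row.

    Hypothetical helper; expects 'Customer', 'date' (datetime) and
    'first_flag' columns.
    """
    # Dates of 'Y' rows only (NaT elsewhere), reduced to the earliest per customer
    flagged_dates = frame['date'].where(frame['first_flag'] == 'Y')
    first_y = flagged_dates.groupby(frame['Customer']).transform('min')
    # Customers with no 'Y' row get NaT, so keep all of their rows
    keep = first_y.isna() | (frame['date'] >= first_y)
    return frame[keep].sort_values(['Customer', 'date'])

# Sample data copied from the question
df = pd.DataFrame({
    'Customer': ['ABC', 'BCD', 'BCD', 'ABC', 'ABC'],
    'date_transaction_id': ['2015-10-11-123', '2015-03-05-872', '2015-01-01-923',
                            '2015-04-04-910', '2015-12-12-765'],
    'first_flag': ['Y', 'N', 'N', 'N', 'N'],
    'dollars': [100, 150, -300, -100, -100],
})
df['date'] = pd.to_datetime(df['date_transaction_id'].str[:10], format='%Y-%m-%d')

res = drop_before_first_flag(df)
print(res[['Customer', 'date', 'first_flag', 'dollars']])
```

This avoids apply entirely, so it does not depend on how apply labels the combined result.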

Answer 2

# Convert `date_transaction_id` to date timestamp.
df = df.assign(transaction_date=pd.to_datetime(df['date_transaction_id'].str[:10]))

# Find first transaction dates by customer.
first_transactions = (
    df[df['first_flag'] == 'Y']
    .groupby(['Customer'], as_index=False)['transaction_date']
    .min())

# Merge first transaction date to dataframe.
df = df.merge(first_transactions, how='left', on='Customer', suffixes=['', '_first'])

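# Filter data and select relevant columns.
# Customers without a 'Y' row have NaT as their first date, and comparisons
# with NaT are False, so keep those rows explicitly.
>>> keep = (df['transaction_date_first'].isna()
...         | (df['transaction_date'] >= df['transaction_date_first']))
>>> (df[keep]
...  .sort_values(['Customer', 'transaction_date'])
...  [['Customer', 'transaction_date', 'first_flag', 'dollars']])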
  Customer transaction_date first_flag  dollars
0      ABC       2015-10-11          Y      100
4      ABC       2015-12-12          N     -100
2      BCD       2015-01-01          N     -300
1      BCD       2015-03-05          N      150
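One thing to keep in mind with a left merge like the one above: a customer with no 'Y' row ends up with NaT in transaction_date_first, and any comparison against NaT evaluates to False. A small sketch of that behavior with made-up values, plus one way to keep such rows:

```python
import pandas as pd

# Made-up values: the second customer has no 'Y' row, so its first date is NaT
transaction_date = pd.Series(pd.to_datetime(['2015-12-12', '2015-03-05']))
transaction_date_first = pd.Series(pd.to_datetime(['2015-10-11', None]))

# A plain comparison silently drops the customer whose first date is NaT
plain = transaction_date >= transaction_date_first

# OR-ing with isna() keeps customers that were never flagged
with_nat_kept = transaction_date_first.isna() | plain

print(plain.tolist())          # [True, False]
print(with_nat_kept.tolist())  # [True, True]
```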

Answer 3

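To answer your first question, excluding the first transaction of customers that have a 'Y' flag (note that this drops exactly one row per flagged customer, so it matches the expected output only when a single transaction precedes the flagged one):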

import pandas as pd
df = pd.DataFrame([['ABC','2015-10-11','Y',100],
                  ['BCD','2015-03-05','N',150],
                  ['BCD','2015-01-01','N',-300],
                  ['ABC','2015-04-04','N',-100],
                  ['ABC','2015-12-12','N', -100]], 
                  columns=['Customer','date', 'first_flag','dollars'])

# Extract the original columns
cols = df.columns

# Create a label column of whether the customer has a 'Y' flag
df['is_new'] = df.groupby('Customer')['first_flag'].transform('max')

# Apply the desired function, ie. dropping the first transaction
# to the matching records, drop index columns in the end

new_customers = (df[df['is_new'] == 'Y']
                 .sort_values(by=['Customer','date'])
                 .groupby('Customer',as_index=False)
                 .apply(lambda x: x.iloc[1:]).reset_index()
                 [cols])

# Extract the rest
old_customers = df[df['is_new'] != 'Y']

# Concat the transformed and untouched records together
pd.concat([new_customers, old_customers])[cols]

输出：

Customer | date       | first_flag | dollars
ABC        2015-10-11      Y         100
ABC        2015-12-12      N         -100
BCD        2015-01-01      N         -300
BCD        2015-03-05      N         150
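The is_new label in the code above works because 'Y' sorts after 'N' as a string, so the per-customer maximum is 'Y' whenever at least one row is flagged. A brief sketch of that step next to a boolean form that does not rely on string ordering (has_flag is an illustrative name, not part of the answer):

```python
import pandas as pd

# Customer and flag columns copied from the question
df = pd.DataFrame({
    'Customer': ['ABC', 'BCD', 'BCD', 'ABC', 'ABC'],
    'first_flag': ['Y', 'N', 'N', 'N', 'N'],
})

# String max: 'Y' > 'N', so any group containing a 'Y' is labelled 'Y'
is_new_label = df.groupby('Customer')['first_flag'].transform('max')

# Boolean alternative: True for every row of a customer with at least one 'Y'
has_flag = df['first_flag'].eq('Y').groupby(df['Customer']).transform('any')

print(is_new_label.tolist())  # ['Y', 'N', 'N', 'Y', 'Y']
print(has_flag.tolist())      # [True, False, False, True, True]
```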

Answer 4

df 

  Customer date_transaction_id first_flag  dollars
0      ABC      2015-10-11-123          Y      100
1      BCD      2015-03-05-872          N      150
2      BCD      2015-01-01-923          N     -300
3      ABC      2015-04-04-910          N     -100
4      ABC      2015-12-12-765          N     -100

df['date']= pd.to_datetime(df['date_transaction_id']\
                            .astype(str).str[:10], format='%Y-%m-%d')
df = df.sort_values(['Customer', 'date'])\
                            .drop(columns='date_transaction_id')

df 

  Customer first_flag  dollars       date
3      ABC          N     -100 2015-04-04
0      ABC          Y      100 2015-10-11
4      ABC          N     -100 2015-12-12
2      BCD          N     -300 2015-01-01
1      BCD          N      150 2015-03-05

First, replace first_flag with integer values.

df.first_flag = df.first_flag.replace({'N' : 0, 'Y' : 1})

Now group by Customer and keep the rows where the cumulative sum of first_flag equals its maximum within the group.

df = df.groupby('Customer')[['date', 'first_flag', 'dollars']]\
                .apply(lambda x: x[x.first_flag.cumsum() == x.first_flag.max()])\
                .reset_index(level=0)
df
  Customer       date  first_flag  dollars
0      ABC 2015-10-11           1      100
4      ABC 2015-12-12           0     -100
2      BCD 2015-01-01           0     -300
1      BCD 2015-03-05           0      150
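To see why the cumsum comparison selects the right rows, here is a small sketch on bare flag sequences. The two helper functions and the twice-flagged customer are hypothetical; the cummax variant is a suggestion for data with more than one flag per customer, not part of the answer above.

```python
import pandas as pd

one_flag = pd.Series([0, 1, 0])      # ABC sorted by date: N, Y, N
no_flag = pd.Series([0, 0])          # BCD: N, N
two_flags = pd.Series([0, 1, 0, 1])  # hypothetical customer flagged twice

def cumsum_mask(flags):
    # True where the running count of flags equals the group maximum
    return (flags.cumsum() == flags.max()).tolist()

def cummax_mask(flags):
    # True from the first flagged row onward, however many flags follow
    return (flags.cummax() == flags.max()).tolist()

print(cumsum_mask(one_flag))   # [False, True, True]
print(cumsum_mask(no_flag))    # [True, True]
print(cumsum_mask(two_flags))  # [False, True, True, False]
print(cummax_mask(two_flags))  # [False, True, True, True]
```

With at most one flag per customer the two masks agree; they differ only once a second flag pushes the running sum past the maximum.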

Optionally, replace the integer values with the original Y/N:

df.first_flag = df.first_flag.replace({0 : 'N', 1 : 'Y'})  
df

  Customer       date first_flag  dollars
0      ABC 2015-10-11          Y      100
4      ABC 2015-12-12          N     -100
2      BCD 2015-01-01          N     -300
1      BCD 2015-03-05          N      150

Answer 5

The setup is the same as in cᴏʟᴅsᴘᴇᴇᴅ's answer; in my answer I use idxmax.

Setup

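import functools

df['date']= pd.to_datetime(df['date_transaction_id']\
                            .astype(str).str[:10], format='%Y-%m-%d')
# Sort, map the flags to integers, and reset to a clean 0..n-1 index
df=df.sort_values(['Customer','date']).replace({'N' : 0, 'Y' : 1}).reset_index(drop=True)

# Per customer, a boolean list that is True from the first flagged row onward
# (idxmax returns the first row of the group when no row is flagged)
L=df.groupby('Customer')['first_flag'].apply(lambda x : x.index>=x.idxmax()).apply(list).values.tolist()
# Flatten the per-customer lists into a single row mask
L=functools.reduce(lambda x,y: x+y,L)
df[L]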


Out[278]: 
  Customer date_transaction_id  first_flag  dollars       date
1      ABC      2015-10-11-123           1      100 2015-10-11
2      ABC      2015-12-12-765           0     -100 2015-12-12
3      BCD      2015-01-01-923           0     -300 2015-01-01
4      BCD      2015-03-05-872           0      150 2015-03-05
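The same idxmax idea can also be written without building and flattening Python lists. The sketch below is one possible variation under the same setup (sorted frame, fresh integer index); the flag helper column and the first_idx name are assumptions made for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Customer': ['ABC', 'BCD', 'BCD', 'ABC', 'ABC'],
    'date_transaction_id': ['2015-10-11-123', '2015-03-05-872', '2015-01-01-923',
                            '2015-04-04-910', '2015-12-12-765'],
    'first_flag': ['Y', 'N', 'N', 'N', 'N'],
    'dollars': [100, 150, -300, -100, -100],
})
df['date'] = pd.to_datetime(df['date_transaction_id'].str[:10], format='%Y-%m-%d')
df = df.sort_values(['Customer', 'date']).reset_index(drop=True)

# 1 for 'Y', 0 for 'N' (helper column, name chosen for this sketch)
df['flag'] = df['first_flag'].eq('Y').astype(int)

# Index label of the first maximum per customer, broadcast back to every row;
# for a customer with no 'Y' this is simply the customer's first row
first_idx = df['Customer'].map(df.groupby('Customer')['flag'].idxmax())

# Valid only because the frame is sorted and has a fresh 0..n-1 index
mask = df.index.to_numpy() >= first_idx.to_numpy()
res = df[mask]
print(res[['Customer', 'date', 'first_flag', 'dollars']])
```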
