Iterate over a DataFrame to calculate new features - Python
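I'm working with a DataFrame of credit card transactions that has the following columns:
timestamp, transaction_id, buyer_id, status
I want to generate a new column, q_app_1d, that holds, for each transaction_id, the number of previous transactions that have the same buyer_id, have status = 1, and have a timestamp at most 1 day earlier.
I tried to do this with a self-join (joining the DataFrame with itself), but I couldn't get it to work.
I know how to do this easily in SQL, but I can't work out how to do it in pandas.
Any help or hints are greatly appreciated!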
Edit:
Sample input:
timestamp, transaction_id, buyer_id, status
01/01/2020 00:00:00, 1, abc123, 1
01/01/2020 00:25:00, 2, abc123, 1
01/01/2020 00:30:00, 3, abc123, 1
01/01/2020 00:45:00, 4, def456, 1
02/01/2020 08:55:00, 5, abc123, 1
02/01/2020 10:55:00, 6, def456, 1
03/01/2020 08:55:00, 7, def456, 1
Sample output:
timestamp, transaction_id, buyer_id, status, q_app_1d
01/01/2020 00:00:00, 1, abc123, 1, 0
01/01/2020 00:25:00, 2, abc123, 1, 1 #(considers transaction_id 1)
01/01/2020 00:30:00, 3, abc123, 1, 2 #(considers transaction_id 1,2)
01/01/2020 00:45:00, 4, def456, 1, 0
02/01/2020 08:55:00, 5, abc123, 1, 0 #(more than one day since transaction_id 3)
02/01/2020 10:55:00, 6, def456, 1, 0 #(more than one day since transaction_id 4)
03/01/2020 08:55:00, 7, def456, 1, 1 #(considers transaction_id 6)
This should work:
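import pandas as pd

# Parse day-first timestamps (02/01/2020 is 2 January); rows are assumed to be in timestamp order
df['timestamp'] = pd.to_datetime(df['timestamp'], dayfirst=True)
df = df.set_index('timestamp')
# Per buyer, count the rows in the trailing 24-hour window and subtract the current row.
# count() ignores the value of status, so this matches the sample only because every status is 1.
_df = (df.groupby('buyer_id')['status'].rolling('24h').count() - 1).reset_index()
_df.columns = ['buyer_id', 'timestamp', 'q_app_1d']
# Join the counts back onto the original rows
df = df.reset_index()
df = df.merge(_df, on=['buyer_id', 'timestamp'])
df.head(7)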
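For comparison, here is a minimal sketch of the self-join idea mentioned in the question, in case status can take values other than 1 or the "<= 1 day" boundary needs to be inclusive. It is not the method from the answer above. The helper name count_prior_approved is hypothetical, "previous" is assumed to mean a strictly earlier timestamp, and transaction 7 uses the 08:55 timestamp from the sample output. The join builds every pair of rows per buyer, so it is only practical for moderately sized data.

```python
import pandas as pd

# Sample rows from the question (transaction 7 uses the timestamp shown in the sample output)
rows = [
    ("01/01/2020 00:00:00", 1, "abc123", 1),
    ("01/01/2020 00:25:00", 2, "abc123", 1),
    ("01/01/2020 00:30:00", 3, "abc123", 1),
    ("01/01/2020 00:45:00", 4, "def456", 1),
    ("02/01/2020 08:55:00", 5, "abc123", 1),
    ("02/01/2020 10:55:00", 6, "def456", 1),
    ("03/01/2020 08:55:00", 7, "def456", 1),
]
df = pd.DataFrame(rows, columns=["timestamp", "transaction_id", "buyer_id", "status"])
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y %H:%M:%S")


def count_prior_approved(frame, window=pd.Timedelta(days=1)):
    """Hypothetical helper: per transaction, count earlier status == 1
    transactions of the same buyer that are at most `window` older."""
    # Right side of the self-join: approved transactions only
    prior = frame.loc[frame["status"] == 1, ["buyer_id", "transaction_id", "timestamp"]]
    # Pair every transaction with every approved transaction of the same buyer
    pairs = frame.merge(prior, on="buyer_id", suffixes=("", "_prior"))
    age = pairs["timestamp"] - pairs["timestamp_prior"]
    # Keep pairs where the prior row is strictly earlier and within the window (inclusive)
    in_window = (age > pd.Timedelta(0)) & (age <= window)
    counts = pairs.loc[in_window].groupby("transaction_id").size()
    # Transactions with no qualifying prior rows are absent from `counts`, so fill with 0
    return frame["transaction_id"].map(counts).fillna(0).astype(int)


df["q_app_1d"] = count_prior_approved(df)
print(df)
```

On the sample data this gives q_app_1d values of 0, 1, 2, 0, 0, 0, 1, matching the expected output in the question.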