Pandas index match without merge

Question

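I have two dataframes. activity_log records when a client logged in, keyed by client_id. A client_id can appear more than once if the client logged in several times over a period.

I need to create a third column in activity_log that holds the date the client was created. This created_date is the earliest user_created date for that client in user_table.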

activity_log

client_id	activity_date	created_date
1	12/12/2022
1	11/12/2022
1	9/12/2022
1	8/12/2022
2	12/12/2022
2	11/12/2022
3	10/12/2022
3	9/12/2022

user_table

client_id	user_id	user_created
1	12asdasd3	12/12/2021
1	1sads23	11/12/2021
1	asasdsa2	10/12/2021
2	32asdasd1	12/12/2021
2	3asdasd21	11/12/2021
3	1asdsaa22	2/12/2021

I tried using a pandas merge:

activity_log.merge(user_table[['client_id','user_created']], how='inner', on='client_id')

The problem is that I end up with a table bigger than the original activity_log, because client_id appears multiple times in activity_log and multiple times in user_table.
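To make the size increase concrete, here is a minimal sketch that rebuilds the two sample tables above and runs the same inner merge. The dates are kept as plain strings, since only the row counts matter here. Every activity row is paired with every user row for the same client_id, so the result has 4×3 + 2×2 + 2×1 = 18 rows instead of 8.

```python
import pandas as pd

# Sample data copied from the tables in the question.
activity_log = pd.DataFrame({
    "client_id": [1, 1, 1, 1, 2, 2, 3, 3],
    "activity_date": ["12/12/2022", "11/12/2022", "9/12/2022", "8/12/2022",
                      "12/12/2022", "11/12/2022", "10/12/2022", "9/12/2022"],
})
user_table = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2, 3],
    "user_id": ["12asdasd3", "1sads23", "asasdsa2",
                "32asdasd1", "3asdasd21", "1asdsaa22"],
    "user_created": ["12/12/2021", "11/12/2021", "10/12/2021",
                     "12/12/2021", "11/12/2021", "2/12/2021"],
})

# Many-to-many merge: each activity row matches every user row of its client.
merged = activity_log.merge(
    user_table[["client_id", "user_created"]], how="inner", on="client_id"
)

print(len(activity_log), len(merged))  # 8 18
```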

I want to look up each client_id in user_table, take the earliest user_created value, and put it in the created_date column of activity_log.

Any ideas on what else I need to do to achieve this?

Answer 1

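It sounds like you want the earliest event per client from usr_df (your user_table). You can get it with groupby and first after sorting by date. This assumes user_created has already been parsed as a datetime; sorting the raw strings would order them lexically rather than chronologically.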

df1 = usr_df.sort_values('user_created', ascending = True).groupby('client_id').first()

df1 looks like this:


             user_id user_created
client_id
1           asasdsa2   2021-10-12
2          3asdasd21   2021-11-12
3          1asdsaa22   2021-02-12
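For anyone who wants to reproduce this step, here is a minimal, self-contained sketch. It rebuilds usr_df from the question's user_table. It assumes the dates are month/day/year, which is consistent with the parsed values shown for df1 but is not stated in the question.

```python
import pandas as pd

# Sample data copied from the question's user_table.
usr_df = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2, 3],
    "user_id": ["12asdasd3", "1sads23", "asasdsa2",
                "32asdasd1", "3asdasd21", "1asdsaa22"],
    "user_created": ["12/12/2021", "11/12/2021", "10/12/2021",
                     "12/12/2021", "11/12/2021", "2/12/2021"],
})

# Assumption: month/day/year. Parse so that sorting is chronological.
usr_df["user_created"] = pd.to_datetime(usr_df["user_created"], format="%m/%d/%Y")

# Sort oldest first, then keep the first row of each client_id group.
df1 = (
    usr_df.sort_values("user_created", ascending=True)
          .groupby("client_id")
          .first()
)
print(df1)
```

The result is indexed by client_id and has exactly one row per client, which is what keeps the later merge from adding rows.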

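Now you can merge act_df (your activity_log) with this. Because df1 has exactly one row per client_id, the merge no longer adds rows: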

act_df.merge(df1, on = 'client_id')

Output:

      client_id  activity_date    user_id    user_created
--  -----------  ---------------  ---------  -------------------
 0            1  12/12/2022       asasdsa2   2021-10-12 00:00:00
 1            1  11/12/2022       asasdsa2   2021-10-12 00:00:00
 2            1  9/12/2022        asasdsa2   2021-10-12 00:00:00
 3            1  8/12/2022        asasdsa2   2021-10-12 00:00:00
 4            2  12/12/2022       3asdasd21  2021-11-12 00:00:00
 5            2  11/12/2022       3asdasd21  2021-11-12 00:00:00
 6            3  10/12/2022       1asdsaa22  2021-02-12 00:00:00
 7            3  9/12/2022        1asdsaa22  2021-02-12 00:00:00
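If you would rather avoid merge altogether, as the title asks, one possible alternative is to build a Series of the earliest date per client and look it up with Series.map. This is a sketch under the same assumptions as above (month/day/year dates; the names act_df, usr_df and earliest are only illustrative). It writes straight into a created_date column and does not bring in user_id.

```python
import pandas as pd

# Sample data copied from the question.
act_df = pd.DataFrame({
    "client_id": [1, 1, 1, 1, 2, 2, 3, 3],
    "activity_date": ["12/12/2022", "11/12/2022", "9/12/2022", "8/12/2022",
                      "12/12/2022", "11/12/2022", "10/12/2022", "9/12/2022"],
})
usr_df = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2, 3],
    "user_created": ["12/12/2021", "11/12/2021", "10/12/2021",
                     "12/12/2021", "11/12/2021", "2/12/2021"],
})

# Assumption: month/day/year, so that min() compares real dates.
usr_df["user_created"] = pd.to_datetime(usr_df["user_created"], format="%m/%d/%Y")

# One earliest date per client: a Series indexed by client_id.
earliest = usr_df.groupby("client_id")["user_created"].min()

# Look up each row's client_id in that Series; the row count is unchanged.
act_df["created_date"] = act_df["client_id"].map(earliest)

print(act_df)
```

A client_id that is missing from usr_df would get NaT in created_date rather than being dropped, which differs from the inner merge.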
