How do I insert a column with the order in which purchases were made using Pandas?

I have the following DataFrame of purchases:

[in]:

import pandas as pd

# Each row: client_id, purchase_date (MM/DD/YYYY string), product_name
data = [[1, '01/01/2020', 'product_A'],
        [1, '01/02/2020', 'product_B'],
        [1, '01/01/2020', 'product_C'],
        [2, '01/01/2020', 'product_A'],
        [2, '01/09/2020', 'product_B'],
        [3, '01/01/2020', 'product_C'],
        [4, '01/01/2020', 'product_A'],
        [5, '01/09/2020', 'product_B'],
        [5, '01/01/2020', 'product_C'],
        [5, '01/14/2020', 'product_A'],
        [2, '01/09/2020', 'product_B'],
        [1, '01/01/2020', 'product_C']]

df = pd.DataFrame(data, columns=['client_id', 'purchase_date', 'product_name'])

df


[out]:

client_id   purchase_date   product_name
0   1   01/01/2020  product_A
1   1   01/02/2020  product_B
2   1   01/01/2020  product_C
3   2   01/01/2020  product_A
4   2   01/09/2020  product_B
5   3   01/01/2020  product_C
6   4   01/01/2020  product_A
7   5   01/09/2020  product_B
8   5   01/01/2020  product_C
9   5   01/14/2020  product_A
10  2   01/09/2020  product_B
11  1   01/01/2020  product_C

I need to add a column that contains the order of each purchase.

I have already managed to do this by brute force, using a for loop:

[in]:

# Sort so that each client's rows are contiguous and ordered by date.
df = df.sort_values(["client_id", "purchase_date"], ascending=(True, True))

# Track the client and date seen on the previous row.
first_client = df['client_id'].iloc[0]
first_date = df['purchase_date'].iloc[0]

purchase_order = []
purchase_order_index = 1

for index, row in df.iterrows():

  client = row['client_id']
  date = row['purchase_date']

  if client != first_client:
    # New client: restart the counter.
    first_client = client
    first_date = date
    purchase_order_index = 1

  elif first_date != date:
    # Same client, new date: this is the next purchase.
    first_date = date
    purchase_order_index += 1

  purchase_order.append(purchase_order_index)

df['purchase_order_index'] = purchase_order

[out]:

client_id   purchase_date   product_name    purchase_order_index
0   1   01/01/2020  product_A   1
2   1   01/01/2020  product_C   1
11  1   01/01/2020  product_C   1
1   1   01/02/2020  product_B   2
3   2   01/01/2020  product_A   1
4   2   01/09/2020  product_B   2
10  2   01/09/2020  product_B   2
5   3   01/01/2020  product_C   1
6   4   01/01/2020  product_A   1
8   5   01/01/2020  product_C   1
7   5   01/09/2020  product_B   2
9   5   01/14/2020  product_A   3

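This gives the expected result, but I know that .iterrows() is not the best approach, and I believe there is a more efficient solution. I am trying to learn Pandas best practices. Can anyone tell me how to do this properly?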

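Note: the 'purchase_order_index' column shows the order of purchases for each client. To determine this order I use the 'purchase_date' column. Items bought on the same day count as a single purchase.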
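One caveat when ordering by 'purchase_date': the column holds MM/DD/YYYY strings, so sorting it compares text rather than dates. That happens to work for the data above because every row falls in January 2020, but it would not hold in general. A minimal sketch of the difference, using made-up sample values that are not part of the data above:

```python
import pandas as pd

# Hypothetical sample values, chosen only to show the effect.
dates = pd.Series(['01/09/2020', '12/01/2019', '02/01/2020'])

# Sorting the raw strings compares characters left to right,
# so the December 2019 date ends up last.
as_text = dates.sort_values().tolist()

# Parsing to datetime first gives a chronological order.
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
as_dates = parsed.sort_values().dt.strftime('%m/%d/%Y').tolist()

print(as_text)
print(as_dates)
```

Converting the column with pd.to_datetime before sorting or ranking avoids this.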

Use rank with method='dense':

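# Parse the dates first (the output below shows them as datetimes).
df['purchase_date'] = pd.to_datetime(df['purchase_date'], format='%m/%d/%Y')

temp = df.sort_values(['client_id', 'purchase_date'])
# Dense rank within each client: equal dates share a rank, with no gaps.
(temp.assign(purchase_order_index = temp.groupby('client_id', sort = False)
                                        .purchase_date
                                        .rank(method='dense')
                                        .astype(int))
)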
    client_id purchase_date product_name  purchase_order_index
0           1    2020-01-01    product_A                     1
2           1    2020-01-01    product_C                     1
11          1    2020-01-01    product_C                     1
1           1    2020-01-02    product_B                     2
3           2    2020-01-01    product_A                     1
4           2    2020-01-09    product_B                     2
10          2    2020-01-09    product_B                     2
5           3    2020-01-01    product_C                     1
6           4    2020-01-01    product_A                     1
8           5    2020-01-01    product_C                     1
7           5    2020-01-09    product_B                     2
9           5    2020-01-14    product_A                     3
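
If the original row order should be kept, the sort is optional: rank is computed within each group and the result is aligned to the DataFrame's index. Below is a self-contained sketch of that variant. It assumes the dates are in MM/DD/YYYY format, as in the question, and it writes the column back into df, which is a choice made here rather than something the snippet above does.

```python
import pandas as pd

data = [[1, '01/01/2020', 'product_A'],
        [1, '01/02/2020', 'product_B'],
        [1, '01/01/2020', 'product_C'],
        [2, '01/01/2020', 'product_A'],
        [2, '01/09/2020', 'product_B'],
        [3, '01/01/2020', 'product_C'],
        [4, '01/01/2020', 'product_A'],
        [5, '01/09/2020', 'product_B'],
        [5, '01/01/2020', 'product_C'],
        [5, '01/14/2020', 'product_A'],
        [2, '01/09/2020', 'product_B'],
        [1, '01/01/2020', 'product_C']]

df = pd.DataFrame(data, columns=['client_id', 'purchase_date', 'product_name'])

# Assumption: dates are MM/DD/YYYY, so parse them explicitly.
df['purchase_date'] = pd.to_datetime(df['purchase_date'], format='%m/%d/%Y')

# Dense rank of the date within each client; the result is indexed
# like df, so it can be assigned directly without sorting first.
df['purchase_order_index'] = (
    df.groupby('client_id')['purchase_date']
      .rank(method='dense')
      .astype(int)
)

# Sort only for display, to compare with the output above.
print(df.sort_values(['client_id', 'purchase_date']))
```

method='dense' is what makes same-day purchases share one index and keeps the next index consecutive; the default method='average' would give tied rows fractional ranks instead.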