How to insert a column with the order in which the purchases were made using Pandas?
I have the following purchases DataFrame:
[in]:
import pandas as pd

# One row per purchased item: client, purchase date (month/day/year), product
data = [[1, '01/01/2020', 'product_A'],
        [1, '01/02/2020', 'product_B'],
        [1, '01/01/2020', 'product_C'],
        [2, '01/01/2020', 'product_A'],
        [2, '01/09/2020', 'product_B'],
        [3, '01/01/2020', 'product_C'],
        [4, '01/01/2020', 'product_A'],
        [5, '01/09/2020', 'product_B'],
        [5, '01/01/2020', 'product_C'],
        [5, '01/14/2020', 'product_A'],
        [2, '01/09/2020', 'product_B'],
        [1, '01/01/2020', 'product_C']]
df = pd.DataFrame(data, columns=['client_id', 'purchase_date', 'product_name'])
df
[out]:
    client_id purchase_date product_name
0           1    01/01/2020    product_A
1           1    01/02/2020    product_B
2           1    01/01/2020    product_C
3           2    01/01/2020    product_A
4           2    01/09/2020    product_B
5           3    01/01/2020    product_C
6           4    01/01/2020    product_A
7           5    01/09/2020    product_B
8           5    01/01/2020    product_C
9           5    01/14/2020    product_A
10          2    01/09/2020    product_B
11          1    01/01/2020    product_C
I need to add a column containing the order of each purchase.
I have managed to do this by brute force, using a for loop:
[in]:
# Sort by client, then by date, so each client's rows are contiguous and in date order.
# purchase_date is still a string here, so this is a text sort.
df = df.sort_values(["client_id", "purchase_date"], ascending=(True, True))
first_client = df['client_id'].iloc[0]
first_date = df['purchase_date'].iloc[0]
purchase_order = []
purchase_order_index = 1
for index, row in df.iterrows():
    client = row['client_id']
    date = row['purchase_date']
    if client != first_client:
        # New client: restart the counter.
        first_client = client
        first_date = date
        purchase_order_index = 1
    elif first_date != date:
        # Same client, new date: this is the next purchase.
        first_date = date
        purchase_order_index += 1
    purchase_order.append(purchase_order_index)
df['purchase_order_index'] = purchase_order
[out]:
    client_id purchase_date product_name purchase_order_index
0           1    01/01/2020    product_A                    1
2           1    01/01/2020    product_C                    1
11          1    01/01/2020    product_C                    1
1           1    01/02/2020    product_B                    2
3           2    01/01/2020    product_A                    1
4           2    01/09/2020    product_B                    2
10          2    01/09/2020    product_B                    2
5           3    01/01/2020    product_C                    1
6           4    01/01/2020    product_A                    1
8           5    01/01/2020    product_C                    1
7           5    01/09/2020    product_B                    2
9           5    01/14/2020    product_A                    3
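I got the expected result, but I know that using .iterrows() is not the best solution. I am sure there is a more efficient one. I am trying to learn Pandas best practices. Can anyone tell me how to do this properly?
Clarification: the 'purchase_order_index' column shows the order of purchases for each client. To determine this order, I use the 'purchase_date' column. Items bought on the same day count as a single purchase.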
Use rank with method='dense':
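# The output below shows purchase_date as datetimes, so the column was
# presumably converted (e.g. with pd.to_datetime) before this step.
temp = df.sort_values(['client_id', 'purchase_date'])
(temp.assign(purchase_order_index=temp.groupby('client_id', sort=False)
                                      .purchase_date
                                      .rank(method='dense')
                                      .astype(int))
)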
    client_id purchase_date product_name purchase_order_index
0           1    2020-01-01    product_A                    1
2           1    2020-01-01    product_C                    1
11          1    2020-01-01    product_C                    1
1           1    2020-01-02    product_B                    2
3           2    2020-01-01    product_A                    1
4           2    2020-01-09    product_B                    2
10          2    2020-01-09    product_B                    2
5           3    2020-01-01    product_C                    1
6           4    2020-01-01    product_A                    1
8           5    2020-01-01    product_C                    1
7           5    2020-01-09    product_B                    2
9           5    2020-01-14    product_A                    3
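For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It assumes the dates are month/day/year (which the value 01/14/2020 suggests) and parses them with pd.to_datetime before ranking; that parsing step is my assumption, not something shown in the snippet above. The `result` name is just for illustration.

```python
import pandas as pd

data = [[1, '01/01/2020', 'product_A'],
        [1, '01/02/2020', 'product_B'],
        [1, '01/01/2020', 'product_C'],
        [2, '01/01/2020', 'product_A'],
        [2, '01/09/2020', 'product_B'],
        [3, '01/01/2020', 'product_C'],
        [4, '01/01/2020', 'product_A'],
        [5, '01/09/2020', 'product_B'],
        [5, '01/01/2020', 'product_C'],
        [5, '01/14/2020', 'product_A'],
        [2, '01/09/2020', 'product_B'],
        [1, '01/01/2020', 'product_C']]
df = pd.DataFrame(data, columns=['client_id', 'purchase_date', 'product_name'])

# Assumption: month/day/year strings. Parsing makes the ordering chronological
# instead of alphabetical.
df['purchase_date'] = pd.to_datetime(df['purchase_date'], format='%m/%d/%Y')

# Dense rank within each client: rows with the same date tie and share a rank,
# and ranks go up by 1 with no gaps. rank() returns floats, hence astype(int).
df['purchase_order_index'] = (
    df.groupby('client_id')['purchase_date']
      .rank(method='dense')
      .astype(int)
)

# Sorting is only for display; the ranks do not depend on row order.
result = df.sort_values(['client_id', 'purchase_date'])
print(result)
```

Since rank aligns on the index, sorting beforehand is optional: the new column can be assigned straight onto the unsorted frame, and the sort is only needed if you want the rows displayed in purchase order.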