Python: Return the first value for each product based on dates
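I am looking for an iterative way to create a subset of a pandas DataFrame based on date and product. I want to keep the first row for each product within a 2-week window.
So for df A: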
Date,Product,Return
1/1/2020,ABC,0.00993
1/2/2020,ABC,0.04231
1/4/2020,ABC,0.04231
1/30/2020,ABC,0.04231
2/20/2020,ABC,0.01408
6/15/2020,XYZ,0.04868
6/16/2020,XYZ,0.05284
6/19/2020,XYZ,0.05284
6/25/2020,XYZ,0.01578
8/25/2020,XYZ,0.03248
9/25/2020,XYZ,0.03248
10/12/2020,XYZ,0.0375
12/2/2020,XYZ,0.02589
6/11/2020,EFG,0.02589
7/13/2020,EFG,0.02589
7/17/2020,EFG,0.02859
7/21/2020,EFG,0.02084
7/27/2020,EFG,0.05154
7/29/2020,EFG,0.05154
9/8/2020,EFG,0.0616
9/14/2020,EFG,0.04092
9/18/2020,EFG,0.01578
9/22/2020,EFG,0.03248
6/9/2020,ASD,0.03248
I want df B returned:
Date,Product,Return
1/1/2020,ABC,0.00993
1/30/2020,ABC,0.04231
2/20/2020,ABC,0.01408
6/15/2020,XYZ,0.04868
8/25/2020,XYZ,0.03248
9/25/2020,XYZ,0.03248
10/12/2020,XYZ,0.0375
12/2/2020,XYZ,0.02589
6/11/2020,EFG,0.02589
7/13/2020,EFG,0.02589
7/27/2020,EFG,0.05154
9/8/2020,EFG,0.0616
6/9/2020,ASD,0.03248
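My full DataFrame has 10k products. I tried using .loc to create a variable based on a datetime/timedelta, but the result can end up being based on the previous product's date.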
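You need some way to group the rows by week difference. I would suggest converting the dates to the week of the year (ISO week number), grouping by product, and taking the diff() between consecutive weeks for that product. With that we can work out which differences are greater than 1 and use cumsum() to increment the group counter so those rows do not end up in the same group. The final column 'c' is the additional grouping column. Group on Product and c and use .head(1) to get the first row of each group.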
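To see how the diff() / gt(1) / cumsum() chain turns week numbers into group labels, here is a minimal sketch on a toy Series. The values are the ISO week numbers of the five ABC rows from df A; the variable names are illustrative only.

```python
import pandas as pd

# ISO week numbers of the five ABC rows (1/1, 1/2, 1/4, 1/30, 2/20),
# already sorted by date.
weeks = pd.Series([1, 1, 1, 5, 8], name="week")

gaps = weeks.diff().fillna(0)               # 0, 0, 0, 4, 3
new_group = gaps.gt(1)                      # True where the week jumps by more than 1
group_id = new_group.astype(int).cumsum()   # running count of jumps = group label

print(group_id.tolist())  # [0, 0, 0, 1, 2]
```

Each True increments the running sum, so rows in consecutive weeks share a label and a jump starts a new one.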
import pandas as pd

df = pd.DataFrame({'Date': ['1/1/2020','1/2/2020','1/4/2020','1/30/2020',
    '2/20/2020','6/15/2020','6/16/2020','6/19/2020','6/25/2020',
    '8/25/2020','9/25/2020','10/12/2020','12/2/2020','6/11/2020',
    '7/13/2020','7/17/2020','7/21/2020','7/27/2020', '7/29/2020',
    '9/8/2020','9/14/2020','9/18/2020','9/22/2020','6/9/2020'],
    'Product': ['ABC','ABC','ABC','ABC','ABC','XYZ','XYZ',
    'XYZ','XYZ','XYZ','XYZ','XYZ','XYZ','EFG','EFG','EFG',
    'EFG','EFG','EFG','EFG','EFG','EFG','EFG','ASD'],
    'Return': [0.00993,0.04231,0.04231,0.04231,0.01408,0.04868,
    0.05284,0.05284,0.015780000000000002,0.03248,0.03248,
    0.0375, 0.025889999999999996,0.025889999999999996,
    0.025889999999999996,0.028589999999999997,
    0.02084,0.051539999999999996,0.051539999999999996,
    0.0616,0.04092,0.015780000000000002,0.03248,0.03248]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(by='Date').reset_index(drop=True)
# ISO week of the year, cast to a plain int so diff() works on signed values
df['week'] = df['Date'].dt.isocalendar().week.astype(int)
# start a new group whenever a product's week number jumps by more than 1
df['c'] = df.groupby('Product')['week'].diff().fillna(0).gt(1).astype(int).cumsum()
# keep the first row of each (Product, c) group
df = df.groupby(['Product','c']).head(1)
df = df.drop(columns=['week','c'])
Output
Date Product Return
0 2020-01-01 ABC 0.00993
3 2020-01-30 ABC 0.04231
4 2020-02-20 ABC 0.01408
5 2020-06-09 ASD 0.03248
6 2020-06-11 EFG 0.02589
7 2020-06-15 XYZ 0.04868
11 2020-07-13 EFG 0.02589
16 2020-08-25 XYZ 0.03248
17 2020-09-08 EFG 0.06160
21 2020-09-25 XYZ 0.03248
22 2020-10-12 XYZ 0.03750
23 2020-12-02 XYZ 0.02589
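Note that this output does not contain the 7/27/2020 EFG row listed in df B: the week-number approach chains rows in consecutive weeks, so 7/13 through 7/29 fall into one group. Week numbers also restart at a year boundary. If the intent is a window anchored at the last kept row, one possible alternative is sketched below. The helper name `first_per_window` and the rule "a row at least 14 days after the anchor starts a new window" are assumptions, not part of the original answer; with that rule the 9/22/2020 EFG row (exactly 14 days after 9/8) would also be kept, which df B does not show, so the boundary condition may need adjusting.

```python
import pandas as pd

def first_per_window(df, days=14):
    """Hypothetical helper: per product, keep the first row, then the next
    row that is at least `days` after the last kept row, and so on."""
    df = df.sort_values(["Product", "Date"])
    window = pd.Timedelta(days=days)
    keep = []
    for _, grp in df.groupby("Product", sort=False):
        anchor = None  # date of the last kept row for this product
        for idx, date in grp["Date"].items():
            if anchor is None or date - anchor >= window:
                keep.append(idx)
                anchor = date
    return df.loc[keep]

# The EFG rows from mid to late July in df A.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["7/13/2020", "7/17/2020", "7/21/2020",
                            "7/27/2020", "7/29/2020"]),
    "Product": ["EFG"] * 5,
    "Return": [0.02589, 0.02859, 0.02084, 0.05154, 0.05154],
})
result = first_per_window(sample)
print(result["Date"].dt.strftime("%Y-%m-%d").tolist())
# ['2020-07-13', '2020-07-27']
```

The explicit loop is needed because each decision depends on the previously kept row, which a single diff() cannot express. It is a plain Python loop, so it may be slower than the vectorized version on very large frames.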