Need to expand an inventory journal (log) pandas DataFrame to include all dates per product id
I have an inventory journal that contains products along with their running stock quantity (resulting_qty) and the loss/gain (delta_qty) recorded each time stock is added or subtracted.
The problem is that the journal is not updated daily; a row is only written when the stock level changes. Because of this, it is hard to pull the total stock quantity across all items for a given date: some items have no record on that date even though they do have stock on hand, since their last recorded resulting_qty was greater than 0. Logically, that just means an item had no quantity change for some number of days, equal to the gap between the maximum date and its last recorded date.
My data looks like this, except in reality there are thousands of product IDs:
| date | timestamp | pid | delta_qty | resulting_qty |
|------------|---------------------|-----|-----------|---------------|
| 2017-03-06 | 2017-03-06 12:24:22 | A | 0 | 0.0 |
| 2017-03-31 | 2017-03-31 02:43:11 | A | 3 | 3.0 |
| 2017-04-08 | 2017-04-08 22:04:35 | A | -1 | 2.0 |
| 2017-04-12 | 2017-04-12 18:26:39 | A | -1 | 1.0 |
| 2017-04-19 | 2017-04-19 09:15:38 | A | -1 | 0.0 |
| 2019-01-16 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-19 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-05 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-22 | 2019-04-22 11:06:33 | B | -1 | 1.0 |
| 2019-04-23 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-05-09 | 2019-05-09 16:25:41 | C | 2 | 2.0 |
Essentially, I need the data to look more like the table below, so that I can simply filter on a date and get the total stock for that date by grouping on it (e.g. df.groupby('date').resulting_qty.sum()); there is a short sketch of that query after the table.
Note: I removed the PID = A rows because of the character limit, but I hope you get the idea:
| date | timestamp | pid | delta_qty | resulting_qty |
|------------|---------------------|-----|-----------|---------------|
| 2019-01-16 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-17 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-18 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-19 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-20 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-21 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-22 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-23 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-24 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-25 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-26 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-27 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-28 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-29 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-30 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-01-31 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-01 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-02 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-03 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-04 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-05 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-06 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-07 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-08 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-09 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-10 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-11 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-12 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-13 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-14 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-15 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-16 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-17 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-18 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-19 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-20 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-21 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-22 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-23 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-24 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-25 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-26 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-27 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-02-28 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-01 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-02 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-03 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-04 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-05 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-06 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-07 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-08 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-09 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-10 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-11 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-12 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-13 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-14 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-15 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-16 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-17 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-18 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-19 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-20 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-21 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-22 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-23 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-24 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-25 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-26 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-27 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-28 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-29 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-30 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-03-31 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-04-01 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-04-02 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-04-03 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-04-04 | 2019-01-16 23:37:17 | B | 0 | 0.0 |
| 2019-04-05 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-06 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-07 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-08 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-09 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-10 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-11 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-12 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-13 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-14 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-15 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-16 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-17 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-18 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-19 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-20 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-21 | 2019-04-05 16:40:32 | B | 2 | 2.0 |
| 2019-04-22 | 2019-04-22 11:06:33 | B | -1 | 1.0 |
| 2019-04-23 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-04-24 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-04-25 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-04-26 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-04-27 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-04-28 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-04-29 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-04-30 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-05-01 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-05-02 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-05-03 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-05-04 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-05-05 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-05-06 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-05-07 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-05-08 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-05-09 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-05-10 | 2019-04-23 13:23:17 | B | -1 | 0.0 |
| 2019-01-19 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-01-20 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-01-21 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-01-22 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-01-23 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-01-24 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-01-25 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-01-26 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-01-27 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-01-28 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-01-29 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-01-30 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-01-31 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-01 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-02 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-03 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-04 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-05 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-06 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-07 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-08 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-09 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-10 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-11 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-12 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-13 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-14 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-15 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-16 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-17 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-18 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-19 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-20 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-21 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-22 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-23 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-24 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-25 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-26 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-27 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-02-28 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-01 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-02 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-03 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-04 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-05 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-06 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-07 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-08 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-09 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-10 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-11 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-12 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-13 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-14 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-15 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-16 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-17 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-18 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-19 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-20 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-21 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-22 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-23 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-24 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-25 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-26 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-27 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-28 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-29 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-30 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-03-31 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-01 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-02 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-03 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-04 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-05 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-06 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-07 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-08 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-09 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-10 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-11 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-12 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-13 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-14 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-15 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-16 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
| 2019-04-17 | 2019-01-19 09:40:38 | C | 0 | 0.0 |
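To make the goal concrete, this is roughly the query I want to be able to run once the data is expanded (a sketch, assuming the expanded table is a dataframe df with a datetime date column):

# total resulting_qty across all products for each calendar day,
# computed on the expanded (gap-free) data
daily_stock = df.groupby('date')['resulting_qty'].sum()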
What I have done so far is write a set of loops that generate a date range between each product's earliest date and the maximum date across all products. Then, whenever a new date has no record, I append the last recorded row's values as a new row with the new date. I append everything to lists and then build a new dataframe from the updated lists. The code is extremely slow and takes more than 2 hours to run over the whole dataset:
date_list = []
pid_list = []
time_stamp_list = []
delta_qty_list = []
resulting_qty_list = []
timer = len(test.product_id.unique().tolist())
counter = 0
for product in test.product_id.unique().tolist():
    counter += 1
    print((counter / timer) * 100)
    temp_df = test.query(f'product_id=={product}', engine='python')
    for idx, date in enumerate(pd.date_range(temp_df.index.min(), test.index.max()).tolist()):
        min_date = temp_df.index.min()
        if date.date() == min_date:
            date2 = min_date
            pid = temp_df.loc[date2]['product_id']
            timestamp = temp_df.loc[date2]['timestamp']
            delta_qty = temp_df.loc[date2]['delta_qty']
            resulting_qty = temp_df.loc[date2]['resulting_qty']
            date_list.append(date2)
            pid_list.append(pid)
            delta_qty_list.append(delta_qty)
            time_stamp_list.append(timestamp)
            resulting_qty_list.append(resulting_qty)
        else:
            if date.date() in temp_df.index:
                date2 = date.date()
                pid = temp_df.loc[date2]['product_id']
                timestamp = temp_df.loc[date2]['timestamp']
                delta_qty = temp_df.loc[date2]['delta_qty']
                resulting_qty = temp_df.loc[date2]['resulting_qty']
                date_list.append(date2)
                pid_list.append(pid)
                delta_qty_list.append(delta_qty)
                time_stamp_list.append(timestamp)
                resulting_qty_list.append(resulting_qty)
            elif date.date() > date2:
                date_list.append(date.date())
                pid_list.append(pid)
                time_stamp_list.append(timestamp)
                delta_qty_list.append(delta_qty)
                resulting_qty_list.append(resulting_qty)
            else:
                pass
Could someone help me understand the right way to approach this? I'm 100% sure this is not the best way.
Thanks
The idea here is to reindex the DataFrame to fill the gaps.
Setup, using a DataFrame generated from your example:
import pandas as pd
from io import StringIO
buffer = StringIO()
buffer.write('''\
date|timestamp|pid|delta_qty|resulting_qty
2017-03-06|2017-03-06 12:24:22|A|0|0.0
2017-03-31|2017-03-31 02:43:11|A|3|3.0
2017-04-08|2017-04-08 22:04:35|A|-1|2.0
2017-04-12|2017-04-12 18:26:39|A|-1|1.0
2017-04-19|2017-04-19 09:15:38|A|-1|0.0
2019-01-16|2019-01-16 23:37:17|B|0|0.0
2019-01-19|2019-01-19 09:40:38|C|0|0.0
2019-04-05|2019-04-05 16:40:32|B|2|2.0
2019-04-22|2019-04-22 11:06:33|B|-1|1.0
2019-04-23|2019-04-23 13:23:17|B|-1|0.0
2019-05-09|2019-05-09 16:25:41|C|2|2.0
''')
buffer.seek(0)
df = pd.read_csv(buffer, sep='|', parse_dates=['date', 'timestamp'])
First, we generate a new gap-less index between each product's min and max date. Per your example, this means a product has no rows after its last existing update. This step is easy to customise to your exact requirements, though; for example, if you want dates to run from a product's first entry through today, you can set start and end manually (a sketch of that variant follows the next code block).
from itertools import chain, cycle

# first and last journal date per product
date_ranges = df.groupby('pid').agg({'date': ['min', 'max']})

# (pid, date) pairs covering every calendar day in each product's range
pairs = (zip(cycle([pid]), pd.date_range(start, end))
         for pid, (start, end) in date_ranges.iterrows())
new_index = pd.Index(chain.from_iterable(pairs), name=['pid', 'date'])
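For instance, a variant that carries every product forward to today instead of stopping at its last recorded date could build the pairs like this (a minimal sketch; using today as the end date is an assumption about your requirements):

# same construction as above, but every pid's range ends today
today = pd.Timestamp.today().normalize()
pairs_to_today = (zip(cycle([pid]), pd.date_range(start, today))
                  for pid, (start, _end) in date_ranges.iterrows())
new_index = pd.Index(chain.from_iterable(pairs_to_today), name=['pid', 'date'])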
Then we apply the new index. Here we have two options:
- per your example, forward-fill every column exactly as of the last update
- fill delta_qty with 0 and forward-fill the remaining columns from the last update (this deviates slightly from what you asked for, but seems logical and is only a small change)
In either case, the two essential tools are the .reindex method and the .fillna method. We can use reindex to expand the dense DataFrame so that it contains all dates, with sparse data. Then we fill the resulting NaNs with the correct data. Since we are forward-padding from the last update, we want method='ffill', per the docs.
Option 1:
# this fills the rows per last update
results = df.set_index(['pid', 'date'])\
.reindex(new_index).reset_index()
results.fillna(method='ffill', inplace=True)
This returns:
pid date timestamp delta_qty resulting_qty
0 A 2017-03-06 2017-03-06 12:24:22 0.0 0.0
1 A 2017-03-07 2017-03-06 12:24:22 0.0 0.0
2 A 2017-03-08 2017-03-06 12:24:22 0.0 0.0
3 A 2017-03-09 2017-03-06 12:24:22 0.0 0.0
.. .. ... ... ... ...
24 A 2017-03-30 2017-03-06 12:24:22 0.0 0.0
25 A 2017-03-31 2017-03-31 02:43:11 3.0 3.0
.. .. ... ... ... ...
29 A 2017-04-04 2017-03-31 02:43:11 3.0 3.0
for pid == 'A'.
Option 2:
results = df.set_index(['pid', 'date'])\
.reindex(new_index).reset_index()
results['delta_qty'].fillna(0, inplace=True)
results.fillna(method='ffill', inplace=True)
This returns:
pid date timestamp delta_qty resulting_qty
0 A 2017-03-06 2017-03-06 12:24:22 0.0 0.0
1 A 2017-03-07 2017-03-06 12:24:22 0.0 0.0
2 A 2017-03-08 2017-03-06 12:24:22 0.0 0.0
3 A 2017-03-09 2017-03-06 12:24:22 0.0 0.0
.. .. ... ... ... ...
24 A 2017-03-30 2017-03-06 12:24:22 0.0 0.0
25 A 2017-03-31 2017-03-31 02:43:11 3.0 3.0
.. .. ... ... ... ...
29 A 2017-04-04 2017-03-31 02:43:11 0.0 3.0
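As a quick sanity check (a minimal sketch, not part of the fill itself), you can verify that results now has exactly one row per calendar day for every pid:

# every pid should have one row per day between its first and last date
per_pid = results.groupby('pid')['date'].agg(['min', 'max', 'size'])
assert (per_pid['size'] == (per_pid['max'] - per_pid['min']).dt.days + 1).all()

From there, results.groupby('date')['resulting_qty'].sum() gives the per-date stock totals you were after.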