Pandas: trying to turn the values within a column into new columns after a groupby on another column

My original DataFrame looks like this:

    timestamp                     variables     value

1   2017-05-26 19:46:41.289       inf           0.000000
2   2017-05-26 20:40:41.243       tubavg        225.489639
... ... ... ...
899541  2017-05-02 20:54:41.574   caspre        684.486450
899542  2017-04-29 11:17:25.126   tvol          50.895000

Now I want to bucket this dataset by time, which can be done with the following code:

df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
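
Note that `groupby` on its own returns a lazy GroupBy object rather than a DataFrame, so an aggregation still has to be applied before the buckets can be inspected. A minimal sketch, assuming a hypothetical sample built from the four rows shown above and using `size()` purely for illustration:

```python
import pandas as pd

# Hypothetical sample built from the four rows visible in the question.
df = pd.DataFrame({
    "timestamp": [
        "2017-05-26 19:46:41.289",
        "2017-05-26 20:40:41.243",
        "2017-05-02 20:54:41.574",
        "2017-04-29 11:17:25.126",
    ],
    "variables": ["inf", "tubavg", "caspre", "tvol"],
    "value": [0.000000, 225.489639, 684.486450, 50.895000],
})
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")

# This is a DataFrameGroupBy; nothing is computed until an aggregation runs.
grouped = df.groupby(pd.Grouper(key="timestamp", freq="5min"))

# Number of rows per 5-minute bucket (empty buckets are included with a count of 0).
bucket_sizes = grouped.size()
```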

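But I also want all the different metrics to become columns in the new DataFrame. For example, the first two rows of the original DataFrame would then look like this: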

       timestamp                     inf         tubavg         caspre         tvol      ...

1      2017-05-26 19:46:41.289       0.000000    225.489639     xxxxxxx        xxxxx
... ... ... ...
xxxxx  2017-05-02 20:54:41.574       xxxxxx      xxxxxx         684.486450     50.895000

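As you can see, the time is now bucketed into 5-minute intervals, and every distinct value of variables becomes a column across all buckets. Each bucket takes the first value that fell into it.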

I have tried a few different solutions to this, but I can't seem to find one that doesn't keep raising errors.

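  1. Try using .unstack(1) to pivot the variables column from rows into columns. The argument is 1 because we want the second index level (0 would be the first).
  2. Then drop the extra level of the column MultiIndex you just created with .droplevel(), to make it a bit cleaner.
  3. Finally, use pd.Grouper. Since the date/time is on the index, you don't need to specify a key.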

df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.set_index(['timestamp','variables']).unstack(1)
df.columns = df.columns.droplevel()
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]: 
variables           timestamp  caspre  inf      tubavg    tvol
0         2017-04-29 11:15:00     NaN  NaN         NaN  50.895
1         2017-04-29 11:20:00     NaN  NaN         NaN     NaN
2         2017-04-29 11:25:00     NaN  NaN         NaN     NaN
3         2017-04-29 11:30:00     NaN  NaN         NaN     NaN
4         2017-04-29 11:35:00     NaN  NaN         NaN     NaN
                      ...     ...  ...         ...     ...
7885      2017-05-26 20:20:00     NaN  NaN         NaN     NaN
7886      2017-05-26 20:25:00     NaN  NaN         NaN     NaN
7887      2017-05-26 20:30:00     NaN  NaN         NaN     NaN
7888      2017-05-26 20:35:00     NaN  NaN         NaN     NaN
7889      2017-05-26 20:40:00     NaN  NaN  225.489639     NaN
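
One caveat with this approach: .unstack(1) needs every (timestamp, variables) pair to be unique and raises a ValueError otherwise. A minimal sketch of one way around that, assuming a hypothetical sample with a duplicated pair and swapping in `pivot_table` with a `pd.Grouper` as the index, so that bucketing, aggregating and reshaping happen in one step:

```python
import pandas as pd

# Hypothetical sample: the first two rows share the same (timestamp, variables) pair.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-05-26 19:46:41.289",
        "2017-05-26 19:46:41.289",
        "2017-05-26 19:47:10.000",
    ]),
    "variables": ["inf", "inf", "tubavg"],
    "value": [0.0, 2.0, 225.489639],
})

# set_index + unstack cannot reshape when index entries are duplicated.
try:
    df.set_index(["timestamp", "variables"]).unstack(1)
    unstack_failed = False
except ValueError:
    unstack_failed = True

# pivot_table aggregates duplicates (here with the mean) while reshaping.
bucketed = df.pivot_table(
    index=pd.Grouper(key="timestamp", freq="5min"),
    columns="variables",
    values="value",
    aggfunc="mean",
).reset_index()
```

Unlike the output above, this only keeps buckets that actually contain data.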

Another approach is to .groupby on variables as well, and then .unstack(1) again:

df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]: 
variables           timestamp     caspre  inf      tubavg    tvol
0         2017-04-29 11:15:00        NaN  NaN         NaN  50.895
1         2017-05-02 20:50:00  684.48645  NaN         NaN     NaN
2         2017-05-26 19:45:00        NaN  0.0         NaN     NaN
3         2017-05-26 20:40:00        NaN  NaN  225.489639     NaN
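
Both versions above aggregate with .mean(), whereas the question asks for the first value in each bucket. If that is what is wanted, .first() can be swapped in. A minimal sketch, assuming a hypothetical sample in which two inf readings fall into the same 5-minute bucket:

```python
import pandas as pd

# Hypothetical sample: two "inf" readings land in the 19:45 bucket.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-05-26 19:46:41.289",
        "2017-05-26 19:48:00.000",
        "2017-05-26 20:40:41.243",
    ]),
    "variables": ["inf", "inf", "tubavg"],
    "value": [0.0, 4.0, 225.489639],
})

# Sort so that "first" means the earliest reading within each bucket.
df = df.sort_values("timestamp")

first_wide = (
    df.groupby([pd.Grouper(key="timestamp", freq="5min"), "variables"])["value"]
    .first()               # first non-null value per (bucket, variable)
    .unstack("variables")  # one column per variable
    .reset_index()
)
```

Selecting the value column before aggregating yields a Series, so the unstacked result has single-level columns and no droplevel() call is needed.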