Getting a dataframe from a pandas groupby to write to parquet

Question

I have some CSV data with the following columns:

country, region, year, month, price, volume

I need to convert it into something like this:

country, region, datapoints

where datapoints is either an array of (year, month, price, volume) tuples or, better,

{ (year, month) : {price, volume} }

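Essentially, I am trying to reshape the data into a time series that can then be stored as parquet. For what it's worth, I am using fastparquet to write the dataframe to a parquet file.

Is this possible?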

Answer 1

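You can use apply to create the column 'datapoint':

# axis=1 applies the function to each row
df['datapoint'] = df.apply(lambda row: (row['year'], row['month'],
                                         row['price'], row['volume']), axis=1)

or

# the key is a (year, month) tuple; the value here is a set
df['datapoint_better'] = df.apply(lambda row: {(row['year'], row['month']):
                                                 {row['price'], row['volume']}}, axis=1)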

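Note that, as far as I know, you cannot use {row['year'],row['month']} as a dictionary key, because a set is not hashable; that is why the key above is a tuple.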
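To illustrate the point about keys, here is a minimal sketch in plain Python. The values are made up for illustration and are not from the question's data. It shows that a tuple works as a dictionary key while a set raises a TypeError.

```python
# Hypothetical values, for illustration only.
year, month, price, volume = 2017, 1, 10.5, 100

# A tuple is hashable, so it can serve as a dictionary key.
by_tuple = {(year, month): (price, volume)}

# A set is not hashable; using one as a key raises TypeError at runtime.
try:
    by_set = {{year, month}: (price, volume)}
    set_key_error = None
except TypeError as exc:
    set_key_error = str(exc)

print(by_tuple)
print(set_key_error)
```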

Then, if you want to get rid of the original columns:

df = df.drop(columns=['year', 'month', 'price', 'volume'])

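EDIT: OK, I missed the groupby. Anyway, you can first create two columns holding the key and the item:

df['key'] = df.apply(lambda row: (row['year'], row['month']), axis=1)
df['item'] = df.apply(lambda row: {row['price'], row['volume']}, axis=1)

Then do the groupby with apply, and use pd.Series.to_dict on these two columns, for example:

# bracket access avoids any clash between the 'item' column and an attribute of the same name
df_output = (df.groupby(['country', 'region'])
               .apply(lambda df_grouped: pd.Series(df_grouped['item'].values,
                                                   index=df_grouped['key']).to_dict())
               .reset_index().rename(columns={0: 'datapoints'}))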

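The reset_index and rename calls are there to get the expected output.

Note: I would suggest using a tuple rather than a set for the item as well, to prevent any ordering issues, since a set is unordered.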
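Following the note above, here is a hedged end-to-end sketch that uses tuples for both the key and the item. It is a variation on the approach, not the exact code from the answer: it iterates over the groups instead of calling groupby.apply. The sample rows and the helper name to_datapoints are made up for illustration.

```python
import pandas as pd

# Made-up sample rows with the columns from the question.
df = pd.DataFrame({
    "country": ["FR", "FR", "FR", "DE"],
    "region": ["North", "North", "South", "West"],
    "year": [2017, 2017, 2017, 2017],
    "month": [1, 2, 1, 1],
    "price": [10.0, 11.0, 9.5, 12.0],
    "volume": [100, 120, 80, 90],
})

def to_datapoints(group):
    # Hypothetical helper: map (year, month) -> (price, volume) for one group.
    return {
        (int(y), int(m)): (float(p), int(v))
        for y, m, p, v in zip(group["year"], group["month"],
                              group["price"], group["volume"])
    }

# One output row per (country, region) pair, with a dict in 'datapoints'.
rows = [
    {"country": country, "region": region, "datapoints": to_datapoints(group)}
    for (country, region), group in df.groupby(["country", "region"])
]
df_output = pd.DataFrame(rows)
print(df_output)
```

Because both the key and the item are tuples, the order of price and volume inside each item is fixed, which a set would not guarantee.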

python

pandas

parquet

pandas-groupby

fastparquet