Cannot save a pandas DataFrame to Parquet with lists of floats as cell values

I have a data frame with the following structure:

                                               Column1                                            Column2
0    (0.00030271668219938874, 0.0002655923890415579...  (0.0016430083196610212, 0.0014970217598602176,...
1    (0.00015607803652528673, 0.0001314736582571640...  (0.0022136708721518517, 0.0014974646037444472,...
2    (0.011317798867821693, 0.011339936405420303, 0...  (0.004868391435593367, 0.004406007472425699, 0...
3    (3.94578673876822e-05, 3.075833956245333e-05, ...  (0.0075020878575742245, 0.0096737677231431, 0....
4    (0.0004926157998852432, 0.0003811710048466921,...  (0.010351942852139473, 0.008231297135353088, 0...
..                                                 ...                                                ...
130  (0.011190211400389671, 0.011337820440530777, 0...  (0.010182800702750683, 0.011351295746862888, 0...
131  (0.006286659277975559, 0.007315031252801418, 0...  (0.02104150503873825, 0.02531484328210354, 0.0...
132  (0.0022791570518165827, 0.0025983047671616077,...  (0.008847278542816639, 0.009222050197422504, 0...
133  (0.0007059817435219884, 0.0009831463685259223,...  (0.0028264704160392284, 0.0029402063228189945,...
134  (0.0018992726691067219, 0.002058899961411953, ...  (0.0019639385864138603, 0.002009353833273053, ...

[135 rows x 2 columns]

where each cell contains a list/tuple of float values:

type(psd_res.data_frame['Column1'][0])
<class 'tuple'>
type(psd_res.data_frame['Column1'][0][0])
<class 'numpy.float64'>

(The tuple in every cell contains the same number of entries.)

When I try to save the data frame as Parquet, I get an error (fastparquet):

Can't infer object conversion type: 0    (0.00030271668219938874, 0.0002655923890415579...
1    (0.00015607803652528673, 0.0001314736582571640...
...

Name: Column1, dtype: object

Full stack trace: https://pastebin.com/8Myu8hNV

I also tried the other engine, pyarrow:

pyarrow.lib.ArrowInvalid: ('Could not convert (0.00030271668219938874, ..., 0.0002464042045176029)
  with type tuple: did not recognize Python value type when inferring an Arrow data type', 
  'Conversion failed for column UO-Pumpe with type object')

I then found this thread: https://github.com/dask/fastparquet/issues/458. It appears to be a bug in fastparquet, and it is supposed to work with pyarrow, but that fails for me as well.

I also tried a few things I found, such as infer_objects() and astype(float), but nothing has worked so far.

Does anyone know how I can save my data frame to Parquet?

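The cells of your data frame contain tuples of floats. That is an unusual cell type, and as the pyarrow error says, Arrow does not recognize tuples when it infers a data type.

So you need to give Arrow a little help in figuring out your data type. To do that, provide the table's schema explicitly.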

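import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "column1": [(1.0, 2.0), (3.0, 4.0, 5.0)]
    }
)
# Declare the column as a list of float64 so Arrow does not have to infer it
schema = pa.schema([pa.field('column1', pa.list_(pa.float64()))])
# The schema is forwarded to pyarrow, so use the pyarrow engine
df.to_parquet('/tmp/hello.pq', engine='pyarrow', schema=schema)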

Note that if you use lists of floats (instead of tuples), it works without a schema:

import pandas as pd

df = pd.DataFrame(
    {
        "column1": [[1.0, 2.0], [3.0, 4.0, 5.0]]
    }
)
df.to_parquet('/tmp/hello.pq')
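
If the data frame already exists with tuples in it, as in the question, one option is to turn the tuples into lists before saving. This is a minimal sketch; `tuples_to_lists` is a hypothetical helper name, and it assumes the tuple cells live in object-dtype columns.

```python
import pandas as pd


def tuples_to_lists(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which tuple cells of object columns become lists."""
    out = df.copy()
    for name in out.columns:
        if out[name].dtype == object:
            # Leave non-tuple cells untouched.
            out[name] = out[name].map(
                lambda cell: list(cell) if isinstance(cell, tuple) else cell
            )
    return out


df = pd.DataFrame({"Column1": [(1.0, 2.0), (3.0, 4.0, 5.0)]})
converted = tuples_to_lists(df)
```

`converted` can then be passed to `to_parquet` in the same way as the list-based frame above, and the original `df` is left unchanged.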