应用 dask groupby 中的元数据顺序

order of metadata in dask groupby apply

突然出现错误:“ValueError:计算数据中的列与提供的元数据中的列不匹配 列的顺序不匹配

这对我来说没有意义,因为我确实提供了正确的元数据。它不是按字典提供的那样排序的。

下面是一个最小的工作示例:

from datetime import date
import pandas as pd
import numpy as np
from dask import delayed
import dask.dataframe as dsk

# Making example data
values = pd.DataFrame({'date' : [date(2020,1,1), date(2020,1,1), date(2020,1,2), date(2020,1,2)], 'id' : [1,2,1,2], 'A': [4,5,2,2], 'B':[7,3,6,1]})
def get_dates():
    return pd.DataFrame({'date' : [date(2020,1,1), date(2020,1,1), date(2020,1,2), date(2020,1,2)]})
def append_values(df):
    df2 = pd.merge(df, values, on = 'date', how = 'left')
    return df2
t0 = pd.DataFrame({'date' : [date(2020,1,1), date(2020,1,1), date(2020,1,2), date(2020,1,2)]})
t1 = delayed(t0)
t2 = dsk.from_delayed(t1)
t = t2.map_partitions(append_values, meta = {'A' : 'f8', 'B': 'i8', 'id' : 'i8', 'date' : 'object'}, enforce_metadata = False)

# Applying a grouped function.
def func(x,y):
    return pd.DataFrame({'summ' : [np.mean(x) + np.mean(y)], 'difference' : [int(np.floor(np.mean(x) - np.mean(y)))]})

# Everything works when I compute the dataframe before doing the apply. But I want to distribute the apply so I dont like this option.
res = t.compute().groupby(['date']).apply(lambda df: func(df['A'], df['B']))
# This fails as the meta is out of order. But the meta is in a dict and is hence not supposted to be ordered anyway!
res = t.groupby(['date']).apply(lambda df: func(df['A'], df['B'])).compute()

我在这里做错了什么,我该如何解决?虽然一种解决方法是在执行分组操作之前进行计算,但这对于我的实际情况(有太多数据无法保存在 RAM 中)是不可行的。

另一个可能相关但我认为不是的问题:ValueError: The columns in the computed data do not match the columns in the provided metadata。这似乎与使用 dask

解析 csv 有关

提供给 metadict 中的键顺序似乎很重要。如下更改顺序,只会产生警告:

    # changing the order of keys in this dict
    # meta={"date": "object", "id": "i8", "B": "i8", "A": "f8", },
    meta={"date": "object", "id": "i8", "A": "f8", "B": "i8"},

我的猜测是 Dask 在内部使用键的顺序来构造元数据框,但不太确定。问题是在 t.compute() 之后 df 是 pandas 数据帧,所以后续的 groupby 知道要选择哪些列(不依赖于顺序),而在 .compute 之前,数据帧仍然是一个 dask 数据帧(懒惰)和 dask 正在尝试查找具有元中给定顺序的列(然后发现不匹配)...