Dask - How to concatenate Series into a DataFrame with apply?
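How do I return multiple values from a function applied to a Dask Series?
I am trying to return a Series from each iteration of dask.Series.apply, so that the final result is a dask.DataFrame.
The code below tells me that the meta is wrong. However, the all-pandas version works. What is wrong here?
Update: I think I am not specifying the meta/schema correctly. How do I do it correctly?
Now, when I remove the meta argument, it works. However, it raises a warning. I would like to use dask "correctly".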
import dask.dataframe as dd
import pandas as pd
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

def transformMyCol(x):
    # Minimal example function: returns three values per input element
    return pd.Series(['Tom - ' + str(x), 'Deskflip - ' + str(x / 8), ''])

#
## Pandas Version - Works as expected.
#
pandas_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
pandas_df.target.apply(transformMyCol)

#
## Dask Version (second attempt) - Raises a warning
#
df = dd.from_pandas(pandas_df, npartitions=10)
unpacked = df.target.apply(transformMyCol)
unpacked.head()

#
## Dask Version (first attempt) - Raises an exception
#
df = dd.from_pandas(pandas_df, npartitions=10)
unpacked_dask_schema = {"name": str, "action": str, "comments": str}
unpacked = df.target.apply(transformMyCol, meta=unpacked_dask_schema)
unpacked.head()
This is the error I get:
File "/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 3693, in apply_and_enforce
raise ValueError("The columns in the computed data do not match"
ValueError: The columns in the computed data do not match the columns in the provided metadata
I also tried the following, but it does not work either.
meta_df = pd.DataFrame(dtype='str', columns=list(unpacked_dask_schema.keys()))
unpacked = df.target.apply(transformMyCol, meta=meta_df)
unpacked.head()
Same error:
File "/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 3693, in apply_and_enforce
raise ValueError("The columns in the computed data do not match"
ValueError: The columns in the computed data do not match the columns in the provided metadata
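You are right: the problem is that the meta is not specified correctly. More specifically, as the error message says, the meta columns ("name", "action", "comments") do not match the columns in the computed data (0, 1, 2). The computed columns are 0, 1, 2 because transformMyCol builds its Series from a plain list, so pandas assigns default integer labels. You should either:
- Change the meta columns to 0, 1, 2:
unpacked_dask_schema = dict.fromkeys(range(3), str)
df.target.apply(transformMyCol, meta=unpacked_dask_schema)
or
- Change transformMyCol to use named columns: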
def transformMyCol(x):
    # The dict keys become the Series labels, and therefore the output column names
    return pd.Series({
        'name': 'Tom - ' + str(x),
        'action': 'Deskflip - ' + str(x / 8),
        'comments': '',
    })