Problems implementing Dask MinMaxScaler
I'm having trouble using dask_ml.preprocessing.MinMaxScaler to normalize a dask.dataframe.core.DataFrame. I can use sklearn.preprocessing.MinMaxScaler, but I'd like to use Dask so that it scales.
Minimal, reproducible example:
import dask.dataframe as dd
# Get data
ddf = dd.read_csv('test.csv') # See below
ddf = ddf.set_index('index')
# Pivot
ddf = ddf.categorize(columns=['item', 'name'])
ddf_p = ddf.pivot_table(index='item', columns='name', values='value', aggfunc='mean')
col = ddf_p.columns.to_list()
# sklearn version
from sklearn.preprocessing import MinMaxScaler
scaler_s = MinMaxScaler()
scaled_ddf_s = scaler_s.fit_transform(ddf_p[col]) # Works!
# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
scaled_values_d = scaler_d.fit_transform(ddf_p[col]) # Doesn't work
Error message:
TypeError: Categorical is not ordered for operation min
you can use .as_ordered() to change the Categorical to an ordered one
I'm not sure what the 'Categorical' in the pivot table is, but I tried calling .as_ordered() on the index:
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p = ddf_p.index.cat.as_ordered()
scaled_values_d = scaler_d.fit_transform(ddf_p[col])
But I get the error message:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
Additional information
test.csv:
index,item,name,value
2015-01-01,item_1,A,1
2015-01-01,item_1,B,2
2015-01-01,item_1,C,3
2015-01-01,item_1,D,4
2015-01-01,item_1,E,5
2015-01-02,item_2,A,10
2015-01-02,item_2,B,20
2015-01-02,item_2,C,30
2015-01-02,item_2,D,40
2015-01-02,item_2,E,50
Looking at a related answer:
pivot_table produces a column index which is categorical because you made the original column "Field" categorical. Writing the index to parquet calls reset_index on the data frame, and pandas cannot add a new value to the columns index, because it is categorical. You can avoid this using ddf.columns = list(ddf.columns).
So adding ddf_p.columns = list(ddf_p.columns) solved the problem:
# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p.columns = list(ddf_p.columns)
scaled_values_d = scaler_d.fit_transform(ddf_p[col]) # Works!
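To see why resetting the column labels helps, here is a minimal sketch in plain pandas (not Dask, so it runs without a cluster or the CSV file). It assumes the pandas behaviour matches what Dask does per partition, and it builds the test.csv rows inline. The variable names are only for illustration.

```python
import pandas as pd

# Same rows as test.csv, built inline so the sketch is self-contained.
df = pd.DataFrame({
    "item": ["item_1"] * 5 + ["item_2"] * 5,
    "name": list("ABCDE") * 2,
    "value": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})
# Mimic ddf.categorize(): the column used for `columns=` becomes categorical.
df["name"] = df["name"].astype("category")

pivoted = df.pivot_table(index="item", columns="name", values="value",
                         aggfunc="mean", observed=False)

# The pivoted frame inherits a CategoricalIndex as its column index.
categorical_before = isinstance(pivoted.columns, pd.CategoricalIndex)
print(type(pivoted.columns).__name__, categorical_before)

# Replacing the labels with a plain list drops the categorical dtype.
pivoted.columns = list(pivoted.columns)
categorical_after = isinstance(pivoted.columns, pd.CategoricalIndex)
column_labels = list(pivoted.columns)
print(type(pivoted.columns).__name__, categorical_after)
```

The column labels themselves are unchanged; only the index type differs, which appears to be what lets the min/max reduction in the scaler go through.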