Why does dask_ml.preprocessing.OrdinalEncoder.transform produce a result that is not ordinally encoded?

I am confused by the result of dask_ml.preprocessing.OrdinalEncoder.transform:

import string

import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask_ml.preprocessing import OrdinalEncoder as DaskOrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder

N = 10
np.random.seed(1234)

df = pd.DataFrame({
    "cat1": np.random.choice(list(string.ascii_uppercase)[0:3], size=N),
    "cat2": np.random.choice(list(string.ascii_uppercase)[0:3], size=N),
})
df_dd = dd.from_pandas(df, npartitions=3)

The original OrdinalEncoder.transform returns a numpy.ndarray of numeric values:

>>> OrdinalEncoder().fit_transform(df)
array([[2., 2.],
       [1., 0.],
       [0., 0.],
       [0., 2.],
       [0., 2.],
       [1., 2.],
       [1., 0.],
       [1., 0.],
       [2., 0.],
       [2., 1.]])

The dask-ml counterpart not only breaks the interface by returning a pandas.DataFrame, it simply returns the initial input DataFrame:

>>> DaskOrdinalEncoder().fit_transform(df_dd).compute().equals(df)
True

I would expect a (pandas or Dask) DataFrame or a (NumPy or Dask) array containing numeric values similar to those produced by the sklearn OrdinalEncoder.

df_dd = df_dd.categorize(columns=["cat1", "cat2"])

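The columns need to be marked as categorical before applying OrdinalEncoder. By default the dask-ml encoder only encodes columns that have a categorical dtype, so plain string columns pass through unchanged, which is why the output equals the input.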

Note: This is explained in the Dask ML documentation. The shape of the transformed Dask DataFrame needs to be known, and using the Categorical datatype allows for this. That is not the case if you just leave the data as strings.

Why is the shape important? Dask DataFrame (df_dd) needs the shape to know how many columns the transformed data will have, since all partitions of a Dask DataFrame must have the same number of columns. With the str datatype, Dask cannot know how many columns to expect after the transformation, because that depends on the values present in the data. With the Categorical dtype, Dask knows exactly which categories (column encodings) will be produced.

An example pipeline using OneHotEncoder, with a more detailed explanation, is also found in the Dask ML documentation. Similar reasoning applies to OrdinalEncoder.
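The partition argument can be illustrated with pandas alone. The sketch below is only an analogy: it splits one column by hand into two pieces (standing in for Dask partitions, which is an assumption of this example) and compares what each piece produces with string data versus a categorical dtype.

```python
import pandas as pd

full = pd.Series(["A", "B", "C", "A"], name="cat1")
# Two hand-made "partitions" of the same column.
part1, part2 = full.iloc[:2], full.iloc[2:]

# Plain strings: each partition only sees its own labels,
# so one-hot encoding yields different columns per partition.
str_cols1 = pd.get_dummies(part1).columns.tolist()
str_cols2 = pd.get_dummies(part2).columns.tolist()
print(str_cols1)  # ['A', 'B']
print(str_cols2)  # ['A', 'C']

# Categorical dtype: the full category list is part of the dtype,
# so every partition produces the same columns.
cat_dtype = pd.CategoricalDtype(categories=["A", "B", "C"])
cat1, cat2 = part1.astype(cat_dtype), part2.astype(cat_dtype)
cat_cols1 = pd.get_dummies(cat1).columns.tolist()
cat_cols2 = pd.get_dummies(cat2).columns.tolist()
print(cat_cols1)  # ['A', 'B', 'C']
print(cat_cols2)  # ['A', 'B', 'C']

# Ordinal codes are likewise consistent across partitions.
codes1 = cat1.cat.codes.tolist()
codes2 = cat2.cat.codes.tolist()
print(codes1)  # [0, 1]
print(codes2)  # [2, 0]
```

With the categorical dtype, "A" maps to 0 in both pieces even though the second piece never sees "B".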