Is there a smart way to apply an ordinal encoder (based on different categories) to multiple variables?

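I have several variables with text values that I want to convert to numeric values with an ordinal encoder. However, these variables follow different ordering logic. For example: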

import pandas as pd
import numpy as np
d = {"attr1":["Excellent", "Fair", "Fair", "Good", "Poor"],
     "attr4":["Fair", "Good", "Good", "Excellent", "Excellent"],
     "attr2":["Finished", "Unfinished", "Partially Finished", "Finished", "Unfinished"],
     "attr3":["Satisfied", "Unsatisfied", "Unsatisfied", "Satisfied", "Satisfied"]}
data = pd.DataFrame(data = d)

You will notice that "attr1" and "attr4" share the same set of unique values. To convert the text values to numbers:

from sklearn.preprocessing import OrdinalEncoder
# Assign attributes to different lists based on the values
attr_list1 = ["attr1", "attr4"]
attr_list2 = ["attr2"]
attr_list3 = ["attr3"]

# Create categories to instruct how ordinal encoder should work
cat1 = ["Poor", "Fair", "Good", "Excellent"]
cat2 = ["Unfinished", "Partially Finished", "Finished"]
cat3 = ["Unsatisfied", "Satisfied"]

# Initialize the encoder
encoder1 = OrdinalEncoder(categories = [cat1])
encoder2 = OrdinalEncoder(categories = [cat2])
encoder3 = OrdinalEncoder(categories = [cat3])

def ord_encode(attr_list, encoder):
    for attr in attr_list:
        data[attr] = encoder.fit_transform(data[[attr]])
    return data

data = ord_encode(attr_list1, encoder1)
data = ord_encode(attr_list2, encoder2)
data = ord_encode(attr_list3, encoder3)

I find my solution inefficient and clumsy. Imagine I have 20+ attributes and 4 or 5 different category orderings. Is there a smarter way to solve this?

Thanks.

sklearn-pandas can do this for you quickly. I would build a mapping from each column to its categories and then use DataFrameMapper to create the pipeline for me.

column_to_cat = {
    "attr1": cat1,
    "attr4": cat1,
    "attr2": cat2,
    "attr3": cat3
}

mapper_df = DataFrameMapper(
    [
        ([col], OrdinalEncoder(categories = [cat])) for col, cat in column_to_cat.items()
    ],
    df_out=True
)
mapper_df.fit_transform(data.copy())

Full code:

import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OrdinalEncoder
d = {"attr1":["Excellent", "Fair", "Fair", "Good", "Poor"],
     "attr4":["Fair", "Good", "Good", "Excellent", "Excellent"],
     "attr2":["Finished", "Unfinished", "Partially Finished", "Finished", "Unfinished"],
     "attr3":["Satisfied", "Unsatisfied", "Unsatisfied", "Satisfied", "Satisfied"]}
data = pd.DataFrame(data = d)

# Create categories to instruct how ordinal encoder should work
cat1 = ["Poor", "Fair", "Good", "Excellent"]
cat2 = ["Unfinished", "Partially Finished", "Finished"]
cat3 = ["Unsatisfied", "Satisfied"]

# Assign attributes to different lists based on the values
column_to_cat = {
    "attr1": cat1,
    "attr4": cat1,
    "attr2": cat2,
    "attr3": cat3
}


mapper_df = DataFrameMapper(
    [
        ([col], OrdinalEncoder(categories = [cat])) for col, cat in column_to_cat.items()
    ],
    df_out=True
)
mapper_df.fit_transform(data.copy())
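
With 20+ attributes, typing out every entry of column_to_cat by hand becomes tedious. One possible way around that, as a minimal sketch: list each ordered scale once, list the columns that use it, and derive the per-column mapping from those two. The names `scales` and `scale_to_columns` and the scale labels ("quality", "progress", "satisfaction") are hypothetical and only for illustration; they are not part of sklearn-pandas or scikit-learn.

```python
# Hypothetical names: each ordered scale is defined exactly once.
scales = {
    "quality": ["Poor", "Fair", "Good", "Excellent"],
    "progress": ["Unfinished", "Partially Finished", "Finished"],
    "satisfaction": ["Unsatisfied", "Satisfied"],
}

# Hypothetical names: the columns that follow each scale.
scale_to_columns = {
    "quality": ["attr1", "attr4"],
    "progress": ["attr2"],
    "satisfaction": ["attr3"],
}

# Expand the groups into the column -> categories mapping used above.
column_to_cat = {
    col: scales[scale_name]
    for scale_name, cols in scale_to_columns.items()
    for col in cols
}
```

The resulting column_to_cat can be passed to the DataFrameMapper comprehension unchanged; adding a new attribute then only means appending its name to the right list.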

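If I understand the question correctly, there is an even more concise way.

    # df is assumed to be a DataFrame containing the text columns "Sex", "Blood" and "Study"
    enc = OrdinalEncoder()
    enc.fit(df[["Sex","Blood", "Study"]])
    df[["Sex","Blood", "Study"]] = enc.transform(df[["Sex","Blood", "Study"]])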
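
Note that a default `OrdinalEncoder()` infers its categories from the data and sorts them, so for the question's columns "Excellent" would come before "Fair" rather than after "Good". The `categories` parameter accepts one list per column, so a single encoder can still handle all columns while respecting each custom order. The following is a sketch under that assumption, reusing the question's data and the column_to_cat mapping from the other answer; the variable names `cols` and `encoded` are mine.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({
    "attr1": ["Excellent", "Fair", "Fair", "Good", "Poor"],
    "attr4": ["Fair", "Good", "Good", "Excellent", "Excellent"],
    "attr2": ["Finished", "Unfinished", "Partially Finished", "Finished", "Unfinished"],
    "attr3": ["Satisfied", "Unsatisfied", "Unsatisfied", "Satisfied", "Satisfied"],
})

cat1 = ["Poor", "Fair", "Good", "Excellent"]
cat2 = ["Unfinished", "Partially Finished", "Finished"]
cat3 = ["Unsatisfied", "Satisfied"]
column_to_cat = {"attr1": cat1, "attr4": cat1, "attr2": cat2, "attr3": cat3}

# One list of categories per column, in the same order as the columns passed to fit.
cols = list(column_to_cat)
enc = OrdinalEncoder(categories=[column_to_cat[c] for c in cols])

# Build a new frame so the original text columns are left untouched.
encoded = pd.DataFrame(
    enc.fit_transform(data[cols]), columns=cols, index=data.index
)
```

Each value is replaced by its position in the corresponding category list, so "Poor" maps to 0.0 and "Excellent" to 3.0 in both attr1 and attr4.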
