Is there a smart way to apply an ordinal encoder (based on different categories) to multiple variables?
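I have several variables with text values that I want to convert to numeric values with an ordinal encoder. However, these variables follow different ordering logic. For example: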
import pandas as pd
import numpy as np
d = {"attr1": ["Excellent", "Fair", "Fair", "Good", "Poor"],
     "attr4": ["Fair", "Good", "Good", "Excellent", "Excellent"],
     "attr2": ["Finished", "Unfinished", "Partially Finished", "Finished", "Unfinished"],
     "attr3": ["Satisfied", "Unsatisfied", "Unsatisfied", "Satisfied", "Satisfied"]}
data = pd.DataFrame(data = d)
Note that "attr1" and "attr4" share the same unique values. To convert the text values to numbers:
from sklearn.preprocessing import OrdinalEncoder
# Assign attributes to different lists based on the values
attr_list1 = ["attr1", "attr4"]
attr_list2 = ["attr2"]
attr_list3 = ["attr3"]
# Create categories to instruct how ordinal encoder should work
cat1 = ["Poor", "Fair", "Good", "Excellent"]
cat2 = ["Unfinished", "Partially Finished", "Finished"]
cat3 = ["Unsatisfied", "Satisfied"]
# Initialize the encoder
encoder1 = OrdinalEncoder(categories = [cat1])
encoder2 = OrdinalEncoder(categories = [cat2])
encoder3 = OrdinalEncoder(categories = [cat3])
def ord_encode(attr_list, encoder):
    # Refit the encoder on each column in turn and overwrite it in place
    for attr in attr_list:
        data[attr] = encoder.fit_transform(data[[attr]])
    return data
data = ord_encode(attr_list1, encoder1)
data = ord_encode(attr_list2, encoder2)
data = ord_encode(attr_list3, encoder3)
I find my solution inefficient and clumsy. Imagine I had more than 20 attributes and 4 or 5 different sets of categories. Is there a smarter way to solve this?
Thanks.
sklearn-pandas can do this for you quickly. I would build a mapping from each column to its categories and then use DataFrameMapper to create the pipeline for me.
column_to_cat = {
    "attr1": cat1,
    "attr4": cat1,
    "attr2": cat2,
    "attr3": cat3
}
mapper_df = DataFrameMapper(
    [
        ([col], OrdinalEncoder(categories = [cat])) for col, cat in column_to_cat.items()
    ],
    df_out=True
)
mapper_df.fit_transform(data.copy())
Full code:
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OrdinalEncoder
d = {"attr1": ["Excellent", "Fair", "Fair", "Good", "Poor"],
     "attr4": ["Fair", "Good", "Good", "Excellent", "Excellent"],
     "attr2": ["Finished", "Unfinished", "Partially Finished", "Finished", "Unfinished"],
     "attr3": ["Satisfied", "Unsatisfied", "Unsatisfied", "Satisfied", "Satisfied"]}
data = pd.DataFrame(data = d)
# Create categories to instruct how ordinal encoder should work
cat1 = ["Poor", "Fair", "Good", "Excellent"]
cat2 = ["Unfinished", "Partially Finished", "Finished"]
cat3 = ["Unsatisfied", "Satisfied"]
# Map each column to the ordered category list it should be encoded with
column_to_cat = {
    "attr1": cat1,
    "attr4": cat1,
    "attr2": cat2,
    "attr3": cat3
}
# One (columns, transformer) pair per column; df_out=True returns a DataFrame
mapper_df = DataFrameMapper(
    [
        ([col], OrdinalEncoder(categories = [cat])) for col, cat in column_to_cat.items()
    ],
    df_out=True
)
mapper_df.fit_transform(data.copy())
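
If adding the sklearn-pandas dependency is not an option, a similar result can probably be had with scikit-learn's own ColumnTransformer. The following is a minimal sketch and not part of the answer above; the names `ct` and `encoded` are my own, and the array that ColumnTransformer returns is wrapped back into a DataFrame by hand.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({
    "attr1": ["Excellent", "Fair", "Fair", "Good", "Poor"],
    "attr4": ["Fair", "Good", "Good", "Excellent", "Excellent"],
    "attr2": ["Finished", "Unfinished", "Partially Finished", "Finished", "Unfinished"],
    "attr3": ["Satisfied", "Unsatisfied", "Unsatisfied", "Satisfied", "Satisfied"],
})

cat1 = ["Poor", "Fair", "Good", "Excellent"]
cat2 = ["Unfinished", "Partially Finished", "Finished"]
cat3 = ["Unsatisfied", "Satisfied"]

# Same column-to-category mapping as in the answer above
column_to_cat = {"attr1": cat1, "attr4": cat1, "attr2": cat2, "attr3": cat3}

# One (name, transformer, columns) triple per column; the column name
# doubles as the transformer name.
ct = ColumnTransformer(
    [(col, OrdinalEncoder(categories=[cat]), [col])
     for col, cat in column_to_cat.items()]
)

# Output columns follow the order of the transformer list
encoded = pd.DataFrame(
    ct.fit_transform(data),
    columns=list(column_to_cat),
    index=data.index,
)
print(encoded)
```

With 20 or more attributes only `column_to_cat` grows; the transformer list is still generated from it.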
If I understand the question correctly, there is a more concise way.
enc = OrdinalEncoder()
enc.fit(df[["Sex","Blood", "Study"]])
df[["Sex","Blood", "Study"]] = enc.transform(df[["Sex","Blood", "Study"]])
Source:
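
One caveat on the shorter approach: an OrdinalEncoder created without a `categories` argument sorts each column's values, so for string columns the codes follow alphabetical order rather than the intended ranking. A single encoder can still handle every column if it is given one category list per column, in column order. The sketch below applies that to the question's data; `cols`, `default_order`, and `encoded` are illustrative names of my own, not from either answer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({
    "attr1": ["Excellent", "Fair", "Fair", "Good", "Poor"],
    "attr4": ["Fair", "Good", "Good", "Excellent", "Excellent"],
    "attr2": ["Finished", "Unfinished", "Partially Finished", "Finished", "Unfinished"],
    "attr3": ["Satisfied", "Unsatisfied", "Unsatisfied", "Satisfied", "Satisfied"],
})

cat1 = ["Poor", "Fair", "Good", "Excellent"]
cat2 = ["Unfinished", "Partially Finished", "Finished"]
cat3 = ["Unsatisfied", "Satisfied"]
column_to_cat = {"attr1": cat1, "attr4": cat1, "attr2": cat2, "attr3": cat3}

# Default behaviour: categories are inferred from the data and sorted,
# so "Excellent" would be encoded as 0 and "Poor" as 3.
default_order = list(OrdinalEncoder().fit(data[["attr1"]]).categories_[0])

# Explicit ordering: one category list per column, in the same order as `cols`
cols = list(column_to_cat)
enc = OrdinalEncoder(categories=[column_to_cat[c] for c in cols])
encoded = pd.DataFrame(enc.fit_transform(data[cols]), columns=cols, index=data.index)
print(encoded)
```

This keeps the one-encoder brevity of the second answer while still respecting the per-column ordering the question asks for.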