LabelEncoder and OneHotEncoder within the same for loop
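I'm writing a for loop to encode all the values in my dataset. I have a lot of categorical values. Initially the for loop worked with the label encoder alone, but I'm now trying to include a OneHotEncoder in the loop instead of calling get_dummies on a separate line.
Sample data: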
   STYP_DESC             Gender  RACE_DESC       DEGREE  MAJR_DESC1               FTPT  Target
0  New                   Female  White           BA      Business Administration  FT    1
1  New 1st Time Freshmn  Female  White           BA      Studio Art               FT    1
2  New                   Male    White           MBAX    Business Administration  FT    1
3  New                   Female  Unknown         JD      Juris Doctor             PT    1
4  New                   Female  Asian-American  MBAX    Business Administration  PT    1
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore',drop='first')
le_count = 0
enc_count = 0
for col in X_train.columns[1:]:
    if X_train[col].dtype == 'object':
        if len(list(X_train[col].unique())) <= 2:
            le.fit(X_train[col])
            X_train[col] = le.transform(X_train[col])
            le_count += 1
        else:
            enc.fit(X_train[[col]])
            X_train[[col]] = enc.transform(X_train[[col]])
            enc_count +=1
print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))
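When I run it I don't get an error, but the encoding looks very strange: a large number of tuples end up inserted into my new dataset.
When I run the code without anything in the else clause, it runs fine, and I can simply use get_dummies to encode the other variables.
The only problem is that when I use get_dummies with drop_first set to True, I lose track of which category became 0 and which became 1. (This is mainly an issue for keeping track of Gender and FTPT.)
Any suggestions? I would use get_dummies, but since I'm doing the preprocessing after splitting the data, I'm worried that a category might get dropped.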
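Change the transform line in the else branch as follows, so that the sparse matrix returned by transform is converted to a dense array: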
X_train[col] = enc.transform(X_train[[col]]).toarray()
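To see the difference on a small case, the sketch below encodes a made-up one-column frame (`demo` is a hypothetical name, not from the question) and compares the encoder's default output with the result of `.toarray()`. It assumes the encoder is left at its default of returning a SciPy sparse matrix, which is most likely where the "tuples" in the question came from.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical one-column frame, only for illustration
demo = pd.DataFrame({'DEGREE': ['BA', 'MBAX', 'JD', 'BA']})

enc = OneHotEncoder(drop='first')
encoded = enc.fit_transform(demo[['DEGREE']])

# By default the result is a SciPy sparse matrix, not a NumPy array
print(sparse.issparse(encoded))

# .toarray() turns it into a dense NumPy array of 0.0/1.0 values
dense = encoded.toarray()

# One column per category kept after drop='first'
print(dense.shape)
```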
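Here is the complete code, which you can try directly.
If you still see the problem, the error may be in another part of your code, so please check there.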
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Rebuild the sample data from the question
styp = ['New','New 1st Time Freshmn','New','New','New']
gend = ['Female','Female','Male','Female','Female']
race = ['White','White','Unknown','Unknown','Asian-American']
deg = ['BA','BA','MBAX','JD','MBAX']
maj = ['Business Administration','Studio Art','Business Administration','Juris Doctor','Business Administration']
ftpt = ['FT','FT','FT','PT','PT']
df = pd.DataFrame({'STYP_DESC':styp, 'Gender':gend, 'RACE_DESC':race,'DEGREE':deg,
                   'MAJR_DESC1':maj, 'FTPT':ftpt})

le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore',drop='first')
le_count = 0
enc_count = 0
for col in df.columns[1:]:
    if df[col].dtype == 'object':
        if len(list(df[col].unique())) <= 2:
            # Two or fewer categories: label encode in place
            le.fit(df[col])
            df[col] = le.transform(df[col])
            le_count += 1
        else:
            # One-hot encode; the dense result has one column per category kept after drop='first'
            enc.fit(df[[col]])
            df[col] = enc.transform(df[[col]]).toarray()
            enc_count +=1
print(df)
print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))