LabelEncoder and OneHotEncoder within the same for loop

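I am writing a for loop to encode all the values in my dataset. I have many categorical columns. The loop originally worked with only the LabelEncoder, but I am now trying to include a OneHotEncoder in it instead of calling get_dummies on a separate line.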

Sample data:

               STYP_DESC  Gender       RACE_DESC DEGREE               MAJR_DESC1 FTPT  Target
0                   New  Female           White     BA  Business Administration   FT       1
1  New 1st Time Freshmn  Female           White     BA               Studio Art   FT       1
2                   New    Male           White   MBAX  Business Administration   FT       1
3                   New  Female         Unknown     JD             Juris Doctor   PT       1
4                   New  Female  Asian-American   MBAX  Business Administration   PT       1       

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore',drop='first')
le_count = 0
enc_count = 0
for col in X_train.columns[1:]:
    if X_train[col].dtype == 'object':
        if len(list(X_train[col].unique())) <= 2:
            le.fit(X_train[col])
            X_train[col] = le.transform(X_train[col])
            le_count += 1
        else:
            enc.fit(X_train[[col]])
            X_train[[col]] = enc.transform(X_train[[col]])
            enc_count +=1
print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))

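When I run it I do not get an error, but the encoding looks very strange: a large number of tuples are inserted into my new dataset.

When I run the code without the else clause, it runs fine, and I can simply use get_dummies to encode the other variables.

The only problem is that when I use get_dummies with drop_first set to True, I lose track of which category became 0 and which became 1. (This is mainly an issue for keeping track of Gender and FTPT.)

Any suggestions? I would use get_dummies, but since I am doing the preprocessing after splitting the data, I am worried that a category might get dropped.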

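OneHotEncoder.transform returns a SciPy sparse matrix by default, which is why tuple-like entries end up in your DataFrame. Change the transform line in the else branch so that it converts the result to a dense array:

X_train[col] = enc.transform(X_train[[col]]).toarray()

Keep in mind that the dense array has one column per encoded category. If it has more than one column, recent pandas versions raise a ValueError when it is assigned to a single column, so in that case the encoded columns have to be added to the frame as separate columns.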
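To make the difference between the sparse and the dense result concrete, here is a minimal sketch. The `demo` frame is a hypothetical toy column, not the asker's data, and `handle_unknown` is left at its default to keep the example small.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with three categories (not the asker's data).
demo = pd.DataFrame({'DEGREE': ['BA', 'BA', 'MBAX', 'JD', 'MBAX']})

enc = OneHotEncoder(drop='first')
encoded = enc.fit_transform(demo[['DEGREE']])

# By default the encoder returns a SciPy sparse matrix, not a NumPy array.
is_sparse = sparse.issparse(encoded)

# .toarray() converts it to a dense 2-D array: one row per sample,
# one column per category that remains after drop='first'.
dense = encoded.toarray()
dense_shape = dense.shape

# Categories are sorted, and drop='first' removes the first one.
kept = list(enc.categories_[0][1:])

print(is_sparse, dense_shape, kept)
```

With three categories the dense array has two columns, which is why it cannot simply replace one original column.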

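Here is the complete code, which you can try directly. If you still see the problem, the error is probably in another part of your code, so please check that.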

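import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

styp = ['New','New 1st Time Freshmn','New','New','New']
gend = ['Female','Female','Male','Female','Female']
race = ['White','White','Unknown','Unknown','Asian-American']
deg = ['BA','BA','MBAX','JD','MBAX']
maj = ['Business Administration','Studio Art','Business Administration','Juris Doctor','Business Administration']
ftpt = ['FT','FT','FT','PT','PT']

df = pd.DataFrame({'STYP_DESC':styp, 'Gender':gend, 'RACE_DESC':race, 'DEGREE':deg,
                   'MAJR_DESC1':maj, 'FTPT':ftpt})

le = LabelEncoder()
# Combining handle_unknown='ignore' with drop needs scikit-learn 1.0 or later.
enc = OneHotEncoder(handle_unknown='ignore', drop='first')

le_count = 0
enc_count = 0

# Loop over a snapshot of the column names, because the loop changes df's columns.
for col in list(df.columns[1:]):
    if df[col].dtype == 'object':
        if df[col].nunique() <= 2:
            # Binary column: the two categories become 0 and 1 (sorted order).
            le.fit(df[col])
            df[col] = le.transform(df[col])
            le_count += 1
        else:
            # transform() returns a sparse matrix; .toarray() makes it dense.
            # The dense result has one column per kept category, so it is
            # joined as new columns instead of being assigned to df[col].
            enc.fit(df[[col]])
            dense = enc.transform(df[[col]]).toarray()
            names = enc.get_feature_names_out()
            onehot = pd.DataFrame(dense, columns=names, index=df.index)
            df = df.drop(columns=col).join(onehot)
            enc_count += 1
print(df)
print('{} columns were label encoded and {} columns were 1-hot encoded'.format(le_count, enc_count))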
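
On the two worries from the question (which category became 0 or 1, and a category going missing after the split), one possible approach is to fit the encoders on the training data only, read the mapping from the fitted encoders, and reuse them on the test data. The sketch below assumes a hypothetical `train`/`test` split and a hypothetical helper `encode`; neither comes from the original post.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical split: the test rows happen to contain no 'JD' degree.
train = pd.DataFrame({'Gender': ['Female', 'Female', 'Male', 'Female'],
                      'DEGREE': ['BA', 'MBAX', 'JD', 'MBAX']})
test = pd.DataFrame({'Gender': ['Male', 'Female'],
                     'DEGREE': ['MBAX', 'BA']})

# Binary column: fit on the training data only and record the mapping.
# classes_ is sorted, and the position of each label is its integer code.
le = LabelEncoder().fit(train['Gender'])
gender_mapping = {label: int(code) for code, label in enumerate(le.classes_)}

# Multi-category column: fit on train, then reuse the fitted encoder on test.
enc = OneHotEncoder(drop='first').fit(train[['DEGREE']])
dropped = enc.categories_[0][0]   # the reference category removed by drop='first'
names = [f'DEGREE_{cat}' for cat in enc.categories_[0][1:]]

def encode(frame):
    """Return a copy of frame with Gender label encoded and DEGREE one-hot encoded."""
    out = frame.drop(columns='DEGREE')
    out['Gender'] = le.transform(frame['Gender'])
    onehot = pd.DataFrame(enc.transform(frame[['DEGREE']]).toarray(),
                          columns=names, index=frame.index)
    return out.join(onehot)

train_enc = encode(train)
test_enc = encode(test)

print(gender_mapping)
print(dropped, names)
print(test_enc)
```

Because the encoders are fitted once on the training data, the test frame gets the same columns even though it contains no 'JD' row, and `gender_mapping` and `dropped` record which category stands for 0 and which one was left out.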