When I try to concatenate the dummy columns and the scaled-variable columns, 111 values in every column come out as NaN
Here is my data:
data_preprocessed = pd.read_csv("Absenteeism_preprocessed.csv")
data_preprocessed.head()
Reason_1 Reason_2 Reason_3 Reason_4 Month Day of the week Transportation Expense
0 0 0 0 1 7 1 289
1 0 0 0 0 7 1 118
2 0 0 0 1 7 2 179
3 1 0 0 0 7 3 279
4 0 0 0 1 7 3 289
This is only part of my data; sorry, I can't upload all of it...
So my data has no null values here.
I set every value above the median to 1 and the rest to 0:
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] >
                   data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours', 'Day of the week',
                                            'Distance to Work', 'Daily Work Load Average'], axis=1)
The shape of my data:
data_with_targets.shape
(700, 12)
My inputs:
unscaled_inputs = data_with_targets.iloc[:, :-1]
The column order of the data:
order = unscaled_inputs.columns.values
array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month',
'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
'Children', 'Pets'], dtype=object)
Then I performed the train_test_split:
from sklearn.model_selection import train_test_split
train_test_split(unscaled_inputs,targets)
x_train, x_test, y_train, y_test = train_test_split(unscaled_inputs, targets,
                                                    train_size=0.8, random_state=50)
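One detail about this step that turns out to matter later: train_test_split shuffles the rows, so the resulting x_train keeps the original row labels of the sampled rows rather than a fresh 0..n-1 index. A minimal sketch with toy data (names here are illustrative, not from my dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for unscaled_inputs and targets
toy_inputs = pd.DataFrame({"a": range(10)})
toy_targets = np.arange(10)

x_tr, x_te, y_tr, y_te = train_test_split(toy_inputs, toy_targets,
                                          train_size=0.8, random_state=50)

# The split shuffles rows, so x_tr carries the ORIGINAL labels of the
# selected rows (a scattered subset of 0..9), not a fresh RangeIndex
print(list(x_tr.index))
```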
Then I split my data into two groups of DataFrames, the dummies in some DFs and the numeric values in others, so that I could scale the numeric values:
new_unscaled_inputs = x_train.loc[:,"Month":"Body Mass Index"]
new_unscaled_inputs_2 = x_train.loc[:,"Children":"Pets"]
dummy_1 = x_train.loc[:,"Reason_1":"Reason_4"]
dummy_2 = x_train.loc[:,"Education"]
Concatenate the two numeric-variable DataFrames:
new_unscaled_var = pd.concat([new_unscaled_inputs,new_unscaled_inputs_2],axis=1)
No null values:
new_unscaled_var.isnull().sum()
Month 0
Transportation Expense 0
Age 0
Body Mass Index 0
Children 0
Pets 0
dtype: int64
Scale the numeric values:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_new = scaler.fit_transform(new_unscaled_var)
scaled_new
scaled_df = pd.DataFrame(scaled_new, columns=['Month', 'Transportation Expense', 'Age',
                                              'Body Mass Index', 'Children', 'Pets'])
scaled_df.isnull().sum()
No null values:
Month 0
Transportation Expense 0
Age 0
Body Mass Index 0
Children 0
Pets 0
dtype: int64
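Another detail hiding in the step above: fit_transform returns a plain NumPy array, so rebuilding a DataFrame from it gives scaled_df a fresh 0..n-1 RangeIndex unless the original index is passed back in. A minimal sketch (toy values, not my data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A frame with shuffled row labels, like x_train after train_test_split
df = pd.DataFrame({"v": [10.0, 20.0, 30.0]}, index=[7, 3, 42])

arr = StandardScaler().fit_transform(df)            # ndarray: no index attached
rebuilt = pd.DataFrame(arr, columns=df.columns)     # index silently becomes 0, 1, 2
preserved = pd.DataFrame(arr, columns=df.columns, index=df.index)

print(list(rebuilt.index))    # [0, 1, 2]
print(list(preserved.index))  # [7, 3, 42]
```

Passing index=df.index when rebuilding keeps the rows aligned with the frames that were never converted to arrays.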
Now I concatenate the dummy columns into one DataFrame:
dummy_df = pd.concat([dummy_1, dummy_2],axis=1)
Still no null values:
dummy_df.isnull().sum()
Reason_1 0
Reason_2 0
Reason_3 0
Reason_4 0
Education 0
dtype: int64
scaled_df.shape
dummy_df.shape
Now, when I concatenate the scaled_df columns and the dummy_df columns, I get 111 null values in every column:
scaled_inputs = pd.concat([scaled_df, dummy_df],axis = 1)
scaled_inputs.isnull().sum()
Month 111
Transportation Expense 111
Age 111
Body Mass Index 111
Children 111
Pets 111
Reason_1 111
Reason_2 111
Reason_3 111
Reason_4 111
Education 111
dtype: int64
I don't understand why. Please help me understand this.
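A minimal sketch of what is presumably happening: pd.concat with axis=1 aligns rows by index label, not by position. scaled_df was rebuilt from a NumPy array and got a fresh 0..559 index, while dummy_df kept the shuffled labels from train_test_split; every label present in only one of the two frames produces NaN in the other frame's columns (toy values below, not my data):

```python
import pandas as pd

scaled = pd.DataFrame({"x": [1.0, 2.0, 3.0]})               # fresh index 0, 1, 2
dummies = pd.DataFrame({"d": [1, 0, 1]}, index=[5, 0, 9])   # shuffled labels

out = pd.concat([scaled, dummies], axis=1)

# The result is indexed by the UNION of labels {0, 1, 2, 5, 9}:
# rows 5 and 9 have no 'x' value, rows 1 and 2 have no 'd' value
print(out)
print(out.isnull().sum())
```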
I got the answer of using reset_index(drop=True) from another Stack Overflow thread:
Link
scaled_df = scaled_df.reset_index(drop=True)
scaled_df.head()
Month Transportation Expense Age Body Mass Index Children Pets
0 0.454628 -0.996453 -1.131382 -1.084160 -0.925174 -0.595121
1 -1.261719 -0.666968 -0.977228 -1.771995 -0.925174 -0.595121
2 0.740685 0.171723 1.026777 2.584296 -0.016231 -0.595121
3 1.598859 0.366419 1.643394 1.208625 0.892711 0.223719
4 -0.403546 -1.026407 -0.360611 -0.396324 0.892711 -0.595121
dummy_df = dummy_df.reset_index(drop=True)
dummy_df.head()
Reason_1 Reason_2 Reason_3 Reason_4 Education
0 0 0 1 0 0
1 0 0 0 1 1
2 0 0 0 0 0
3 0 0 0 1 0
4 0 0 0 1 0
And here it works perfectly!!!!!
scaled_inputs = pd.concat([scaled_df, dummy_df],axis = 1)
scaled_inputs.isnull().sum()
Month 0
Transportation Expense 0
Age 0
Body Mass Index 0
Children 0
Pets 0
Reason_1 0
Reason_2 0
Reason_3 0
Reason_4 0
Education 0
dtype: int64
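One caveat worth noting about the reset_index fix: it returns a new DataFrame rather than modifying the original in place, so the result must be assigned back (or inplace=True used) for the concat to see the new index. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=[5, 9])

df.reset_index(drop=True)        # result discarded: df is unchanged
print(list(df.index))            # [5, 9]

df = df.reset_index(drop=True)   # assign back to keep the fresh index
print(list(df.index))            # [0, 1]
```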