When I try to concatenate the dummy columns and the scaled-variable columns, 111 values in every column come out as NaN
Here is my data:
data_preprocessed = pd.read_csv("Absenteeism_preprocessed.csv")
data_preprocessed.head()
Reason_1 Reason_2 Reason_3 Reason_4 Month Day of the week Transportation Expense
0 0 0 0 1 7 1 289
1 0 0 0 0 7 1 118
2 0 0 0 1 7 2 179
3 1 0 0 0 7 3 279
4 0 0 0 1 7 3 289
This is only part of my data; sorry, I can't upload all of it...
So my data has no null values here.
I set every value above the median to 1 and the rest to 0:
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] >
                   data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours', 'Day of the week',
                                            'Distance to Work', 'Daily Work Load Average'], axis=1)
The shape of my data:
data_with_targets.shape
(700, 12)
My inputs:
unscaled_inputs = data_with_targets.iloc[:, :-1]
The column order of the data:
order = unscaled_inputs.columns.values
array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month',
'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
'Children', 'Pets'], dtype=object)
Then I performed the train_test_split:
from sklearn.model_selection import train_test_split
train_test_split(unscaled_inputs,targets)
x_train, x_test, y_train, y_test = train_test_split(unscaled_inputs, targets,
                                                    train_size=0.8, random_state=50)
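One detail about this step that turns out to matter later: train_test_split shuffles the rows, so the resulting x_train keeps the original row labels of the sampled rows rather than a fresh 0..n-1 index. A minimal sketch with toy data (names here are illustrative, not from my dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for unscaled_inputs and targets
toy_inputs = pd.DataFrame({"a": range(10)})
toy_targets = np.arange(10)

x_tr, x_te, y_tr, y_te = train_test_split(toy_inputs, toy_targets,
                                          train_size=0.8, random_state=50)

# The split shuffles rows, so x_tr carries the ORIGINAL labels of the
# selected rows (a scattered subset of 0..9), not a fresh RangeIndex
print(list(x_tr.index))
```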
Then I split my data into two groups of DataFrames, the dummies in some DFs and the numeric values in others, so that I could scale the numeric values:
new_unscaled_inputs = x_train.loc[:,"Month":"Body Mass Index"]
new_unscaled_inputs_2 = x_train.loc[:,"Children":"Pets"]
dummy_1 = x_train.loc[:,"Reason_1":"Reason_4"]
dummy_2 = x_train.loc[:,"Education"]
Concatenate the two numeric-variable DataFrames:
new_unscaled_var = pd.concat([new_unscaled_inputs,new_unscaled_inputs_2],axis=1)
No null values:
new_unscaled_var.isnull().sum()
Month 0
Transportation Expense 0
Age 0
Body Mass Index 0
Children 0
Pets 0
dtype: int64
Scale the numeric values:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_new = scaler.fit_transform(new_unscaled_var)
scaled_new
scaled_df = pd.DataFrame(scaled_new, columns=['Month', 'Transportation Expense', 'Age',
                                              'Body Mass Index', 'Children', 'Pets'])
scaled_df.isnull().sum()
No null values:
Month 0
Transportation Expense 0
Age 0
Body Mass Index 0
Children 0
Pets 0
dtype: int64
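Another detail hiding in the step above: fit_transform returns a plain NumPy array, so rebuilding a DataFrame from it gives scaled_df a fresh 0..n-1 RangeIndex unless the original index is passed back in. A minimal sketch (toy values, not my data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A frame with shuffled row labels, like x_train after train_test_split
df = pd.DataFrame({"v": [10.0, 20.0, 30.0]}, index=[7, 3, 42])

arr = StandardScaler().fit_transform(df)            # ndarray: no index attached
rebuilt = pd.DataFrame(arr, columns=df.columns)     # index silently becomes 0, 1, 2
preserved = pd.DataFrame(arr, columns=df.columns, index=df.index)

print(list(rebuilt.index))    # [0, 1, 2]
print(list(preserved.index))  # [7, 3, 42]
```

Passing index=df.index when rebuilding keeps the rows aligned with the frames that were never converted to arrays.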
Now I concatenate the dummy columns into one DataFrame:
dummy_df = pd.concat([dummy_1, dummy_2],axis=1)
Still no null values:
dummy_df.isnull().sum()
Reason_1 0
Reason_2 0
Reason_3 0
Reason_4 0
Education 0
dtype: int64
scaled_df.shape
dummy_df.shape
Now, when I concatenate the scaled_df columns and the dummy_df columns, I get 111 null values in every column:
scaled_inputs = pd.concat([scaled_df, dummy_df],axis = 1)
scaled_inputs.isnull().sum()
Month 111
Transportation Expense 111
Age 111
Body Mass Index 111
Children 111
Pets 111
Reason_1 111
Reason_2 111
Reason_3 111
Reason_4 111
Education 111
dtype: int64
I don't understand why. Please help me understand this.
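A minimal sketch of what is presumably happening: pd.concat with axis=1 aligns rows by index label, not by position. scaled_df was rebuilt from a NumPy array and got a fresh 0..559 index, while dummy_df kept the shuffled labels from train_test_split; every label present in only one of the two frames produces NaN in the other frame's columns (toy values below, not my data):

```python
import pandas as pd

scaled = pd.DataFrame({"x": [1.0, 2.0, 3.0]})               # fresh index 0, 1, 2
dummies = pd.DataFrame({"d": [1, 0, 1]}, index=[5, 0, 9])   # shuffled labels

out = pd.concat([scaled, dummies], axis=1)

# The result is indexed by the UNION of labels {0, 1, 2, 5, 9}:
# rows 5 and 9 have no 'x' value, rows 1 and 2 have no 'd' value
print(out)
print(out.isnull().sum())
```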
I got the answer of using reset_index(drop=True) from another Stack Overflow thread:
Link
scaled_df = scaled_df.reset_index(drop=True)
scaled_df.head()
Month Transportation Expense Age Body Mass Index Children Pets
0 0.454628 -0.996453 -1.131382 -1.084160 -0.925174 -0.595121
1 -1.261719 -0.666968 -0.977228 -1.771995 -0.925174 -0.595121
2 0.740685 0.171723 1.026777 2.584296 -0.016231 -0.595121
3 1.598859 0.366419 1.643394 1.208625 0.892711 0.223719
4 -0.403546 -1.026407 -0.360611 -0.396324 0.892711 -0.595121
dummy_df = dummy_df.reset_index(drop=True)
dummy_df.head()
Reason_1 Reason_2 Reason_3 Reason_4 Education
0 0 0 1 0 0
1 0 0 0 1 1
2 0 0 0 0 0
3 0 0 0 1 0
4 0 0 0 1 0
And here it works perfectly!!!!!
scaled_inputs = pd.concat([scaled_df, dummy_df],axis = 1)
scaled_inputs.isnull().sum()
Month 0
Transportation Expense 0
Age 0
Body Mass Index 0
Children 0
Pets 0
Reason_1 0
Reason_2 0
Reason_3 0
Reason_4 0
Education 0
dtype: int64
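One caveat worth noting about the reset_index fix: it returns a new DataFrame rather than modifying the original in place, so the result must be assigned back (or inplace=True used) for the concat to see the new index. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=[5, 9])

df.reset_index(drop=True)        # result discarded: df is unchanged
print(list(df.index))            # [5, 9]

df = df.reset_index(drop=True)   # assign back to keep the fresh index
print(list(df.index))            # [0, 1]
```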