Imputing missing values using a linear regression in Python
I am trying to use a linear regression to impute missing values in a pandas DataFrame:
```python
for index in [missing_data_df.horsepower.index]:
    i = 0
    if pd.isnull(missing_data_df.horsepower[index[i]]):
        # linear regression equation
        a = (0.25743277 * missing_data_df.displacement[index[i]]
             + 0.00958711 * missing_data_df.weight[index[i]]
             + 25.874947903262651)
        # replacing "nan" values in dataframe using .set_value
        missing_data_df.set_value(index[i], "horsepower", a)
    i += 1
```
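It runs without errors, but the missing values (NaN) in the DataFrame are not replaced by the regression prediction stored in the variable `a`. Any suggestions?
Here is the DataFrame containing the missing data: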
```
>>> missing_data_df
      mpg  cylinders  displacement  horsepower  weight  acceleration  \
10    NaN        4.0         133.0       115.0  3090.0          17.5
11    NaN        8.0         350.0       165.0  4142.0          11.5
12    NaN        8.0         351.0       153.0  4034.0          11.0
13    NaN        8.0         383.0       175.0  4166.0          10.5
14    NaN        8.0         360.0       175.0  3850.0          11.0
17    NaN        8.0         302.0       140.0  3353.0           8.0
38   25.0        4.0          98.0         NaN  2046.0          19.0
39    NaN        4.0          97.0        48.0  1978.0          20.0
133  21.0        6.0         200.0         NaN  2875.0          17.0
337  40.9        4.0          85.0         NaN  1835.0          17.3
343  23.6        4.0         140.0         NaN  2905.0          14.3
361  34.5        4.0         100.0         NaN  2320.0          15.8
367   NaN        4.0         121.0       110.0  2800.0          15.4
382  23.0        4.0         151.0         NaN  3035.0          20.5

     model_year  origin  car_name
10         70.0     2.0  citroen ds-21 pallas
11         70.0     1.0  chevrolet chevelle concours (sw)
12         70.0     1.0  ford torino (sw)
13         70.0     1.0  plymouth satellite (sw)
14         70.0     1.0  amc rebel sst (sw)
17         70.0     1.0  ford mustang boss 302
38         71.0     1.0  ford pinto
39         71.0     2.0  volkswagen super beetle 117
133        74.0     1.0  ford maverick
337        80.0     2.0  renault lecar deluxe
343        80.0     1.0  ford mustang cobra
361        81.0     2.0  renault 18i
367        81.0     2.0  saab 900s
382        82.0     1.0  amc concord dl
```
You can use apply with a lambda for this:
```python
import numpy as np

missing_data_df['horsepower'] = missing_data_df.apply(
    lambda row: 0.25743277 * row.displacement + 0.00958711 * row.weight + 25.874947903262651
    if np.isnan(row.horsepower) else row.horsepower,
    axis=1,
)
```
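One likely reason the loop in the question changes nothing: `[missing_data_df.horsepower.index]` is a list with a single element (the whole index), so the body runs once, `i` is always 0, and only the first label (10, where horsepower is 115.0) is ever checked. Below is a minimal sketch of a loop-free alternative that uses a boolean mask rather than `apply`. It assumes a three-row subset of the question's data, and the names `iterations`, `predicted`, and `mask` are illustrative only.

```python
import pandas as pd

# Three rows copied from the question's data (labels 10, 38, 133).
df = pd.DataFrame(
    {
        "displacement": [133.0, 98.0, 200.0],
        "horsepower": [115.0, float("nan"), float("nan")],
        "weight": [3090.0, 2046.0, 2875.0],
    },
    index=[10, 38, 133],
)

# Wrapping the index in a list gives a one-element list,
# so a for loop over it executes its body exactly once.
iterations = sum(1 for _ in [df.horsepower.index])

# Coefficients copied from the question's regression equation.
predicted = (
    0.25743277 * df["displacement"]
    + 0.00958711 * df["weight"]
    + 25.874947903262651
)

# Fill only the rows where horsepower is missing; other rows are untouched.
mask = df["horsepower"].isna()
df.loc[mask, "horsepower"] = predicted[mask]
```

Because the prediction is computed for the whole column at once, no explicit row loop is needed, which tends to matter on larger frames.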
A few things:
- missing_data_df.horsepower has no missing values
- missing_data_df.weight, one of the variables in your formula, does have missing values
- if hp = 0.25743277 * disp + 0.00958711 * weight + 25.874947903262651,
then weight = (0.25743277 * disp + 25.874947903262651 - hp) / -0.00958711
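For reference, the rearrangement in the last bullet can be written out step by step. The symbols $h$ (horsepower), $d$ (displacement), and $w$ (weight) are shorthand introduced here, not names from the original post.

```latex
\begin{aligned}
h &= 0.25743277\,d + 0.00958711\,w + 25.874947903262651 \\
0.00958711\,w &= h - 0.25743277\,d - 25.874947903262651 \\
w &= \frac{h - 0.25743277\,d - 25.874947903262651}{0.00958711}
   = \frac{0.25743277\,d + 25.874947903262651 - h}{-0.00958711}
\end{aligned}
```

The last equality only multiplies numerator and denominator by $-1$, which gives the form used in the bullet.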
To compute the weight, try:
```python
for idx in missing_data_df.index:
    if pd.isnull(missing_data_df.loc[idx, "weight"]):
        disp = missing_data_df.loc[idx, "displacement"]
        hp = missing_data_df.loc[idx, "horsepower"]
        missing_data_df.loc[idx, "weight"] = (0.25743277 * disp + 25.874947903262651 - hp) / -0.00958711
```
In general, .loc[] and .iloc[] are better ways to look up or set values.
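As a rough illustration of setting values through `.loc[]`, the sketch below fills a missing weight with a boolean mask in a single indexing step. The two-row frame is hypothetical (it is not the question's data), and the names `A`, `B`, `C`, `missing`, and `hp_check` are introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: weight is missing in the second one.
df = pd.DataFrame(
    {
        "displacement": [133.0, 98.0],
        "horsepower": [115.0, 70.0],
        "weight": [3090.0, np.nan],
    },
    index=[10, 38],
)

# Coefficients copied from the question's regression equation.
A, B, C = 0.25743277, 0.00958711, 25.874947903262651

# .loc with a boolean mask selects rows and a column in one step,
# so the assignment writes into df itself rather than into a temporary copy.
missing = df["weight"].isna()
df.loc[missing, "weight"] = (
    df.loc[missing, "horsepower"] - A * df.loc[missing, "displacement"] - C
) / B

# Round trip: plugging the filled weight back into the equation
# should reproduce the horsepower of that row.
hp_check = A * df["displacement"] + B * df["weight"] + C
```

Writing through a single `.loc[rows, column]` call avoids chained indexing such as `df["weight"][idx] = ...`, which may assign to a copy and leave the original frame unchanged.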