Linear Regression - how to predict the estimated relative performance?

Paul needs a laptop that is fast enough. One of the main parameters of a computer that he must focus on is the CPU. In this project we need to forecast the performance of the CPU, which is characterized in terms of cycle time and memory capacity and so on.

This is a linear regression problem, and you should predict the Estimated Relative Performance column.

I am new to Python. Can anyone help me with the code for this task?

CSV file (on Google Drive)

This is what I have done so far, though I may not have understood the task correctly.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

data = pd.read_csv("Computer_Hardware.csv")
data
data.describe()

y = data["Machine Cycle Time in nanoseconds"]
x1 = data["Estimated Relative Performance"]
plt.scatter(x1,y)
plt.xlabel("Estimated Relative Performance", fontsize = 20)
plt.ylabel("Machine Cycle Time in nanoseconds", fontsize = 20)
plt.show()

x = sm.add_constant(x1)
results = sm.OLS(y,x).fit()
results.summary()

If you don't have much Python experience, Keras is one of the easiest libraries to use. Here is a good tutorial: https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/

With any fitted model from statsmodels, you can use the predict() method to extract the predicted values and then add them to your DataFrame.

data['predicted'] = results.predict()

Your model may need more work. At the moment it uses only one variable, and you might get better predictions from another model that uses more variables.

y = b0 + b1 * x1
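To make the one-variable equation concrete, here is a small sketch that recovers b0 and b1 from noise-free data and evaluates the equation for a new input. It uses np.polyfit rather than statsmodels to keep the example short, and the numbers are hypothetical.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 3 + 2 * x1
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x1

# np.polyfit returns coefficients from highest degree to lowest: [b1, b0]
b1, b0 = np.polyfit(x1, y, deg=1)

# A prediction is just the fitted equation evaluated at the new value
x_new = 10.0
y_new = b0 + b1 * x_new
print(b0, b1, y_new)
```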

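The problem statement says the CPU is "... characterized in terms of cycle time and memory capacity and so on", which suggests that more than one variable matters.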

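One suggestion is to extend your model by writing a formula with the statsmodels formula API. In your case, I would first remove all the spaces from the column names.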

# Rename columns without spaces
old_columns = data.columns
new_columns = [col.replace(' ', '_') for col in old_columns]
data = data.rename(columns={old:new for old, new in zip(old_columns, new_columns)})

# Fit a model using more variables
import statsmodels.formula.api as sm2

formula = ('Estimated_Relative_Performance ~ ', 
           'Machine_Cycle_Time_in_nanoseconds + ',
           'Maximum_Main_Memory_in_kilobytes + ', 
           'Cache_Memory_in_Kilobytes + ', 
           'Maximum_Channels_in_Units')
formula = ' '.join(formula)
print(formula)

results2 = sm2.ols(formula, data).fit()
results2.summary()

data['predicted2'] = results2.predict()
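As a rough way to check whether the extra variables help, the sketch below compares a one-variable and a two-variable formula model on synthetic data and then predicts for a new row. The column names `cycle_time`, `max_memory`, and `perf` and the data itself are invented for illustration; with the real file you would use the renamed columns from above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical column names and values)
rng = np.random.default_rng(1)
n = 100
demo = pd.DataFrame({
    "cycle_time": rng.uniform(20, 1500, n),
    "max_memory": rng.uniform(64, 64000, n),
})
demo["perf"] = (10.0 - 0.01 * demo["cycle_time"]
                + 0.005 * demo["max_memory"]
                + rng.normal(0, 5.0, n))

# Same target, one predictor versus two predictors
one_var = smf.ols("perf ~ cycle_time", demo).fit()
two_var = smf.ols("perf ~ cycle_time + max_memory", demo).fit()
print(one_var.rsquared_adj, two_var.rsquared_adj)

# With the formula API, new data only needs the raw predictor columns;
# the intercept is added automatically
new_rows = pd.DataFrame({"cycle_time": [100.0], "max_memory": [8000.0]})
new_pred = two_var.predict(new_rows)
print(new_pred)
```

Adjusted R-squared is used for the comparison because plain R-squared never decreases when a predictor is added, so it cannot tell you whether the extra variable is worth including.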


I studied a few blogs on this topic and came up with this code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

%matplotlib inline

raw_data = pd.read_csv("Computer_Hardware.csv")

# Feature columns
x = raw_data[['Machine Cycle Time in nanoseconds',
       'Minimum Main Memory in Kilobytes', 'Maximum Main Memory in kilobytes',
       'Cache Memory in Kilobytes', 'Minimum Channels in Units',
       'Maximum Channels in Units', 'Published Relative Performance']]

# Target column
y = raw_data['Estimated Relative Performance']

# Hold out 30% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

model = LinearRegression()
model.fit(x_train, y_train)

# Fitted coefficients and intercept
print(model.coef_)
print(model.intercept_)
pd.DataFrame(model.coef_, x.columns, columns = ['Coeff'])

predictions = model.predict(x_test)

# Histogram of the residuals on the test set
plt.hist(y_test - predictions)

# MAE, MSE, and RMSE on the test set
metrics.mean_absolute_error(y_test, predictions)
metrics.mean_squared_error(y_test, predictions)
np.sqrt(metrics.mean_squared_error(y_test, predictions))
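Because train_test_split shuffles the rows randomly, the metrics above change from run to run. A possible refinement, sketched here on synthetic data, is to fix random_state so the split is reproducible and to report R-squared next to RMSE. The arrays `X` and `y` below are generated for illustration and stand in for the real feature and target columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: three features, linear target plus small noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=200)

# random_state pins the shuffle, so every run uses the same split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# RMSE is in the units of the target; R-squared is unitless
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(rmse, r2)
```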