Simple prediction using Pearson correlation and linear regression with Python
I have a dataset like this:
Value Month Year
103.4 April 2006
270.6 August 2006
51.9 December 2006
156.9 February 2006
126.9 January 2006
96.8 July 2006
183.1 June 2006
266.6 March 2006
193.1 May 2006
524.7 November 2006
619.9 October 2006
129 September 2006
374.1 April 2007
260.5 August 2007
119.6 December 2007
9.9 February 2007
91.1 January 2007
106.6 July 2007
79.9 June 2007
60.5 March 2007
432.4 May 2007
128.8 November 2007
292.1 October 2007
129.3 September 2007
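Value is the annual rainfall of a district; let's call it DistrictA. I have the dataset from 2006 to 2014, and I need to predict DistrictA's rainfall for the next 2 years. I chose Pearson correlation and linear regression from the sklearn library to predict the data. I am confused and don't know how to set X and Y. I am new to Python, so any help is valuable. Thank you.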
P.S. I found code like this:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
When I print diabetes_X_train, it gives me this:
[[ 0.07786339]
[-0.03961813]
[ 0.01103904]
[-0.04069594]
[-0.03422907]...]
I assume these are the r values obtained from the correlation and coefficients.
When I print diabetes_y_train, it gives me something like this:
[ 233. 91. 111. 152. 120. .....]
My question is how to get the r values from the rainfall data and assign them to the x axis.
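This is not the best solution, but it works.
A small note: I replaced each month name with its index in a list, because the algorithm needs numeric input.
I also replaced the whitespace delimiter with ';', because different rows contain a different number of spaces, which is inconvenient. Now your data is: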
Value;Month;Year
103.4;April;2006
270.6;August;2006
51.9;December;2006
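As an alternative to editing the file by hand, pandas can usually split on runs of whitespace itself, and the month names can be mapped to numbers in one step. A rough sketch, assuming pandas is installed; the inline text stands in for the original space-separated file, and the names raw_text and month_number are illustrative only:

```python
import io
import pandas as pd

# Stand-in for the original file: columns separated by varying whitespace
raw_text = """Value   Month  Year
103.4 April     2006
270.6  August 2006
51.9 December   2006
"""

# sep=r'\s+' treats any run of spaces or tabs as a single delimiter
data = pd.read_csv(io.StringIO(raw_text), sep=r'\s+')

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
# Map each month name to its number: January -> 1 ... December -> 12
month_number = {name: i for i, name in enumerate(months, start=1)}
data['Month'] = data['Month'].map(month_number)

print(data)
```

This sketch numbers months from 1, whereas list.index numbers them from 0. Either convention should work, as long as the training rows and the rows passed to predict use the same one.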
The initial data file is 'data.csv'.
import pandas as pd
import sklearn.linear_model as ll

data = pd.read_csv('data.csv', sep=';')
month = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
# Replace each month name with its 0-based index in the list (January -> 0).
# The old .ix indexer has been removed from pandas, so columns are selected by name.
data['Month'] = data['Month'].map(month.index)
X = data[['Month', 'Year']]  # features: month index and year
y = data['Value']            # target: rainfall
lr = ll.LinearRegression()
lr.fit(X, y)
######### TEST DATA ##########
# Months use the same 0-based index here, so 1 is February and 2 is March
X_test = [[1, 2008], [2, 2008]]
X_test = pd.DataFrame(X_test, columns=['Month', 'Year'])
y_test = lr.predict(X_test)
print(y_test)
As a result of the test, I got these values:
[69.23079837 80.63691725]
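On the r value asked about in the question: Pearson's r is a single number describing how strongly two columns are linearly related. It is not a set of values to place on the x axis; diabetes_X_train in the quoted code is simply one feature column of that dataset. A minimal sketch, assuming NumPy and using only the 24 rows shown in the question, with months numbered 1 to 12 (the variable names are illustrative):

```python
import numpy as np

# The 24 rows shown in the question, reordered chronologically:
# rainfall for January..December of 2006, then of 2007.
rain_2006 = [126.9, 156.9, 266.6, 103.4, 193.1, 183.1,
             96.8, 270.6, 129.0, 619.9, 524.7, 51.9]
rain_2007 = [91.1, 9.9, 60.5, 374.1, 432.4, 79.9,
             106.6, 260.5, 129.3, 292.1, 128.8, 119.6]

value = np.array(rain_2006 + rain_2007)
month = np.tile(np.arange(1, 13), 2)  # 1..12 repeated for both years

# Pearson's r between month number and rainfall: one number in [-1, 1]
r = np.corrcoef(month, value)[0, 1]

# Least-squares line: value ~ slope * month + intercept (a single feature)
slope, intercept = np.polyfit(month, value, 1)
pred = slope * month + intercept
ss_res = np.sum((value - pred) ** 2)
ss_tot = np.sum((value - value.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("Pearson r:", round(r, 3))
print("R^2 of the fitted line:", round(r_squared, 3))
```

With a single feature, the R^2 of the fitted line equals r squared, so a small r signals that a straight line in the month number explains little of the variation in rainfall.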