Python linear regression: dense vs. sparse
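I need to use linear regression on a sparse matrix. My results have been consistently poor, so I decided to test it on a non-sparse matrix stored in a sparse representation. The data is taken from https://www.analyticsvidhya.com/blog/2021/05/multiple-linear-regression-using-python-and-scikit-learn/.
I max-normalized some of the columns. The CSV file is here:
https://drive.google.com/file/d/17wHv1Cc3RKgshprIKTcWUSxZOWlG68__/view?usp=sharing
Running ordinary linear regression on it works fine. Sample code: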
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("maxnorm_50_Startups.csv")
y = df['Profit']  # target column
x = df.drop('Profit', axis=1)  # every other column is a feature
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
LR = LinearRegression()
LR.fit(x_train, y_train)
y_prediction = LR.predict(x_test)
score = r2_score(y_test, y_prediction)
print('r2 score is', score)
Sample result:
r2 score is 0.9683831928840445
I want to repeat this with a sparse matrix. I converted the CSV to a sparse representation:
https://drive.google.com/file/d/1CFWbBbtiSqTSlepGuYXsxa00MSHOj-Vx/view?usp=sharing
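The conversion step itself is not shown in the post. As a rough sketch of how a wide table could be reshaped into the (x, feature, value) layout that the code below reads, assuming the row number becomes the `x` column: the helper name `to_relational` and the toy column `Spend` are made up for illustration and are not from the linked file.

```python
import pandas as pd


def to_relational(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape a wide table into one (x, feature, value) row per cell.

    Hypothetical helper: the column names x / feature / value match the ones
    used by the regression code in the post; everything else is an assumption.
    """
    return (
        df.rename_axis("x")   # name the row index 'x'
          .reset_index()      # turn it into an ordinary column
          .melt(id_vars="x", var_name="feature", value_name="value")
    )


# Toy input standing in for the real CSV (values are invented).
wide = pd.DataFrame({"Spend": [1.0, 0.0, 0.5], "Profit": [0.9, 0.4, 0.7]})
relational = to_relational(wide)
print(relational)
```

A truly sparse file would usually also drop the rows whose value is zero; this sketch keeps every cell so that no target value is lost.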
Here is my code for running linear regression on the sparse representation:
import random

import pandas as pd
import scipy.sparse
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv("maxnorm_50_Startups_relational.csv")
df['x'] = pd.to_numeric(df['x'], errors='raise')
m = len(df.x.unique())
for i in range(0, m):  # randomize the 'x' values to randomize the train/test split
    n = random.randint(0, m)
    df.loc[df['x'] == n, 'x'] = m
    df.loc[df['x'] == i, 'x'] = n
    df.loc[df['x'] == m, 'x'] = i
y = df[df['feature'] == 'Profit']
x = df[df['feature'] != 'Profit']
y = y.drop('feature', axis=1)
x['feat'] = pd.factorize(x['feature'])[0]  # the sparse matrix code below can't work with strings
x_train = x[x['x'] <= 39]
x_test = x[x['x'] >= 40]
y_train = y[y['x'] <= 39]
y_test = y[y['x'] >= 40]
x_test['x'] = x_test['x'] - 40  # the sparse matrix assumes that if a row is numbered 50
y_test['x'] = y_test['x'] - 40  # there must be 50 records; there are 10, so renumber from 0
x_train_sparse = scipy.sparse.coo_matrix((x_train.value, (x_train.x, x_train.feat)))
# print(x_train_sparse.todense())
x_test_sparse = scipy.sparse.coo_matrix((x_test.value, (x_test.x, x_test.feat)))
LR = LinearRegression()
LR.fit(x_train_sparse, y_train)
y_prediction = LR.predict(x_test_sparse)
score = r2_score(y_test, y_prediction)
print('r2 score is', score)
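Running this, I get a negative R2 score, for example:
r2 score is -10.794519939249602
This suggests that the linear regression is not working. I can't figure out where I went wrong. I also tried implementing the linear regression equations myself instead of using the library function, but I still get a negative R2 score. What is my mistake?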
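Linear Regression performs poorly on sparse data.
There are other linear algorithms, such as Ridge, Lasso, Bayesian Ridge, and ElasticNet, that perform the same on dense and sparse data. These algorithms are similar to linear regression, but their loss functions include an additional penalty term.
There are also nonlinear algorithms, such as RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, and XGBRegressor (from the xgboost package), that likewise perform the same on sparse and dense matrices.
I suggest using these algorithms instead of plain linear regression.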
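As a rough illustration of the Ridge suggestion, the sketch below fits the same model once on a dense array and once on a SciPy CSR copy of it, then compares the R2 scores. The data is synthetic and invented for this example (it is not the startups dataset), and `alpha=0.1` is an arbitrary choice. Row i of the matrix has to line up with element i of the target vector in both cases.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic data (assumption, for illustration only):
# 200 rows, 5 features, roughly 70% of the entries set to zero.
X = rng.random((200, 5)) * (rng.random((200, 5)) > 0.7)
true_coef = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_coef + 0.01 * rng.standard_normal(200)

# Simple ordered split; row i of X always corresponds to y[i].
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]


def fit_and_score(train, test):
    """Fit Ridge on `train` and return the R2 score on `test`."""
    model = Ridge(alpha=0.1)
    model.fit(train, y_train)
    return r2_score(y_test, model.predict(test))


r2_dense = fit_and_score(X_train, X_test)
r2_sparse = fit_and_score(sparse.csr_matrix(X_train), sparse.csr_matrix(X_test))
print("dense R2: ", r2_dense)
print("sparse R2:", r2_sparse)
```

The two scores may differ slightly because scikit-learn can pick a different solver for sparse input, so they should be compared with a tolerance rather than for exact equality.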