Error in Gradient Descent Calculation

Question

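I tried to write a function that fits a linear regression model by gradient descent. However, the answer I get does not match the one I get from the normal equation method.

My example data is: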

df <- data.frame(c(1,5,6),c(3,5,6),c(4,6,8))

where c(4,6,8) holds the y values.

lm_gradient_descent <- function(df,learning_rate, y_col=length(df),scale=TRUE){
  n_features <- length(df) #n_features is the number of features in the data set
  #using mean normalization to scale features
  if(scale==TRUE){
    for (i in 1:(n_features)){
      df[,i] <- (df[,i]-mean(df[,i]))/sd(df[,i])
    }
  }
  y_data <- df[,y_col]
  df[,y_col] <- NULL
  par <- rep(1,n_features)
  df <- merge(1,df)
  data_mat <- data.matrix(df)
  #we need a temp_arr to store each iteration of parameter values so that we can do a 
  #simultaneous update
  temp_arr <- rep(0,n_features)
  diff <- 1
  while(diff>0.0000001){
    for (i in 1:(n_features)){
      temp_arr[i] <- par[i]-learning_rate*sum((data_mat%*%par-y_data)*df[,i])/length(y_data)
    }
    diff <- par[1]-temp_arr[1]
    print(diff)
    par <- temp_arr
  }

  return(par)
}

Running this function with

lm_gradient_descent(df,0.0001,,0)

I get the result

c(0.9165891,0.6115482,0.5652970)

When I use the normal equation method, I get

c(2,1,0).

I hope someone can point out where I went wrong in this function.

Answer 1

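It looks like you have not implemented a bias term. In a linear model like this you always want an extra additive constant, i.e. your model should look like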

w_0 + w_1*x_1 + ... + w_n*x_n.

Without the w_0 term, you will usually not get a good fit.
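A minimal sketch of what the bias term looks like in practice, using the data from the question: prepending a column of ones to the feature matrix makes the first coefficient play the role of w_0. The names X_design and w are illustrative only, and solving the normal equation with solve() is just one way to check the fit, not necessarily what the asker did.

```r
# Features and targets taken from the question's data frame
X <- cbind(c(1, 5, 6), c(3, 5, 6))
y <- c(4, 6, 8)

# Prepend a column of ones so the first weight acts as the bias w_0
X_design <- cbind(1, X)

# Normal equation: solve (X'X) w = X'y for w = (w_0, w_1, w_2)
w <- solve(t(X_design) %*% X_design, t(X_design) %*% y)
print(drop(w))
```

Any gradient descent implementation for this model should approach the same vector w when it runs on the same unscaled design matrix.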

Answer 2

You use the stopping criterion

old parameters - new parameters <= 0.0000001

First, I think you are missing an abs() if you want to use this criterion (though I may be wrong, given how little R I know). But even if you use

abs(old parameters - new parameters) <= 0.0000001

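this is not a good stopping criterion: it only tells you that progress has slowed down, not that the result is accurate enough. Try simply running a fixed number of iterations. Unfortunately, it is not easy to give a good, generally applicable stopping criterion for gradient descent.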
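As a rough illustration of the fixed-iteration approach, the sketch below runs batch gradient descent for a set number of steps and compares the result with lm(). The function name gd_fixed, the learning rate 0.5 and the iteration count 20000 are assumptions picked for this small, scaled data set, not values from the question.

```r
# Batch gradient descent with a fixed iteration budget (hypothetical helper)
gd_fixed <- function(X, y, learning_rate, n_iter) {
  X <- cbind(1, X)                    # intercept column
  theta <- rep(0, ncol(X))            # start from all zeros
  for (k in seq_len(n_iter)) {
    grad <- drop(t(X) %*% (X %*% theta - y)) / length(y)
    theta <- theta - learning_rate * grad
  }
  theta
}

X <- cbind(c(1, 5, 6), c(3, 5, 6))
y <- c(4, 6, 8)

# Scale the features only, then compare against the least-squares fit
theta_hat <- gd_fixed(scale(X), y, learning_rate = 0.5, n_iter = 20000)
reference <- unname(coef(lm(y ~ scale(X))))
print(cbind(gradient_descent = theta_hat, lm = reference))
```

With a fixed budget the loop always terminates, and accuracy is judged afterwards by comparing against a reference fit rather than by how small the last step was.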

Answer 3

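I know this is a few weeks old by now, but I am going to take a stab at it for several reasons, namely

I am relatively new to R, so deciphering your code and rewriting it is good practice for me
I am working on a different gradient descent problem, so this is all fresh in my mind
I need the Stack Overflow points, and
As far as I can tell, you never got a working answer.

First, regarding your data structures. You start with a data frame, rename a column, strip out a vector, and then strip out a matrix. It would be much easier to start with an X matrix (capitalized because its component 'features' are referred to as x subscript i) and a y solution vector.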

X <- cbind(c(1,5,6),c(3,5,6))
y <- c(4,6,8)

We can easily see what the desired solutions are, with and without scaling, by fitting a linear model. (Note that we scale only the X/features and not the y/solutions.)

> lm(y~X)

Call:
lm(formula = y ~ X)

Coefficients:
(Intercept)           X1           X2  
         -4           -1            3  

> lm(y~scale(X))

Call:
lm(formula = y ~ scale(X))

Coefficients:
(Intercept)    scale(X)1    scale(X)2  
      6.000       -2.646        4.583
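The two fits above describe the same line in different coordinates. As a hedged sketch, the code below converts the coefficients from the scaled fit back to the original feature scale using the centering and scaling attributes that scale() attaches to its result. The variable names (Xs, centers, scales, beta_original) are illustrative.

```r
X <- cbind(c(1, 5, 6), c(3, 5, 6))
y <- c(4, 6, 8)

Xs <- scale(X)
centers <- attr(Xs, "scaled:center")  # column means used by scale()
scales  <- attr(Xs, "scaled:scale")   # column standard deviations used by scale()

beta_scaled <- unname(coef(lm(y ~ Xs)))

# Undo the scaling: each slope is divided by its column's standard deviation,
# and the intercept absorbs the shift caused by centering
slopes    <- beta_scaled[-1] / scales
intercept <- beta_scaled[1] - sum(slopes * centers)
beta_original <- c(intercept, slopes)
print(beta_original)
```

The result should agree with the coefficients of lm(y~X), which is a convenient check when gradient descent is run on scaled features but the answer is wanted on the original scale.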

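Regarding your code, one of the nice features of R is that it can perform matrix multiplication, which is much faster than using loops.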

lm_gradient_descent <- function(X, y, learning_rate, scale=TRUE){

  if(scale==TRUE){X <- scale(X)}

  X <- cbind(1, X)

  theta <- rep(0, ncol(X)) #your old temp_arr
  diff <- 1
  old.error <- sum( (X %*% theta - y)^2 ) / (2*length(y))
  while(diff>0.000000001){
    theta <- theta - learning_rate * t(X) %*% (X %*% theta - y) / length(y)
    new.error <- sum( (X %*% theta - y)^2 ) / (2*length(y))
    diff <- abs(old.error - new.error)
    old.error <- new.error
  }
  return(theta)
}

And to show that it works...

> lm_gradient_descent(X, y, .01, 0)
           [,1]
[1,] -3.9360685
[2,] -0.9851775
[3,]  2.9736566

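versus the expected (-4, -1, 3).

Although I agree with @cfh that a loop with a defined number of iterations is preferable, I am actually not sure you need the abs function. If diff < 0, then your function is not converging.

Finally, rather than using something like old.error and new.error, I would suggest using a vector that records all of the errors. You can then plot that vector to see how quickly your function converges.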
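One possible way to follow that suggestion is sketched below: the cost after every update is stored in a vector, which can then be plotted or inspected for increases. The function name gd_with_history, the iteration count and the returned list layout are assumptions made for this sketch.

```r
# Gradient descent that records the cost after every update (hypothetical helper)
gd_with_history <- function(X, y, learning_rate, n_iter = 5000) {
  X <- cbind(1, X)                    # intercept column
  m <- length(y)
  theta <- rep(0, ncol(X))
  cost_history <- numeric(n_iter)
  for (k in seq_len(n_iter)) {
    residual <- drop(X %*% theta) - y
    theta <- theta - learning_rate * drop(t(X) %*% residual) / m
    cost_history[k] <- sum((drop(X %*% theta) - y)^2) / (2 * m)
  }
  list(theta = theta, cost = cost_history)
}

X <- cbind(c(1, 5, 6), c(3, 5, 6))
y <- c(4, 6, 8)
fit <- gd_with_history(X, y, learning_rate = 0.01)

# A falling curve indicates convergence; any increase hints at a learning rate that is too large
plot(fit$cost, type = "l", log = "y", xlab = "iteration", ylab = "cost")
cost_increased <- any(diff(fit$cost) > 0)
print(cost_increased)
```

Checking diff(fit$cost) for positive entries plays the same role as testing whether diff < 0 in the loop, but it keeps the full history available for plotting.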

r

gradient-descent