使用梯度下降（最速下降）估计线性回归

Question

示例数据

X<-matrix(c(rep(1,97),runif(97)) , nrow=97, ncol=2)
y<-matrix(runif(97), nrow= 97 , ncol =1)

我已成功创建成本函数

COST<-function(theta,X,y){
### Calculate half MSE 
    sum((X %*% theta - y)^2)/(2*length(y))
}

但是当我运行这个函数时，它似乎无法收敛超过 100 次迭代。

theta <- matrix (0, nrow=2,ncol=1)
num.iters <- 1500
delta = 0 

GD<-function(X,y,theta,alpha,num.iters){
    for (i in num.iters){

        while (max(abs(delta)) < tolerance){

            error <- X %*% theta - y
            delta <- (t(X) %*% error) / length(y)
            theta <- theta - alpha * delta
            cost_histo[i] <- COST(theta,X,y)
            theta_histo[[i]] <- theta

  }
  }
        return (list(cost_histo, theta_histo))
  }

有人可以帮助我吗？

干杯

Answer 1

您实现的算法部分是正确的。问题出在

GD中的循环结构不对； for 循环是多余的，变量缺乏正确的初始化。
使用固定的 alpha 简单地实现梯度下降是危险的。通常建议把这个alpha选的足够小，希望我们一直往下搜索objective函数。然而，这在实践中很少见。例如，多小才足够？如果它很小，那么收敛速度就是个问题；但是如果它很大，我们可能会陷入'zig-zag'搜索路径甚至发散！

这里是梯度下降的稳健版本，用于估计线性回归。改进来自 步减半 策略，以避免 "zig-zag" 或分歧。请参阅代码中的注释。在这种策略下，使用largealpha是安全的。 保证收敛

# theta: initial guess on regression coef
# alpha: initial step scaling factor
GD <- function(X, y, theta, alpha) {
  cost_histo <- numeric(0)
  theta_histo <- numeric(0)
  # an arbitrary initial gradient, to pass the initial while() check
  delta <- rep(1, ncol(X))
  # MSE at initial theta
  old.cost <- COST(theta, X, y)
  # main iteration loop
  while (max(abs(delta)) > 1e-7) {
    # gradient 
    error <- X %*% theta - y
    delta <- crossprod(X, error) / length(y)
    # proposal step
    trial.theta <- theta - alpha * c(delta)
    trial.cost <- COST(trial.theta, X, y)
    # step halving to avoid divergence
    while (trial.cost >= old.cost) {
      trial.theta <- (theta + trial.theta) / 2
      trial.cost <- COST(trial.theta, X, y)
      }
    # accept proposal
    cost_histo <- c(cost_histo, trial.cost)
    theta_histo <- c(theta_histo, trial.theta)
    # update old.cost and theta
    old.cost <- trial.cost
    theta <- trial.theta
    }
  list(cost_histo, theta_histo = matrix(theta_histo, nrow = ncol(X)))
  }

在 return,

cost_histo的长度告诉你已经进行了多少次迭代（不包括步减半）；
theta_histo 的每一列每次迭代给出 theta。

步减半实际上大大加快了收敛速度。如果对 COST 使用更快的计算方法，您可以获得更高的效率。（对大型数据集最有用。参见）

COST<-function(theta,X, y) {
  c(crossprod(X %*% theta - y)) /(2*length(y))
  }

现在，让我们考虑一下它在您的示例中的实现 X、y。

oo <- GD(X, y, c(0,0), 5)

经过 107 次迭代后收敛。我们可以查看MSE的踪迹：

plot(oo[[1]])

请注意，在前几步，MSE 下降得非常快，但随后几乎持平。这揭示了梯度下降算法的根本缺点：随着我们越来越接近最小值，收敛速度越来越慢。

现在，我们提取最终的估计系数：

oo[[2]][, 107]

我们还可以将其与 QR 因式分解的直接估计进行比较：

.lm.fit(X, y)$coef

他们很接近。

Answer 2

crossprod 使它比以前的方法慢得惊人：

以前的方法（50 次迭代平均 14 秒）：

Crossprod 方法（50 次迭代平均 16 秒）：

使用梯度下降（最速下降）估计线性回归

Estimating linear regression with Gradient Descent (Steepest Descent)

regression

r

linear-regression

gradient-descent