
How to generate data for gaussian distributions in these 2 scenarios in R?

In Tibshirani's "Elements of Statistical Learning", when comparing least squares/linear models with kNN, these two scenarios are described:

Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.

Scenario 2: The training data in each class came from a mixture of 10 low- variance Gaussian distributions, with individual means themselves distributed as Gaussian.

The idea is that the first scenario suits a least squares/linear model better, while the second suits kNN-like models better (models with higher variance, as I understand it, since kNN considers only the nearest points rather than all points).

How would I simulate data for these two scenarios in R?

The ultimate goal is to be able to reproduce both scenarios and show that a linear model explains the first one more effectively than the second.

Thanks!

In the code below, I first create 10 different means for the classes and then use those means to draw random values around them. The code is the same for both scenarios, but you have to adjust the variance within and between classes to get the result you want.

Scenario 1:

Here you want to generate 10 classes with different means (I assume the means themselves follow a bivariate Gaussian distribution). The variation between classes is much smaller than the variation within classes.
library(MASS)
n <- 20              # subjects per class
classes <- 10        # number of classes
mean <- 100          # mean value for all classes
var.between <- 25    # variation between classes
var.within <- 225    # variation within classes
covmatrix1 <- matrix(c(var.between, 0, 0, var.between), nrow = 2)
# covariance matrix for the class means
means <- mvrnorm(classes, c(mean, mean), Sigma = covmatrix1)
# creates the means of the two variables for each class using the variance between classes
covmatrix2 <- matrix(c(var.within, 0, 0, var.within), nrow = 2)
# covariance matrix for the subjects
class <- NULL
values <- NULL
for (i in 1:classes) {
  temp <- mvrnorm(n, means[i, ], Sigma = covmatrix2)
  class <- c(class, rep(i, n))
  values <- rbind(values, temp)  # stack rows so X1 and X2 stay in their own columns
}
# this loop generates data for each class based on the class means and the variance within classes
data <- data.frame(class, values)
plot(data$X1, data$X2, col = data$class)

Alternatively, if you don't care about specifying the variance between the classes and you don't want any correlation within the classes, you can simply do this:

covmatrix <- matrix(c(225, 0, 0, 225), nrow = 2)
# variance of 225 in both dimensions and no covariance
values <- mvrnorm(200, c(100, 100), Sigma = covmatrix)
# creates a matrix of 200 individuals with two values each
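A quick sanity check on that simpler draw can confirm the parameters were honoured — a sketch, reproducing the draw with an arbitrary seed and arbitrary tolerances:

```r
library(MASS)  # mvrnorm(); ships with R

set.seed(42)
covmatrix <- matrix(c(225, 0, 0, 225), nrow = 2)
values <- mvrnorm(200, c(100, 100), Sigma = covmatrix)

colMeans(values)               # both column means should sit near 100
apply(values, 2, var)          # both sample variances should sit near 225
cor(values[, 1], values[, 2])  # correlation should sit near 0
```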

Scenario 2:

The only difference here is that the variation between classes is larger than the variation within classes. Try swapping the value of var.between to around 500 and the value of var.within to 25, and you will see clear clusters in the scatterplot:

n <- 20              # subjects per class
classes <- 10        # number of classes
mean <- 100          # mean value for all classes
var.between <- 500   # variation between classes
var.within <- 25     # variation within classes
covmatrix1 <- matrix(c(var.between, 0, 0, var.between), nrow = 2)
# covariance matrix for the class means
means <- mvrnorm(classes, c(mean, mean), Sigma = covmatrix1)
# creates the means of the two variables for each class using the variance between classes
covmatrix2 <- matrix(c(var.within, 0, 0, var.within), nrow = 2)
# covariance matrix for the subjects
class <- NULL
values <- NULL
for (i in 1:classes) {
  temp <- mvrnorm(n, means[i, ], Sigma = covmatrix2)
  class <- c(class, rep(i, n))
  values <- rbind(values, temp)  # stack rows so X1 and X2 stay in their own columns
}
# this loop generates data for each class based on the class means and the variance within classes
data <- data.frame(class, values)
plot(data$X1, data$X2, col = data$class)

The plot should confirm that the data are clustered.

Hope this helps!

This could be scenario 1:

library(mvtnorm)

N1 = 50  # observations in class 1
N2 = 50  # observations in class 2
K = 2    # number of classes

mu1 = c(-1, 3)  # mean of class 1
mu2 = c(2, 0)   # mean of class 2

# uncorrelated components: zero covariance, equal variances
cov1 = 0
v11 = 2
v12 = 2
Sigma1 = matrix(c(v11, cov1, cov1, v12), nrow = 2)

cov2 = 0
v21 = 2
v22 = 2
Sigma2 = matrix(c(v21, cov2, cov2, v22), nrow = 2)

x1 = rmvnorm(N1, mu1, Sigma1)
x2 = rmvnorm(N2, mu2, Sigma2)
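The two class draws can then be stacked into one labelled data frame for plotting. A sketch along these lines, using MASS::mvrnorm (which ships with R) in place of rmvnorm so it runs without mvtnorm — the seed and the factor labels are arbitrary choices:

```r
library(MASS)  # mvrnorm(); ships with R

set.seed(1)
N1 <- 50
N2 <- 50
x1 <- mvrnorm(N1, c(-1, 3), diag(2) * 2)  # class 1, same parameters as above
x2 <- mvrnorm(N2, c(2, 0), diag(2) * 2)   # class 2

# one data frame with a class label per row
train <- data.frame(rbind(x1, x2),
                    class = factor(rep(1:2, c(N1, N2))))
plot(train$X1, train$X2, col = train$class)
```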

This could be a candidate for simulating from a Gaussian mixture:

BartSimpson <- function(x, n = 100) {
   means <- as.matrix(sort(rnorm(10)))                            # 10 component means, themselves Gaussian
   dens <- .1 * rowSums(apply(means, 1, dnorm, x = x, sd = .1))   # equal-weight mixture density at x
   rBartSimpson <- c(apply(means, 1, rnorm, n = n / 10, sd = .1)) # n/10 low-variance draws per component
   return(list("thedensity" = dens, "draws" = rBartSimpson))
}

x <- seq(-5,5,by=.01)

plot(x,BartSimpson(x)$thedensity,type="l",lwd=4,col="yellow2",xlim=c(-4,4),ylim=c(0,0.6))
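The same mixture draw can also be written inline in base R to check that the draws cluster tightly around the ten random means — a sketch, with an arbitrary seed and bin count:

```r
set.seed(7)
means <- sort(rnorm(10))  # 10 component means, themselves Gaussian
# 10 low-variance draws around each mean, flattened into one vector of 100 draws
draws <- as.vector(sapply(means, function(m) rnorm(10, mean = m, sd = 0.1)))

hist(draws, breaks = 40)  # the histogram should show up to 10 narrow spikes
```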

With the help of these two answers, I ended up using this:

mixed_dists = function(n, n_means, var=0.2) {
    means = rnorm(n_means, mean=1, sd=2)  # component means, themselves Gaussian
    values <- NULL
    class <- NULL
    for (i in 1:n_means) {
        temp <- rnorm(n/n_means, mean=means[i], sd=sqrt(var))  # use the var argument instead of a hard-coded sd
        class <- c(class, rep(i, n/n_means))
        values <- c(values, temp)
    }
    return(list(values, class))
}

N = 100

#Scenario 1: The training data in each class were generated from bivariate Gaussian distributions 
#with uncorrelated components and different means.
scenario1 = function() {
    var = 0.5
    n_groups = 2
    m = mixed_dists(N, n_groups, var=var)
    x = m[[1]]
    group = m[[2]]
    y = mixed_dists(N, n_groups, var=var)[[1]]
    data = data.frame(x=x, y=y, group=group)
    plot(x=data$x, y=data$y, col=data$group)
    model = lm(y~x, data=data)
    summary(model)
}



#Scenario 2: The training data in each class came from a mixture of 10 
#low-variance Gaussian distributions, with individual means themselves
#distributed as Gaussian.
scenario2 = function() {
    var = 0.2  # low variance
    n_groups = 10
    m = mixed_dists(N, n_groups, var=var)
    x = m[[1]]
    group = m[[2]]
    y = mixed_dists(N, n_groups, var=var)[[1]]
    data = data.frame(x=x, y=y, group=group)
    plot(x=data$x, y=data$y, col=data$group)
    model = lm(y~x, data=data)
    summary(model)
}
# scenario1()
# scenario2()

So basically the data in scenario 1 has completely separated classes, while the data in scenario 2 has around 10 clusters and cannot be cleanly separated with a straight line. Indeed, running the linear model on both scenarios shows that on average it fits scenario 1 better than scenario 2.
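To make that comparison concrete, one could also pit a linear classifier against kNN on scenario-1-style data. A sketch using glm for a linear decision boundary and class::knn (both ship with R) — the seed, sample sizes, k = 15, and the class means are arbitrary choices, not from the book:

```r
library(class)  # knn(); ships with R
library(MASS)   # mvrnorm(); ships with R

set.seed(123)
n <- 100
# Scenario 1: two bivariate Gaussians with uncorrelated components, different means
X <- rbind(mvrnorm(n, c(-1, 3), diag(2)),
           mvrnorm(n, c(2, 0), diag(2)))
y <- factor(rep(0:1, each = n))

idx <- sample(2 * n, n)  # random half for training, the rest for testing
df  <- data.frame(x1 = X[, 1], x2 = X[, 2], y = y)
tr  <- df[idx, ]
te  <- df[-idx, ]

# linear decision boundary via logistic regression
fit <- glm(y ~ x1 + x2, data = tr, family = binomial)
p   <- predict(fit, newdata = te, type = "response")
err_lin <- mean((p > 0.5) != (te$y == 1))

# 15-nearest-neighbour classifier on the same split
pred_knn <- knn(tr[, c("x1", "x2")], te[, c("x1", "x2")], tr$y, k = 15)
err_knn  <- mean(pred_knn != te$y)

c(linear = err_lin, knn = err_knn)
```

With two well-separated Gaussian classes both error rates should be low; repeating this on scenario-2-style data (10 small clusters per class) is where the linear boundary would be expected to fall behind kNN.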