如何在 R 的这两种情况下为高斯分布生成数据?
How to generate data for gaussian distributions in these 2 scenarios in R?
在 Tibshirani 的 "Elements of Statistical Learning" 中,在比较最少 squares/linear 模型和 knn 时,说明了这两种情况:
Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.
Scenario 2: The training data in each class came from a mixture of 10
low- variance Gaussian distributions, with individual means themselves
distributed as Gaussian.
想法是第一个更适合至少 squares/linear 模型,第二个更适合 knn 类模型(那些与我理解的方差更大的模型,因为 knn 考虑了最近的点而不是所有点).
在 R 中,我将如何为这两种情况模拟数据?
最终目标是能够重现这两种情况,以证明线性模型比第二种情况更能有效地解释第一种情况。
谢谢!
在下面的代码中,我首先创建了 类 的 10 种不同的均值,然后使用这些均值从这些均值中抽取随机值。这两种情况的代码是相同的,但是您必须调整 类 内部和之间的方差以获得您想要的结果。
场景 1:
此处您要生成 10 个 类 具有不同均值(我假设均值服从双变量高斯分布)。 类 之间的差异远小于 类.
内的差异
library(MASS)
n <- 20
# subjects per class
classes <- 10
# number of classes
mean <- 100
# mean value for all classes
var.between <- 25
# variation between classes
var.within <- 225
# variation within classes
covmatrix1 <- matrix(c(var.between,0,0,var.between), nrow=2)
# covariance matrix for the classes
means <- mvrnorm(classes, c(100,100), Sigma=covmatrix1)
# creates the means for the two variables for each class using variance between classes
covmatrix2 <- matrix(c(var.within,0,0,var.within), nrow=2)
# creates a covariance matrix for the subjects
class <- NULL
values <- NULL
for (i in 1:10) {
temp <- mvrnorm(n, c(means[i], means[i+classes]), Sigma=covmatrix2)
class <- c(class, rep(i, n))
values <- c(values, temp)
}
# this loop uses generates data for each class based on the class means and variance within classes
valuematrix <- matrix(values, nrow=(n*classes))
data <- data.frame (class, valuematrix)
plot(data$X1, data$X2)
或者,如果您不关心指定 类 之间的方差,并且您不希望 类 内有任何相关性,您可以这样做:
covmatrix <- matrix(c(225, 0, 0, 225), nrow=2)
# specifies that the variance in both groups is 225 and no covariance
values <- matrix(mvrnorm(200, c(100,100), Sigma=covmatrix), nrow=200)
# creates a matrix of 200 individuals with two values each.
场景 2:
这里唯一的区别是类之间的变化大于类内的变化。尝试将变量 var.between 的值交换为 500 左右,将变量 var.within 的值交换为 25,您将在散点图中看到清晰的聚类:
n <- 20
# subjects per class
classes <- 10
# number of classes
mean <- 100
# mean value for all classes
var.between <- 500
# variation between classes
var.within <- 25
# variation within classes
covmatrix1 <- matrix(c(var.between,0,0,var.between), nrow=2)
# covariance matrix for the classes
means <- mvrnorm(classes, c(100,100), Sigma=covmatrix1)
# creates the means for the two variables for each class using variance between classes
covmatrix2 <- matrix(c(var.within,0,0,var.within), nrow=2)
# creates a covariance matrix for the subjects
class <- NULL
values <- NULL
for (i in 1:10) {
temp <- mvrnorm(n, c(means[i], means[i+classes]), Sigma=covmatrix2)
class <- c(class, rep(i, n))
values <- c(values, temp)
}
# this loop uses generates data for each class based on the class means and variance within classes
valuematrix <- matrix(values, nrow=(n*classes))
data <- data.frame (class, valuematrix)
plot(data$X1, data$X2)
该图应确认数据是聚类的。
希望对您有所帮助!
这可能是场景 1
library(mvtnorm)
N1 = 50
N2 = 50
K = 2
mu1 = c(-1,3)
mu2 = c(2,0)
cov1 = 0
v11 = 2
v12 = 2
Sigma1 = matrix(c(v11,cov1,cov1,v12),nrow=2)
cov2 = 0
v21 = 2
v22 = 2
Sigma2 = matrix(c(v21,cov2,cov2,v22),nrow=2)
x1 = rmvnorm(N1,mu1,Sigma1)
x2 = rmvnorm(N2,mu2,Sigma2)
这可能是从高斯混合进行模拟的候选对象:
BartSimpson <- function(x,n = 100){
means <- as.matrix(sort(rnorm(10)))
dens <- .1*rowSums(apply(means,1,dnorm,x=x,sd=.1))
rBartSimpson <- c(apply(means,1,rnorm,n=n/10,sd=.1))
return(list("thedensity" = dens,"draws" = rBartSimpson))
}
x <- seq(-5,5,by=.01)
plot(x,BartSimpson(x)$thedensity,type="l",lwd=4,col="yellow2",xlim=c(-4,4),ylim=c(0,0.6))
在这两个答案的帮助下,我最终使用了这个:
mixed_dists = function(n, n_means, var=0.2) {
means = rnorm(n_means, mean=1, sd=2)
values <- NULL
class <- NULL
for (i in 1:n_means) {
temp <- rnorm(n/n_means, mean=means[i], sd=0.2)
class <- c(class, rep(i, n/n_means))
values <- c(values, temp)
}
return(list(values, class));
}
N = 100
#Scenario 1: The training data in each class were generated from bivariate Gaussian distributions
#with uncorrelated components and different means.
scenario1 = function () {
var = 0.5
n_groups = 2
m = mixed_dists(N, n_groups, var=var)
x = m[[1]]
group = m[[2]]
y = mixed_dists(N, n_groups, var=var)[[1]]
data = matrix(c(x,y, group), nrow=N, ncol=3)
colnames(data) = c("x", "y", "group")
data = data.frame(data)
plot(x=data$x,y=data$y, col=data$group)
model = lm(y~x, data=data)
summary(model)
}
#Scenario 2: The training data in each class came from a mixture of 10
#low-variance Gaussian distributions, with individual means themselves
#distributed as Gaussian.
scenario2 = function () {
var = 0.2 # low variance
n_groups = 10
m = mixed_dists(N, n_groups, var=var)
x = m[[1]]
group = m[[2]]
y = mixed_dists(N, n_groups, var=var)[[1]]
data = matrix(c(x,y, group), nrow=N, ncol=3)
colnames(data) = c("x", "y", "group")
data = data.frame(data)
plot(x=data$x,y=data$y, col=data$group)
model = lm(y~x, data=data)
summary(model)
}
# scenario1()
# scenario2()
所以基本上场景 1 中的数据在场景 2 中完全分离 类 并且场景 2 中的数据有大约 10 个簇,并且不能使用直线干净地分离。事实上,运行 两种场景的线性模型可以看出,平均而言,它更适用于场景 1 而不是场景 2。
在 Tibshirani 的 "Elements of Statistical Learning" 中,在比较最少 squares/linear 模型和 knn 时,说明了这两种情况:
Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.
Scenario 2: The training data in each class came from a mixture of 10 low- variance Gaussian distributions, with individual means themselves distributed as Gaussian.
想法是第一个更适合至少 squares/linear 模型,第二个更适合 knn 类模型(那些与我理解的方差更大的模型,因为 knn 考虑了最近的点而不是所有点).
在 R 中,我将如何为这两种情况模拟数据?
最终目标是能够重现这两种情况,以证明线性模型比第二种情况更能有效地解释第一种情况。
谢谢!
在下面的代码中,我首先创建了 类 的 10 种不同的均值,然后使用这些均值从这些均值中抽取随机值。这两种情况的代码是相同的,但是您必须调整 类 内部和之间的方差以获得您想要的结果。
场景 1:
此处您要生成 10 个 类 具有不同均值(我假设均值服从双变量高斯分布)。 类 之间的差异远小于 类.
内的差异library(MASS)
n <- 20
# subjects per class
classes <- 10
# number of classes
mean <- 100
# mean value for all classes
var.between <- 25
# variation between classes
var.within <- 225
# variation within classes
covmatrix1 <- matrix(c(var.between,0,0,var.between), nrow=2)
# covariance matrix for the classes
means <- mvrnorm(classes, c(100,100), Sigma=covmatrix1)
# creates the means for the two variables for each class using variance between classes
covmatrix2 <- matrix(c(var.within,0,0,var.within), nrow=2)
# creates a covariance matrix for the subjects
class <- NULL
values <- NULL
for (i in 1:10) {
temp <- mvrnorm(n, c(means[i], means[i+classes]), Sigma=covmatrix2)
class <- c(class, rep(i, n))
values <- c(values, temp)
}
# this loop uses generates data for each class based on the class means and variance within classes
valuematrix <- matrix(values, nrow=(n*classes))
data <- data.frame (class, valuematrix)
plot(data$X1, data$X2)
或者,如果您不关心指定 类 之间的方差,并且您不希望 类 内有任何相关性,您可以这样做:
covmatrix <- matrix(c(225, 0, 0, 225), nrow=2)
# specifies that the variance in both groups is 225 and no covariance
values <- matrix(mvrnorm(200, c(100,100), Sigma=covmatrix), nrow=200)
# creates a matrix of 200 individuals with two values each.
场景 2:
这里唯一的区别是类之间的变化大于类内的变化。尝试将变量 var.between 的值交换为 500 左右,将变量 var.within 的值交换为 25,您将在散点图中看到清晰的聚类:
n <- 20
# subjects per class
classes <- 10
# number of classes
mean <- 100
# mean value for all classes
var.between <- 500
# variation between classes
var.within <- 25
# variation within classes
covmatrix1 <- matrix(c(var.between,0,0,var.between), nrow=2)
# covariance matrix for the classes
means <- mvrnorm(classes, c(100,100), Sigma=covmatrix1)
# creates the means for the two variables for each class using variance between classes
covmatrix2 <- matrix(c(var.within,0,0,var.within), nrow=2)
# creates a covariance matrix for the subjects
class <- NULL
values <- NULL
for (i in 1:10) {
temp <- mvrnorm(n, c(means[i], means[i+classes]), Sigma=covmatrix2)
class <- c(class, rep(i, n))
values <- c(values, temp)
}
# this loop uses generates data for each class based on the class means and variance within classes
valuematrix <- matrix(values, nrow=(n*classes))
data <- data.frame (class, valuematrix)
plot(data$X1, data$X2)
该图应确认数据是聚类的。
希望对您有所帮助!
这可能是场景 1
library(mvtnorm)
N1 = 50
N2 = 50
K = 2
mu1 = c(-1,3)
mu2 = c(2,0)
cov1 = 0
v11 = 2
v12 = 2
Sigma1 = matrix(c(v11,cov1,cov1,v12),nrow=2)
cov2 = 0
v21 = 2
v22 = 2
Sigma2 = matrix(c(v21,cov2,cov2,v22),nrow=2)
x1 = rmvnorm(N1,mu1,Sigma1)
x2 = rmvnorm(N2,mu2,Sigma2)
这可能是从高斯混合进行模拟的候选对象:
BartSimpson <- function(x,n = 100){
means <- as.matrix(sort(rnorm(10)))
dens <- .1*rowSums(apply(means,1,dnorm,x=x,sd=.1))
rBartSimpson <- c(apply(means,1,rnorm,n=n/10,sd=.1))
return(list("thedensity" = dens,"draws" = rBartSimpson))
}
x <- seq(-5,5,by=.01)
plot(x,BartSimpson(x)$thedensity,type="l",lwd=4,col="yellow2",xlim=c(-4,4),ylim=c(0,0.6))
在这两个答案的帮助下,我最终使用了这个:
mixed_dists = function(n, n_means, var=0.2) {
means = rnorm(n_means, mean=1, sd=2)
values <- NULL
class <- NULL
for (i in 1:n_means) {
temp <- rnorm(n/n_means, mean=means[i], sd=0.2)
class <- c(class, rep(i, n/n_means))
values <- c(values, temp)
}
return(list(values, class));
}
N = 100
#Scenario 1: The training data in each class were generated from bivariate Gaussian distributions
#with uncorrelated components and different means.
scenario1 = function () {
var = 0.5
n_groups = 2
m = mixed_dists(N, n_groups, var=var)
x = m[[1]]
group = m[[2]]
y = mixed_dists(N, n_groups, var=var)[[1]]
data = matrix(c(x,y, group), nrow=N, ncol=3)
colnames(data) = c("x", "y", "group")
data = data.frame(data)
plot(x=data$x,y=data$y, col=data$group)
model = lm(y~x, data=data)
summary(model)
}
#Scenario 2: The training data in each class came from a mixture of 10
#low-variance Gaussian distributions, with individual means themselves
#distributed as Gaussian.
scenario2 = function () {
var = 0.2 # low variance
n_groups = 10
m = mixed_dists(N, n_groups, var=var)
x = m[[1]]
group = m[[2]]
y = mixed_dists(N, n_groups, var=var)[[1]]
data = matrix(c(x,y, group), nrow=N, ncol=3)
colnames(data) = c("x", "y", "group")
data = data.frame(data)
plot(x=data$x,y=data$y, col=data$group)
model = lm(y~x, data=data)
summary(model)
}
# scenario1()
# scenario2()
所以基本上场景 1 中的数据在场景 2 中完全分离 类 并且场景 2 中的数据有大约 10 个簇,并且不能使用直线干净地分离。事实上,运行 两种场景的线性模型可以看出,平均而言,它更适用于场景 1 而不是场景 2。