How to sample from a distribution generated by custom data in R?
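My problem: I have a dataset of continuous values, but I need to generate more "artificial" data points so that I have enough power to run some analyses.
My proposed solution: sample from the density distribution of the dataset by dividing it into equal-width bins and then drawing random numbers from within each bin's range in proportion to the bin's height. Like this:
My attempt: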
data <- c(-0.33823898, 0.14126138, -0.38847235, -0.44757043, 0.26778500, 0.15952806, -0.13811138, -0.46637437, -0.87334526, 0.14636530, -1.38293191, 0.06604563, 1.05781892, -0.36053508, -0.47711948, 0.09640056, -0.07901330, 0.12299470,
0.12782999, 0.26214382, 0.27154579, 0.05879269, 0.54823227, -0.33394094, 0.30781052, 0.93317569, 1.60367031, 1.49084669, -0.95366493, 1.07823462, 0.56246953, 0.03972012, 0.45448122, 1.16204645, -0.60982154, 0.58342249,
-0.75434321, 1.18192489, -0.14993100, 0.39269686, -1.38293191, -0.41049982, -0.29606444, -0.34978997, 0.23044576, 0.21379084, -0.02313875, 0.51465381, 0.86655603, 0.45931651, -0.32077818, -0.39975471, -1.38293191, -0.48625282,
-0.54696267, 0.15630452, 0.55118717, -0.24395068, 0.12675548, 0.03972012, 0.21647712, -0.97596102, -1.38293191, 0.14448491, -0.45724103, -0.07364074, 0.03273580, 0.62210487, 0.09989272, -0.37504097, 0.55817149, -0.40888805,
-0.61089605, 0.56085777, 0.27073990, 0.65326568, -1.38293191, 0.71451278, -0.46368809, 0.58503425, 0.29491639, -1.38293191, 1.18568568, 2.71525152, 0.15254374, 1.50965063, 0.60061466, 0.35777526, 0.08216329, 0.57321464,
-0.46315084, 0.57751268, 2.68301621, 0.44857141, 1.10294837, 2.08934910, 0.56461855, 1.19857981, 1.44303097, 1.21201119, 1.54672124, 1.04707381, 1.14431702, 1.06050520, 0.10795155, 1.24639553, 2.52774942, 2.30640024,
0.91544626, 1.39682701, -0.63507253, 3.35136180, 0.71182651, 1.01913654, 0.76662655, 1.84812147, 0.69893238, 0.82196384, 0.63392449, 0.68227746, 0.46361456, -0.06504466, 0.37604194, 1.05029735, 1.93354506, 2.30371396,
2.45307094, 0.95090511, 1.46129765, -0.57060190, 0.77629714, 1.08360717, 1.68210958, 0.17242218, 3.41583244, 2.37087088, 2.54064355, 0.83109718, 1.31086617, 2.59222006, 2.36818460, 1.38500740, 1.90130974, 1.47419178,
-0.66462158, 0.86010897, 1.41025840, 0.72310887, 1.13894447, 1.08414443)
cdf <- ecdf(data) #generate the ecdf of the data distribution
y <- runif(500) # generate dummy values
new_data <- cdf(y)
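As you can see, my attempt does not work well!
(Black is the original data, red is new_data.)
Is there a function/package in R that takes my data vector and automatically generates new data matching the distribution of that vector?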
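A note on why the attempt above likely goes wrong: ecdf(data) returns a function that maps data values to cumulative probabilities, so calling cdf(y) on uniform random numbers gives back probabilities rather than new data values. Inverse-transform sampling needs the opposite mapping, from probabilities to values, which base R provides as quantile(). Below is a minimal sketch of that idea. The vector obs is a short hypothetical stand-in for the full data vector, and type = 7 (the quantile() default, which interpolates linearly between observed values) is just one possible choice.

```r
set.seed(42)

# Hypothetical stand-in for the question's `data` vector
obs <- c(-1.38, -0.45, -0.14, 0.06, 0.15, 0.27, 0.55, 0.93, 1.18, 1.60, 2.31, 3.35)

# Uniform random probabilities in [0, 1]
u <- runif(500)

# Map each probability back to a data value via the empirical quantile function
new_obs <- unname(quantile(obs, probs = u, type = 7))

# The generated values never leave the observed range
range(new_obs)
```

Because the values are interpolated between observed points, this cannot produce anything below min(obs) or above max(obs), which may or may not be what you want in the tails.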
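Would something like this work? x is the small vector and y is the additionally generated values. In the end, c(x, y) will serve your purpose.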
set.seed(1)
x <- rnorm(1000, 10, 5)
mean(x)
#> [1] 9.941759
sd(x)
#> [1] 5.174579
set.seed(1)
y <- rnorm(10000, mean(x), sd(x))
mean(y)
#> [1] 9.907933
sd(y)
#> [1] 5.238519
library(tidyverse)
ggplot() + geom_histogram(aes(x = y))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot() + geom_histogram(aes(x = x))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
z <- c(x, y)
ggplot() + geom_histogram(aes(x = z))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Created on 2021-06-01 by the reprex package (v2.0.0)
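The rnorm() approach above reproduces only the mean and standard deviation of x, so it fits best when the data are roughly normal. If the shape matters (skew, several modes), one alternative is a smoothed bootstrap: resample the observed values with replacement and add Gaussian noise whose standard deviation is the bandwidth chosen by density(). That amounts to sampling from the Gaussian kernel density estimate. This is a sketch and not part of the answer above; obs and n_new are hypothetical names, and obs is a short stand-in vector.

```r
set.seed(1)

# Hypothetical stand-in for the small observed vector
obs <- c(-1.38, -0.45, -0.14, 0.06, 0.15, 0.27, 0.55, 0.93, 1.18, 1.60, 2.31, 3.35)

# Bandwidth picked by density() with its defaults (Gaussian kernel)
bw <- density(obs)$bw

n_new <- 1000

# Resample observed values, then jitter each one with kernel-sized Gaussian noise
new_obs <- sample(obs, n_new, replace = TRUE) + rnorm(n_new, mean = 0, sd = bw)

summary(new_obs)
```

Unlike plain resampling, this produces values that were not in the original vector, and it can extend slightly beyond the observed minimum and maximum.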
Here is one approach.
data <- c(-0.33823898, 0.14126138, -0.38847235, -0.44757043, 0.26778500, 0.15952806, -0.13811138, -0.46637437, -0.87334526, 0.14636530, -1.38293191, 0.06604563, 1.05781892, -0.36053508, -0.47711948, 0.09640056, -0.07901330, 0.12299470,
0.12782999, 0.26214382, 0.27154579, 0.05879269, 0.54823227, -0.33394094, 0.30781052, 0.93317569, 1.60367031, 1.49084669, -0.95366493, 1.07823462, 0.56246953, 0.03972012, 0.45448122, 1.16204645, -0.60982154, 0.58342249,
-0.75434321, 1.18192489, -0.14993100, 0.39269686, -1.38293191, -0.41049982, -0.29606444, -0.34978997, 0.23044576, 0.21379084, -0.02313875, 0.51465381, 0.86655603, 0.45931651, -0.32077818, -0.39975471, -1.38293191, -0.48625282,
-0.54696267, 0.15630452, 0.55118717, -0.24395068, 0.12675548, 0.03972012, 0.21647712, -0.97596102, -1.38293191, 0.14448491, -0.45724103, -0.07364074, 0.03273580, 0.62210487, 0.09989272, -0.37504097, 0.55817149, -0.40888805,
-0.61089605, 0.56085777, 0.27073990, 0.65326568, -1.38293191, 0.71451278, -0.46368809, 0.58503425, 0.29491639, -1.38293191, 1.18568568, 2.71525152, 0.15254374, 1.50965063, 0.60061466, 0.35777526, 0.08216329, 0.57321464,
-0.46315084, 0.57751268, 2.68301621, 0.44857141, 1.10294837, 2.08934910, 0.56461855, 1.19857981, 1.44303097, 1.21201119, 1.54672124, 1.04707381, 1.14431702, 1.06050520, 0.10795155, 1.24639553, 2.52774942, 2.30640024,
0.91544626, 1.39682701, -0.63507253, 3.35136180, 0.71182651, 1.01913654, 0.76662655, 1.84812147, 0.69893238, 0.82196384, 0.63392449, 0.68227746, 0.46361456, -0.06504466, 0.37604194, 1.05029735, 1.93354506, 2.30371396,
2.45307094, 0.95090511, 1.46129765, -0.57060190, 0.77629714, 1.08360717, 1.68210958, 0.17242218, 3.41583244, 2.37087088, 2.54064355, 0.83109718, 1.31086617, 2.59222006, 2.36818460, 1.38500740, 1.90130974, 1.47419178,
-0.66462158, 0.86010897, 1.41025840, 0.72310887, 1.13894447, 1.08414443)
br <- seq(-2,4,0.1) # bin edges for the distribution; use fewer (wider) bins for a smoother result
probDist <- hist(data, breaks = br, plot = FALSE)$counts / length(data) # share of the data in each bin
totalPoint <- 1000 # number of points needed
pointsToKeep <- round(totalPoint * probDist) # points to draw from each bin
newData <- vector("list", length(br) - 1)
for (i in seq_along(newData)) {
  # draw uniformly between the lower and upper edge of bin i
  newData[[i]] <- runif(pointsToKeep[i], min = br[i], max = br[i + 1])
}
newData <- unlist(newData)
library(ggplot2)
ggplot() + geom_density(aes(x = data)) + geom_density(aes(x = newData), color='red')
Change these two lines to suit your requirements.
br <- seq(-2,4,0.1)
totalPoint <- 1000
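One thing to be aware of with the binned approach: because the count is rounded separately for each bin, the total number of generated points may not come out exactly equal to totalPoint. A variant that avoids this draws a bin index for every new point with sample(), weighted by the bin counts, and then draws uniformly inside the chosen bin. The sketch below uses hypothetical names (obs, edges, n_new) and a short stand-in vector rather than the full data.

```r
set.seed(123)

# Hypothetical stand-in for the question's `data` vector
obs <- c(-1.38, -0.45, -0.14, 0.06, 0.15, 0.27, 0.55, 0.93, 1.18, 1.60, 2.31, 3.35)

# Bin edges covering the whole data range
edges <- seq(-2, 4, 0.5)
counts <- hist(obs, breaks = edges, plot = FALSE)$counts

n_new <- 1000

# Pick a bin for each new point; sample() rescales `prob`, so raw counts are fine
bin <- sample(length(counts), n_new, replace = TRUE, prob = counts)

# Draw uniformly within each chosen bin (runif is vectorised over min and max)
new_obs <- runif(n_new, min = edges[bin], max = edges[bin + 1])

length(new_obs)
```

Empty bins get weight zero and are never selected, so the generated points only fall where the original histogram has mass.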