How to sample from a distribution generated by custom data in R?

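My problem: I have a dataset of continuous values, but I need to generate more "artificial" data points so that I have enough power to run some analyses.

My proposed solution: sample from the density distribution of the dataset by dividing it into equal-width bins, then drawing random numbers from within each bin's range in proportion to the bin's height. Like this:

My attempt: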


data <- c(-0.33823898, 0.14126138, -0.38847235, -0.44757043,  0.26778500, 0.15952806, -0.13811138, -0.46637437, -0.87334526,  0.14636530, -1.38293191,  0.06604563, 1.05781892, -0.36053508, -0.47711948,  0.09640056, -0.07901330,  0.12299470,
0.12782999, 0.26214382,  0.27154579,  0.05879269,  0.54823227, -0.33394094,  0.30781052, 0.93317569, 1.60367031, 1.49084669, -0.95366493, 1.07823462, 0.56246953, 0.03972012, 0.45448122,  1.16204645, -0.60982154, 0.58342249,
-0.75434321,  1.18192489, -0.14993100,  0.39269686, -1.38293191, -0.41049982, -0.29606444, -0.34978997,  0.23044576, 0.21379084, -0.02313875,  0.51465381,  0.86655603,  0.45931651, -0.32077818, -0.39975471, -1.38293191, -0.48625282,
-0.54696267,  0.15630452,  0.55118717, -0.24395068,  0.12675548,  0.03972012,  0.21647712, -0.97596102, -1.38293191,  0.14448491, -0.45724103, -0.07364074,  0.03273580,  0.62210487,  0.09989272, -0.37504097,  0.55817149, -0.40888805,
-0.61089605,  0.56085777, 0.27073990,  0.65326568, -1.38293191,  0.71451278, -0.46368809,  0.58503425,  0.29491639, -1.38293191,  1.18568568,  2.71525152,  0.15254374,  1.50965063,  0.60061466,  0.35777526,  0.08216329,  0.57321464,
-0.46315084,  0.57751268,  2.68301621,  0.44857141,  1.10294837,  2.08934910,  0.56461855,  1.19857981,  1.44303097,  1.21201119,  1.54672124,  1.04707381,  1.14431702,  1.06050520,  0.10795155,  1.24639553,  2.52774942,  2.30640024,
0.91544626,  1.39682701, -0.63507253,  3.35136180,  0.71182651,  1.01913654,  0.76662655,  1.84812147,  0.69893238,  0.82196384,  0.63392449,  0.68227746,  0.46361456, -0.06504466,  0.37604194,  1.05029735,  1.93354506,  2.30371396,
2.45307094,  0.95090511,  1.46129765, -0.57060190,  0.77629714,  1.08360717,  1.68210958,  0.17242218,  3.41583244,  2.37087088,  2.54064355,  0.83109718,  1.31086617,  2.59222006,  2.36818460,  1.38500740,  1.90130974,  1.47419178,
-0.66462158,  0.86010897,  1.41025840,  0.72310887,  1.13894447,  1.08414443)

cdf <- ecdf(data)  # generate the ECDF of the data distribution
y <- runif(500)  # generate dummy values
new_data <- cdf(y)

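As you can see, my attempt did not work well!

(Black is the original data, red is new_data.)

Is there a function/package in R that takes my data vector and automatically generates new data matching the distribution of that vector?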
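A note on the attempt above: ecdf() returns a function that maps a value to a cumulative probability, so cdf(y) yields numbers between 0 and 1 rather than new data values. Going the other way (from uniform probabilities back to data values) is what base R's quantile() does. The following is only a minimal sketch of that idea; the simulated `data` vector is an illustrative stand-in, and the names `wrong`, `resampled` and `interpolated` are made up for this example.

```r
# Illustrative stand-in for the real data vector (assumption, not the original data)
set.seed(42)
data <- c(rnorm(100, mean = 0, sd = 0.5), rnorm(60, mean = 1.5, sd = 0.8))

cdf <- ecdf(data)
u <- runif(500)

# cdf(u) evaluates P(X <= u), so every result is a probability in [0, 1]
wrong <- cdf(u)

# Inverse direction: map uniform probabilities back to data values.
# type = 1 is the inverse of the ECDF, so only observed values are returned.
resampled <- quantile(data, probs = u, type = 1, names = FALSE)

# The default type = 7 interpolates linearly between neighbouring order statistics,
# which gives values between the observed points but never outside their range.
interpolated <- quantile(data, probs = u, names = FALSE)

range(wrong)
range(interpolated)
```

With type = 1 this amounts to resampling the observed values with replacement; the interpolated variant fills in the gaps between observations but cannot produce anything beyond the sample minimum and maximum.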

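Would something like this work? x is the small vector and y holds the additional generated values. In the end, c(x, y) will serve your purpose.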

set.seed(1)
x <- rnorm(1000, 10, 5)

mean(x)
#> [1] 9.941759
sd(x)
#> [1] 5.174579
set.seed(1)
y <- rnorm(10000, mean(x), sd(x))
mean(y)
#> [1] 9.907933
sd(y)
#> [1] 5.238519
library(tidyverse)
ggplot() + geom_histogram(aes(x = y))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot() + geom_histogram(aes(x = x))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

z <- c(x, y)
ggplot() + geom_histogram(aes(x = z))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2021-06-01 by the reprex package (v2.0.0)
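The approach above relies on the data being roughly normal, since rnorm() only reproduces the mean and standard deviation. If that assumption is doubtful, one possible non-parametric alternative (not part of the answer above) is a smoothed bootstrap: resample the observed values and add a little Gaussian noise, which is equivalent to drawing from a Gaussian kernel density estimate. This is a sketch under stated assumptions: the exponential sample is illustrative, `n_new` is a made-up name, and the bandwidth is taken from bw.nrd0(), the default rule used by density().

```r
set.seed(1)
# Illustrative skewed sample (assumption: stands in for real, non-normal data)
x <- rexp(200, rate = 1)

n_new <- 10000
bw <- bw.nrd0(x)  # same bandwidth rule that density() uses by default

# Resample observed values, then jitter each one with Gaussian noise of sd = bw.
# This draws from the Gaussian kernel density estimate of x.
y <- sample(x, n_new, replace = TRUE) + rnorm(n_new, mean = 0, sd = bw)

c(mean(x), mean(y))
c(sd(x), sd(y))
```

Two caveats: the added noise slightly inflates the spread of y relative to x, and the kernel can place values outside the natural range of the data (here, a few negative values even though x is non-negative).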

Here is one approach.

data <- c(-0.33823898, 0.14126138, -0.38847235, -0.44757043,  0.26778500, 0.15952806, -0.13811138, -0.46637437, -0.87334526,  0.14636530, -1.38293191,  0.06604563, 1.05781892, -0.36053508, -0.47711948,  0.09640056, -0.07901330,  0.12299470,
          0.12782999, 0.26214382,  0.27154579,  0.05879269,  0.54823227, -0.33394094,  0.30781052, 0.93317569, 1.60367031, 1.49084669, -0.95366493, 1.07823462, 0.56246953, 0.03972012, 0.45448122,  1.16204645, -0.60982154, 0.58342249,
          -0.75434321,  1.18192489, -0.14993100,  0.39269686, -1.38293191, -0.41049982, -0.29606444, -0.34978997,  0.23044576, 0.21379084, -0.02313875,  0.51465381,  0.86655603,  0.45931651, -0.32077818, -0.39975471, -1.38293191, -0.48625282,
          -0.54696267,  0.15630452,  0.55118717, -0.24395068,  0.12675548,  0.03972012,  0.21647712, -0.97596102, -1.38293191,  0.14448491, -0.45724103, -0.07364074,  0.03273580,  0.62210487,  0.09989272, -0.37504097,  0.55817149, -0.40888805,
          -0.61089605,  0.56085777, 0.27073990,  0.65326568, -1.38293191,  0.71451278, -0.46368809,  0.58503425,  0.29491639, -1.38293191,  1.18568568,  2.71525152,  0.15254374,  1.50965063,  0.60061466,  0.35777526,  0.08216329,  0.57321464,
          -0.46315084,  0.57751268,  2.68301621,  0.44857141,  1.10294837,  2.08934910,  0.56461855,  1.19857981,  1.44303097,  1.21201119,  1.54672124,  1.04707381,  1.14431702,  1.06050520,  0.10795155,  1.24639553,  2.52774942,  2.30640024,
          0.91544626,  1.39682701, -0.63507253,  3.35136180,  0.71182651,  1.01913654,  0.76662655,  1.84812147,  0.69893238,  0.82196384,  0.63392449,  0.68227746,  0.46361456, -0.06504466,  0.37604194,  1.05029735,  1.93354506,  2.30371396,
          2.45307094,  0.95090511,  1.46129765, -0.57060190,  0.77629714,  1.08360717,  1.68210958,  0.17242218,  3.41583244,  2.37087088,  2.54064355,  0.83109718,  1.31086617,  2.59222006,  2.36818460,  1.38500740,  1.90130974,  1.47419178,
          -0.66462158,  0.86010897,  1.41025840,  0.72310887,  1.13894447,  1.08414443)

br <- seq(-2, 4, 0.1)  # bin edges for your distribution; use fewer, wider bins for a smoother result
# fraction of the data falling in each bin (plot = FALSE suppresses the base-graphics histogram)
probDist <- hist(data, breaks = br, plot = FALSE)$counts / length(data)
totalPoint <- 1000  # number of points needed
# points to draw per bin; because of rounding, the total may differ slightly from totalPoint
pointsToKeep <- round(totalPoint * probDist)
newData <- vector("list", length(br) - 1)
for (i in seq_len(length(br) - 1)) {
  # draw uniformly within the range of bin i
  newData[[i]] <- runif(pointsToKeep[i], min = br[i], max = br[i + 1])
}
newData <- do.call(c, newData)
library(ggplot2)
ggplot() + geom_density(aes(x = data)) + geom_density(aes(x = newData), color = 'red')

Change these two lines to suit your requirements.

br <- seq(-2,4,0.1) 
totalPoint <- 1000
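Because the per-bin counts come from round(totalPoint*probDist), the number of generated points may not be exactly totalPoint. A possible variant, shown here only as a sketch, picks a bin for every new point with sample() and then draws uniformly inside it, so the output length is always totalPoint. The simulated `data` vector and the helper names `counts` and `bin` are assumptions for this example, and the bin edges are derived from the data range rather than hard-coded.

```r
set.seed(123)
# Illustrative stand-in for the data vector (assumption, not the original data)
data <- rnorm(150, mean = 0.5, sd = 1)

# Bin edges covering the whole data range, 0.1 wide
br <- seq(floor(min(data)), ceiling(max(data)), by = 0.1)
counts <- hist(data, breaks = br, plot = FALSE)$counts
totalPoint <- 1000

# Choose a bin for each new point with probability proportional to its count
# (sample() normalises the weights itself; empty bins are never chosen).
bin <- sample(seq_along(counts), size = totalPoint, replace = TRUE, prob = counts)

# Draw uniformly inside each chosen bin; runif() is vectorised over min and max.
newData <- runif(totalPoint, min = br[bin], max = br[bin + 1])

length(newData)
```

The trade-off is that the per-bin counts are now random rather than fixed, so two runs with different seeds give slightly different histograms even for the same br and totalPoint.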