Random sampling to give an exact sum
I want to sample 140 numbers between 1000 and 100000 such that the sum of these 140 numbers is around 2 million (2000000):
sample(1000:100000,140)
such that:
sum(sample(1000:100000,140)) = 2000000
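Is there any way I can achieve this?
Here are a few hacky approaches that get close to 2 million. Hopefully someone will post a smarter solution.
In this option, we use the prob argument to make smaller values more likely, choosing the exponent by trial and error. This method is heavily biased toward the lower values in the range specified in the OP.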
x1 = sample(1000:100000,140, prob=(1e5:1e3)^5.5)
mean(replicate(100, sum(sample(1000:100000,140, prob=(1e5:1e3)^5.5))))
[1] 2015620
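In this option, we sample from a truncated normal (truncated at your given bounds). We initially set the mean to 2e6/140 = 14285.71. However, if the standard deviation is large enough that a substantial share of the values would fall near the lower bound, truncation biases the mean upward, so we add a correction chosen by trial and error.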
library(truncnorm)
x2 = rtruncnorm(140, 1e3, 1e5, mean=0.82*2e6/140, sd=1e4)
mean(replicate(1000, sum(rtruncnorm(140, 1e3, 1e5, mean=0.82*2e6/140, sd=1e4))))
[1] 2008050
If you set a smaller standard deviation, no correction is needed. However, you then get fewer values far from the mean.
mean(replicate(1000, sum(rtruncnorm(140, 1e3, 1e5, mean=2e6/140, sd=0.5e4))))
[1] 2008494
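The upward bias from truncation can also be checked without simulation. The sketch below uses the standard closed-form mean of a doubly truncated normal and needs only base R; the helper name trunc_mean is made up for this illustration and is not part of the truncnorm package.

```r
# Mean of a normal(mu, sd) truncated to [lower, upper]:
#   mu + sd * (dnorm(alpha) - dnorm(beta)) / (pnorm(beta) - pnorm(alpha))
# with alpha = (lower - mu) / sd and beta = (upper - mu) / sd.
# trunc_mean is a hypothetical helper name used only for this sketch.
trunc_mean <- function(mu, sd, lower, upper) {
  alpha <- (lower - mu) / sd
  beta  <- (upper - mu) / sd
  mu + sd * (dnorm(alpha) - dnorm(beta)) / (pnorm(beta) - pnorm(alpha))
}

# Expected sum of 140 draws for the settings used in this answer
140 * trunc_mean(0.82 * 2e6 / 140, 1e4,   1e3, 1e5)  # corrected mean, sd = 1e4
140 * trunc_mean(2e6 / 140,        1e4,   1e3, 1e5)  # uncorrected mean, sd = 1e4
140 * trunc_mean(2e6 / 140,        0.5e4, 1e3, 1e5)  # uncorrected mean, sd = 0.5e4
```

The uncorrected call with sd = 1e4 lands well above 2 million, which is the bias the 0.82 factor compensates for, while the other two stay close to the target.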
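In either case, the exponent for the sample method or the correction for the truncated normal could be chosen by an automated search driven by the difference between the mean sum and 2 million.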
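As a rough sketch of such an automated search for the sample exponent, the code below uses uniroot from base R. It assumes the expected sum can be approximated by treating the 140 draws as independent, which is not exactly what sample does when drawing without replacement; the names mean_sum and fit are illustrative only.

```r
values  <- 1000:100000
n_draws <- 140
target  <- 2e6

# Approximate expected sum of n_draws weighted draws, treating the draws as
# independent (an assumption: sample() above draws without replacement).
mean_sum <- function(expo) {
  w <- (1e5:1e3)^expo
  n_draws * sum(values * w) / sum(w)
}

# Find the exponent at which the approximate expected sum hits the target
fit <- uniroot(function(expo) mean_sum(expo) - target,
               interval = c(0, 20), tol = 1e-8)
fit$root            # exponent suggested by the search
mean_sum(fit$root)  # approximate expected sum at that exponent
```

The exponent found this way can then be passed to sample(1000:100000, 140, prob = (1e5:1e3)^fit$root). Because of the independence approximation, the simulated mean may still differ slightly from 2 million.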
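[Figure: some typical distributions of the output]
Here is an attempt that plays with changing the upper bound. The idea is to lower the upper limit as the running sum gets higher.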
sup <- 100000
tir <- numeric(140)
for (i in 1:140) {
  tir[i] <- sample(1000:sup, 1)
  # cap the bound at twice the average still needed per remaining draw
  sup <- max(1001, min(sup, abs(2000000 - sum(tir)) / (140 - i) * 2))
}
sum(tir)
[1] 2001751
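Here is a hit-and-miss approach. The basic idea is that finding 140 numbers that sum to 2000000 is equivalent to splitting 1:2000000 into 140 pieces, which requires 139 cut points. Also note that the minimum value of 1000 is a bit annoying. Just subtract it from all the problem data and add it back at the end: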
rand.nums <- function(a, b, n, k) {
  # finds n random integers in range a:b which sum to k
  while (TRUE) {
    x <- sample(1:(k - n*a), n - 1, replace = TRUE)  # cut points
    x <- sort(x)
    x <- c(x, k - n*a) - c(0, x)  # gaps between consecutive cut points
    if (max(x) <= b - a) return(a + x)  # accept only if no value exceeds b
  }
}
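Then rand.nums(1000,100000,140,2000000) evaluates to 140 integers in the given range that sum to 2000000. For these parameter choices, the function returns almost immediately. For other choices of parameters, a solution might be impossible, or so tightly constrained that finding one by chance is practically impossible. Some caution is therefore needed when using the function. It could be modified by adding a maxtrials parameter and returning NA if maxtrials is exceeded without finding a solution.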
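A minimal sketch of that modification is shown below. The function name rand_nums_limited, the default of 1000 trials, and the up-front feasibility check are assumptions made for this illustration, not part of the original answer.

```r
# Variant of rand.nums that gives up after maxtrials attempts.
# rand_nums_limited and its default maxtrials are hypothetical choices.
rand_nums_limited <- function(a, b, n, k, maxtrials = 1000) {
  if (k < n * a || k > n * b) return(NA)  # no solution can exist
  if (k == n * a) return(rep(a, n))       # only one solution: all values equal a
  for (trial in seq_len(maxtrials)) {
    # n - 1 random cut points in 1:(k - n*a), as in rand.nums
    x <- sort(sample.int(k - n * a, n - 1, replace = TRUE))
    x <- c(x, k - n * a) - c(0, x)        # gaps between consecutive cut points
    if (max(x) <= b - a) return(a + x)    # accept only if no value exceeds b
  }
  NA                                      # maxtrials exceeded without success
}

v <- rand_nums_limited(1000, 100000, 140, 2000000)
sum(v)
range(v)
```

Returning NA early when k lies outside n*a to n*b avoids spending all trials on a problem that has no solution.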
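There is an algorithm for generating random numbers of this kind.
It was originally created for MATLAB, and it has an R implementation (RandVec in the Surrogate package, used in the example at the end).
Quoting from the comments in the MATLAB script: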
% This generates an n by m array x, each of whose m columns
% contains n random values lying in the interval [a,b], but
% subject to the condition that their sum be equal to s. The
% scalar value s must accordingly satisfy n*a <= s <= n*b. The
% distribution of values is uniform in the sense that it has the
% conditional probability distribution of a uniform distribution
% over the whole n-cube, given that the sum of the x's is s.
%
% The scalar v, if requested, returns with the total
% n-1 dimensional volume (content) of the subset satisfying
% this condition. Consequently if v, considered as a function
% of s and divided by sqrt(n), is integrated with respect to s
% from s = a to s = b, the result would necessarily be the
% n-dimensional volume of the whole cube, namely (b-a)^n.
%
% This algorithm does no "rejecting" on the sets of x's it
% obtains. It is designed to generate only those that satisfy all
% the above conditions and to do so with a uniform distribution.
% It accomplishes this by decomposing the space of all possible x
% sets (columns) into n-1 dimensional simplexes. (Line segments,
% triangles, and tetrahedra, are one-, two-, and three-dimensional
% examples of simplexes, respectively.) It makes use of three
% different sets of 'rand' variables, one to locate values
% uniformly within each type of simplex, another to randomly
% select representatives of each different type of simplex in
% proportion to their volume, and a third to perform random
% permutations to provide an even distribution of simplex choices
% among like types. For example, with n equal to 3 and s set at,
% say, 40% of the way from a towards b, there will be 2 different
% types of simplex, in this case triangles, each with its own
% area, and 6 different versions of each from permutations, for
% a total of 12 triangles, and these all fit together to form a
% particular planar non-regular hexagon in 3 dimensions, with v
% returned set equal to the hexagon's area.
%
% Roger Stafford - Jan. 19, 2006
Example:
test <- Surrogate::RandVec(a=1000, b=100000, s=2000000, n=140, m=1, Seed=sample(1:1000, size = 1))
sum(test$RandVecOutput)
# 2000000
hist(test$RandVecOutput)