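Why do nls and nlsLM work correctly for fitting a Poisson distribution but fail for negative binomial?
I have two artificially generated probability mass distributions that are very similar to each other, except that one is Poisson and the other is negative binomial with a somewhat larger variance than the Poisson. I generate them with the R code sample below and then try to re-estimate the initial input parameter values using either the nls or the nlsLM function: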
library(minpack.lm)
library(ggplot2)
# Number of samples (identical for both distributions)
n <- 10000
# Distribution mean (identical for both distributions)
mn <- 10
# Size variable: relevant to negative binomial only; sets the level of
# over-dispersion relative to the Poisson distribution. Reproduces
# Poisson in the limit that size --> Inf
sz <- 5
# Generate n random samples
psx <- rpois(n, lambda=mn) # Poisson
nbx <- rnbinom(n, size=sz, mu=mn) # negative binomial
# Sort into sample quantiles
psqnt <- unique(sort(psx)) # Poisson
nbqnt <- unique(sort(nbx)) # negative binomial
# Generate empirical cdf functions
pscdf <- ecdf(psx) # Poisson
pscumdist <- pscdf(psqnt)
nbcdf <- ecdf(nbx) # negative binomial
nbcumdist <- nbcdf(nbqnt)
# Place quantiles and cdf into data frame
psdata <- data.frame(q=psqnt, cdf=pscumdist) # Poisson
nbdata <- data.frame(q=nbqnt, cdf=nbcumdist) # negative binomial
# Generate estimated starting values that are modified from true values by
# modest amounts
psstart <- list(lambda=0.8*mn) # Poisson
nbstart <- list(size=0.8*sz, mu=0.8*mn) # negative binomial
# Plot the sample density functions
pldata <- rbind(data.frame(x=psx, type="Poisson"),
                data.frame(x=nbx, type="Negative Binomial"))
pldata$type <- factor(pldata$type, levels=c("Poisson", "Negative Binomial"))
hst <- ggplot(pldata, aes(x)) +
  geom_histogram(binwidth=1) +
  facet_grid(type ~ .) +
  theme_gray(base_size=18)
print(hst)
# Re-estimate the Poisson distribution parameter, lambda, using either
# nls or nlsLM
print("Starting Poisson fit now...")
#psfit <- nls(cdf ~ ppois(q, lambda), data=psdata, start=psstart, trace=TRUE)
psfit <- nlsLM(cdf ~ ppois(q, lambda), data=psdata, start=psstart, trace=TRUE)
print(coef(psfit))
# Re-estimate the two negative binomial distribution parameters, size and mu,
# using the same technique
print("Starting negative binomial fit now...")
#nbfit <- nls(cdf ~ pnbinom(q, size, mu), data=nbdata, start=nbstart, trace=TRUE)
nbfit <- nlsLM(cdf ~ pnbinom(q, size, mu), data=nbdata, start=nbstart, trace=TRUE)
print(coef(nbfit))
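The call to ggplot produces a pair of histograms showing two discrete probability mass distributions that look very similar.
Here are the results of running nlsLM (nls gives very similar results, except that its trace is slightly less informative):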
> source('~/Desktop/nls_error.R')
[1] "Starting Poisson fit now..."
It. 0, RSS = 0.369437, Par. = 8
It. 1, RSS = 0.00130718, Par. = 9.8698
It. 2, RSS = 9.26239e-05, Par. = 9.98602
It. 3, RSS = 9.26083e-05, Par. = 9.9856
It. 4, RSS = 9.26083e-05, Par. = 9.9856
lambda
9.985601
[1] "Starting negative binomial fit now..."
It. 0, RSS = nan, Par. = 4 8
It. 1, RSS = 2.122e-314, Par. = 4 8
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
In addition: Warning messages:
1: In pnbinom(q, size, mu) : NaNs produced
2: In pnbinom(q, size, mu) : NaNs produced
3: In pnbinom(q, size, mu) : NaNs produced
4: In pnbinom(q, size, mu) : NaNs produced
5: In pnbinom(q, size, mu) : NaNs produced
6: In pnbinom(q, size, mu) : NaNs produced
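My question: I deliberately constructed the two examples to be as similar as possible, so why does one succeed while the other fails?
That is because you were tripped up by the default argument order in R. From the help page for pnbinom() we see that pnbinom() has the following syntax: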
pnbinom(q, size, prob, mu, lower.tail = TRUE, log.p = FALSE)
In your call you supply three positional arguments to pnbinom():
nls(cdf ~ pnbinom(q, size, mu), data=nbdata, start=nbstart, trace=TRUE)
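Even though your variable is called mu, it is the third positional argument and is therefore matched to prob in pnbinom(). With a starting value of 8 passed as prob, which must lie in (0, 1], pnbinom() returns NaN, and that is what the warnings report. Because you are using the alternative (mu) parameterization, you need to name the argument so that it is interpreted as mu. The following line works as you expect: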
> nbfit <- nls(cdf ~ pnbinom(q, size, mu=mu), data=nbdata, start=nbstart, trace=TRUE)
0.2185854 : 4 8
0.004568844 : 4.069641 9.972202
0.0001207377 : 4.921435 9.961606
3.952388e-05 : 5.068563 9.966108
3.948957e-05 : 5.071698 9.966222
3.948957e-05 : 5.071696 9.966224
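To see the argument matching in isolation, here is a minimal sketch outside of nls(). The values 4 and 8 mirror the starting values above; the variable names bad, good, and same are only illustrative. It also checks the relation prob = size / (size + mu) given on the pnbinom() help page.

```r
# Quantiles at which to evaluate the CDF
q <- 0:5

# Positional call: the third argument binds to `prob`, not `mu`.
# prob = 8 is outside (0, 1], so every value is NaN (warning suppressed here).
bad <- suppressWarnings(pnbinom(q, 4, 8))
print(bad)

# Named call: 8 is now interpreted as the mean `mu`.
good <- pnbinom(q, size = 4, mu = 8)
print(good)

# The mu form matches the prob form with prob = size / (size + mu).
same <- pnbinom(q, size = 4, prob = 4 / (4 + 8))
print(all.equal(good, same))
```

Evaluating the model function once at the starting values like this, before handing it to nls() or nlsLM(), is a cheap way to catch the NaN problem early.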
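You may still run into problems if nls() tries to push size to a negative value. Exponentiating the parameters inside the model, so that the fit is done on the log scale, makes it more stable: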
> nbfit <- nls(cdf ~ pnbinom(q, exp(size),
      mu=exp(mu)), data=nbdata, start=nbstart2, trace=TRUE)
0.2971457 : 3.688879 2.079442
0.2622977 : 0.4969337 2.1664490
0.00517649 : 1.408688 2.316948
6.196776e-05 : 1.610170 2.298254
3.948972e-05 : 1.623637 2.299200
3.948957e-05 : 1.623675 2.299202
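where nbstart2 is the same as nbstart, except that the starting values have been log-transformed. The fitted coefficients are therefore on the log scale too, so apply exp() to them to recover size and mu.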
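As a rough, self-contained sketch of the log-scale fit: the seed, the parameter names log_size and log_mu, and the starting values below are assumptions chosen for illustration, not the exact setup that produced the trace above. The data are simulated the same way as in the question.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Simulate negative binomial data and build the empirical CDF, as in the question
nbx <- rnbinom(10000, size = 5, mu = 10)
q <- unique(sort(nbx))
nbdata <- data.frame(q = q, cdf = ecdf(nbx)(q))

# Hypothetical starting values, expressed on the log scale
nbstart_log <- list(log_size = log(4), log_mu = log(8))

# Fit on the log scale: exp() keeps size and mu positive for any real input
nbfit_log <- nls(cdf ~ pnbinom(q, size = exp(log_size), mu = exp(log_mu)),
                 data = nbdata, start = nbstart_log)

# Back-transform the coefficients to the original scale
est <- exp(coef(nbfit_log))
names(est) <- c("size", "mu")
print(est)
```

The estimates should land near the true values size = 5 and mu = 10; the exact numbers depend on the random sample.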