在 R 中拟合正态分布
Fitting a normal distribution in R
我正在使用以下代码来适应正态分布。 “b”数据集的 link(太大而不能直接 post)是:
link for b
setwd("xxxxxx")
library(fitdistrplus)
require(MASS)
tazur <-read.csv("b", header= TRUE, sep=",")
claims<-tazur$b
a<-log(claims)
plot(hist(a))
绘制直方图后,似乎正态分布应该很合适。
f1n <- fitdistr(claims,"normal")
summary(f1n)
#Length Class Mode
#estimate 2 -none- numeric
#sd 2 -none- numeric
#vcov 4 -none- numeric
#n 1 -none- numeric
#loglik 1 -none- numeric
plot(f1n)
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' is a list, but does not have components 'x' and 'y'
当我尝试绘制拟合分布时出现上述错误,甚至 f1n 的汇总统计数据也关闭。
非常感谢任何帮助。
您似乎混淆了 MASS::fitdistr
和 fitdistrplus::fitdist
。
MASS::fitdistr
returns class "fitdistr" 的对象,并且没有 plot 方法。所以需要自己提取估计参数,绘制估计密度曲线。
- 我不知道你为什么加载包
fitdistrplus
,因为你的函数调用清楚地表明你正在使用 MASS
。无论如何,fitdistrplus
具有 fitdist
函数 returns class "fitdist" 的对象。这个 class 有绘图方法,但它不适用于 MASS
. 返回的 "fitdistr"
我将向您展示如何使用这两个包。
## reproducible example
set.seed(0); x <- rnorm(500)
使用MASS::fitdistr
没有plot方法,自己画
library(MASS)
fit <- fitdistr(x, "normal")
class(fit)
# [1] "fitdistr"
para <- fit$estimate
# mean sd
#-0.0002000485 0.9886248515
hist(x, prob = TRUE)
curve(dnorm(x, para[1], para[2]), col = 2, add = TRUE)
使用fitdistrplus::fitdist
library(fitdistrplus)
FIT <- fitdist(x, "norm") ## note: it is "norm" not "normal"
class(FIT)
# [1] "fitdist"
plot(FIT) ## use method `plot.fitdist`
回顾之前的回答
在之前的回答中我没有提到两种方法的区别。一般来说,如果我们选择最大似然推理,我会推荐使用 MASS::fitdistr
,因为对于许多基本分布,它执行精确推理而不是数值优化。 ?fitdistr
的文档说得很清楚:
For the Normal, log-Normal, geometric, exponential and Poisson
distributions the closed-form MLEs (and exact standard errors) are
used, and ‘start’ should not be supplied.
For all other distributions, direct optimization of the
log-likelihood is performed using ‘optim’. The estimated standard
errors are taken from the observed information matrix, calculated
by a numerical approximation. For one-dimensional problems the
Nelder-Mead method is used and for multi-dimensional problems the
BFGS method, unless arguments named ‘lower’ or ‘upper’ are
supplied (when ‘L-BFGS-B’ is used) or ‘method’ is supplied
explicitly.
另一方面,fitdistrplus::fitdist
总是以数字方式进行推理,即使存在精确推理也是如此。当然,fitdist
的优点是可以使用更多的推理原理:
Fit of univariate distributions to non-censored data by maximum
likelihood (mle), moment matching (mme), quantile matching (qme)
or maximizing goodness-of-fit estimation (mge).
此回答的目的
这个答案将探讨正态分布的精确推理。它会有理论的味道,但没有似然原理的证明;只给出结果。基于这些结果,我们编写了自己的 R 函数进行精确推理,可以与 MASS::fitdistr
进行比较。另一方面,为了与 fitdistrplus::fitdist
进行比较,我们使用 optim
在数值上最小化负对数似然函数。
这是学习统计学和相对高级使用 optim
的绝佳机会。为了方便起见,我将估计尺度参数:方差,而不是标准误差。
正态分布的精确推断
自己写推理函数
下面的代码注释得很好。有一个开关exact
。如果设置FALSE
,则选择数值解。
## fitting a normal distribution
fitnormal <- function (x, exact = TRUE) {
if (exact) {
################################################
## Exact inference based on likelihood theory ##
################################################
## minimum negative log-likelihood (maximum log-likelihood) estimator of `mu` and `phi = sigma ^ 2`
n <- length(x)
mu <- sum(x) / n
phi <- crossprod(x - mu)[1L] / n # (a bised estimator, though)
## inverse of Fisher information matrix evaluated at MLE
invI <- matrix(c(phi, 0, 0, phi * phi), 2L,
dimnames = list(c("mu", "sigma2"), c("mu", "sigma2")))
## log-likelihood at MLE
loglik <- -(n / 2) * (log(2 * pi * phi) + 1)
## return
return(list(theta = c(mu = mu, sigma2 = phi), vcov = invI, loglik = loglik, n = n))
}
else {
##################################################################
## Numerical optimization by minimizing negative log-likelihood ##
##################################################################
## negative log-likelihood function
## define `theta = c(mu, phi)` in order to use `optim`
nllik <- function (theta, x) {
(length(x) / 2) * log(2 * pi * theta[2]) + crossprod(x - theta[1])[1] / (2 * theta[2])
}
## gradient function (remember to flip the sign when using partial derivative result of log-likelihood)
## define `theta = c(mu, phi)` in order to use `optim`
gradient <- function (theta, x) {
pl2pmu <- -sum(x - theta[1]) / theta[2]
pl2pphi <- -crossprod(x - theta[1])[1] / (2 * theta[2] ^ 2) + length(x) / (2 * theta[2])
c(pl2pmu, pl2pphi)
}
## ask `optim` to return Hessian matrix by `hessian = TRUE`
## use "..." part to pass `x` as additional / further argument to "fn" and "gn"
## note, we want `phi` as positive so box constraint is used, with "L-BFGS-B" method chosen
init <- c(sample(x, 1), sample(abs(x) + 0.1, 1)) ## arbitrary valid starting values
z <- optim(par = init, fn = nllik, gr = gradient, x = x, lower = c(-Inf, 0), method = "L-BFGS-B", hessian = TRUE)
## post processing ##
theta <- z$par
loglik <- -z$value ## flip the sign to get log-likelihood
n <- length(x)
## Fisher information matrix (don't flip the sign as this is the Hessian for negative log-likelihood)
I <- z$hessian / n ## remember to take average to get mean
invI <- solve(I, diag(2L)) ## numerical inverse
dimnames(invI) <- list(c("mu", "sigma2"), c("mu", "sigma2"))
## return
return(list(theta = theta, vcov = invI, loglik = loglik, n = n))
}
}
我们仍然使用之前的数据进行测试:
set.seed(0); x <- rnorm(500)
## exact inference
fit <- fitnormal(x)
#$theta
# mu sigma2
#-0.0002000485 0.9773790969
#
#$vcov
# mu sigma2
#mu 0.9773791 0.0000000
#sigma2 0.0000000 0.9552699
#
#$loglik
#[1] -703.7491
#
#$n
#[1] 500
hist(x, prob = TRUE)
curve(dnorm(x, fit$theta[1], sqrt(fit$theta[2])), add = TRUE, col = 2)
数值方法也相当准确,除了方差协方差在对角线上没有精确的0:
fitnormal(x, FALSE)
#$theta
#[1] -0.0002235315 0.9773732277
#
#$vcov
# mu sigma2
#mu 9.773826e-01 5.359978e-06
#sigma2 5.359978e-06 1.910561e+00
#
#$loglik
#[1] -703.7491
#
#$n
#[1] 500
我正在使用以下代码来适应正态分布。 “b”数据集的 link(太大而不能直接 post)是:
link for b
setwd("xxxxxx")
library(fitdistrplus)
require(MASS)
tazur <-read.csv("b", header= TRUE, sep=",")
claims<-tazur$b
a<-log(claims)
plot(hist(a))
绘制直方图后,似乎正态分布应该很合适。
f1n <- fitdistr(claims,"normal")
summary(f1n)
#Length Class Mode
#estimate 2 -none- numeric
#sd 2 -none- numeric
#vcov 4 -none- numeric
#n 1 -none- numeric
#loglik 1 -none- numeric
plot(f1n)
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' is a list, but does not have components 'x' and 'y'
当我尝试绘制拟合分布时出现上述错误,甚至 f1n 的汇总统计数据也关闭。
非常感谢任何帮助。
您似乎混淆了 MASS::fitdistr
和 fitdistrplus::fitdist
。
MASS::fitdistr
returns class "fitdistr" 的对象,并且没有 plot 方法。所以需要自己提取估计参数,绘制估计密度曲线。- 我不知道你为什么加载包
fitdistrplus
,因为你的函数调用清楚地表明你正在使用MASS
。无论如何,fitdistrplus
具有fitdist
函数 returns class "fitdist" 的对象。这个 class 有绘图方法,但它不适用于MASS
. 返回的 "fitdistr"
我将向您展示如何使用这两个包。
## reproducible example
set.seed(0); x <- rnorm(500)
使用MASS::fitdistr
没有plot方法,自己画
library(MASS)
fit <- fitdistr(x, "normal")
class(fit)
# [1] "fitdistr"
para <- fit$estimate
# mean sd
#-0.0002000485 0.9886248515
hist(x, prob = TRUE)
curve(dnorm(x, para[1], para[2]), col = 2, add = TRUE)
使用fitdistrplus::fitdist
library(fitdistrplus)
FIT <- fitdist(x, "norm") ## note: it is "norm" not "normal"
class(FIT)
# [1] "fitdist"
plot(FIT) ## use method `plot.fitdist`
回顾之前的回答
在之前的回答中我没有提到两种方法的区别。一般来说,如果我们选择最大似然推理,我会推荐使用 MASS::fitdistr
,因为对于许多基本分布,它执行精确推理而不是数值优化。 ?fitdistr
的文档说得很清楚:
For the Normal, log-Normal, geometric, exponential and Poisson
distributions the closed-form MLEs (and exact standard errors) are
used, and ‘start’ should not be supplied.
For all other distributions, direct optimization of the
log-likelihood is performed using ‘optim’. The estimated standard
errors are taken from the observed information matrix, calculated
by a numerical approximation. For one-dimensional problems the
Nelder-Mead method is used and for multi-dimensional problems the
BFGS method, unless arguments named ‘lower’ or ‘upper’ are
supplied (when ‘L-BFGS-B’ is used) or ‘method’ is supplied
explicitly.
另一方面,fitdistrplus::fitdist
总是以数字方式进行推理,即使存在精确推理也是如此。当然,fitdist
的优点是可以使用更多的推理原理:
Fit of univariate distributions to non-censored data by maximum
likelihood (mle), moment matching (mme), quantile matching (qme)
or maximizing goodness-of-fit estimation (mge).
此回答的目的
这个答案将探讨正态分布的精确推理。它会有理论的味道,但没有似然原理的证明;只给出结果。基于这些结果,我们编写了自己的 R 函数进行精确推理,可以与 MASS::fitdistr
进行比较。另一方面,为了与 fitdistrplus::fitdist
进行比较,我们使用 optim
在数值上最小化负对数似然函数。
这是学习统计学和相对高级使用 optim
的绝佳机会。为了方便起见,我将估计尺度参数:方差,而不是标准误差。
正态分布的精确推断
自己写推理函数
下面的代码注释得很好。有一个开关exact
。如果设置FALSE
,则选择数值解。
## fitting a normal distribution
fitnormal <- function (x, exact = TRUE) {
if (exact) {
################################################
## Exact inference based on likelihood theory ##
################################################
## minimum negative log-likelihood (maximum log-likelihood) estimator of `mu` and `phi = sigma ^ 2`
n <- length(x)
mu <- sum(x) / n
phi <- crossprod(x - mu)[1L] / n # (a bised estimator, though)
## inverse of Fisher information matrix evaluated at MLE
invI <- matrix(c(phi, 0, 0, phi * phi), 2L,
dimnames = list(c("mu", "sigma2"), c("mu", "sigma2")))
## log-likelihood at MLE
loglik <- -(n / 2) * (log(2 * pi * phi) + 1)
## return
return(list(theta = c(mu = mu, sigma2 = phi), vcov = invI, loglik = loglik, n = n))
}
else {
##################################################################
## Numerical optimization by minimizing negative log-likelihood ##
##################################################################
## negative log-likelihood function
## define `theta = c(mu, phi)` in order to use `optim`
nllik <- function (theta, x) {
(length(x) / 2) * log(2 * pi * theta[2]) + crossprod(x - theta[1])[1] / (2 * theta[2])
}
## gradient function (remember to flip the sign when using partial derivative result of log-likelihood)
## define `theta = c(mu, phi)` in order to use `optim`
gradient <- function (theta, x) {
pl2pmu <- -sum(x - theta[1]) / theta[2]
pl2pphi <- -crossprod(x - theta[1])[1] / (2 * theta[2] ^ 2) + length(x) / (2 * theta[2])
c(pl2pmu, pl2pphi)
}
## ask `optim` to return Hessian matrix by `hessian = TRUE`
## use "..." part to pass `x` as additional / further argument to "fn" and "gn"
## note, we want `phi` as positive so box constraint is used, with "L-BFGS-B" method chosen
init <- c(sample(x, 1), sample(abs(x) + 0.1, 1)) ## arbitrary valid starting values
z <- optim(par = init, fn = nllik, gr = gradient, x = x, lower = c(-Inf, 0), method = "L-BFGS-B", hessian = TRUE)
## post processing ##
theta <- z$par
loglik <- -z$value ## flip the sign to get log-likelihood
n <- length(x)
## Fisher information matrix (don't flip the sign as this is the Hessian for negative log-likelihood)
I <- z$hessian / n ## remember to take average to get mean
invI <- solve(I, diag(2L)) ## numerical inverse
dimnames(invI) <- list(c("mu", "sigma2"), c("mu", "sigma2"))
## return
return(list(theta = theta, vcov = invI, loglik = loglik, n = n))
}
}
我们仍然使用之前的数据进行测试:
set.seed(0); x <- rnorm(500)
## exact inference
fit <- fitnormal(x)
#$theta
# mu sigma2
#-0.0002000485 0.9773790969
#
#$vcov
# mu sigma2
#mu 0.9773791 0.0000000
#sigma2 0.0000000 0.9552699
#
#$loglik
#[1] -703.7491
#
#$n
#[1] 500
hist(x, prob = TRUE)
curve(dnorm(x, fit$theta[1], sqrt(fit$theta[2])), add = TRUE, col = 2)
数值方法也相当准确,除了方差协方差在对角线上没有精确的0:
fitnormal(x, FALSE)
#$theta
#[1] -0.0002235315 0.9773732277
#
#$vcov
# mu sigma2
#mu 9.773826e-01 5.359978e-06
#sigma2 5.359978e-06 1.910561e+00
#
#$loglik
#[1] -703.7491
#
#$n
#[1] 500