Kernel Density Estimate (Probability Density Function) is wrong?
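I have created a histogram to show the density of the age at which serial killers first killed, and I have tried to superimpose a probability density function on it. However, when I use the geom_density() function in ggplot2, the density curve I get looks far too small (area < 1). Strangely, changing the bin width of the histogram also seems to change the density function (the smaller the bin width, the better the density function appears to fit). Does anyone have guidance on how to make this function fit better, and on why its area appears to be far below 1?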
#Histograms for Age of First Kill:
library(ggplot2)
AFKH <- ggplot(df, aes(AgeFirstKill, fill = cut(AgeFirstKill, 100))) +
  # density wasn't working, so had to use ..count../sum(..count..)
  geom_histogram(aes(y = ..count../sum(..count..)), show.legend = FALSE, binwidth = 3) +
  # h = hue range (sets the color gradient), c = chroma, l = luminance
  scale_fill_discrete(h = c(200, 10), c = 100, l = 60) +
  theme(axis.title = element_text(size = 22, face = "bold"),
        plot.title = element_text(size = 30, face = "bold"),
        axis.text.x = element_text(face = "bold", size = 14),
        axis.text.y = element_text(face = "bold", size = 14)) +
  labs(title = "Age of First kill", x = "Age of First Kill", y = "Density") +
  geom_density(aes(AgeFirstKill, y = ..density..), alpha = 0.7, fill = "white", lwd = 1, stat = "density")
AFKH
We don't have your data set, so let's make one that is reasonably close to it:
set.seed(3)
df <- data.frame(AgeFirstKill = rgamma(100, 3, 0.2) + 10)
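The first thing to notice is that the density curve does not change. Look carefully at the y axis of your plots: the peak of the density curve stays at about 0.06 whatever bin width you choose. What changes is the height of the histogram bars, and the y axis rescales with them.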
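This is because you are not dividing the heights of the histogram bars by their width to preserve their area. With ..count../sum(..count..) the bar heights sum to 1, so the total bar area equals the bin width rather than 1. Your y aesthetic should be ..count../sum(..count..)/binwidth
to keep the area constant.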
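The area argument can be checked numerically without drawing anything. The following is a minimal sketch in base R; the helper name bar_area and the way the breaks are built are assumptions made for illustration, not part of the original plotting code.

```r
# Same simulated data as above
set.seed(3)
age <- rgamma(100, 3, 0.2) + 10

# Hypothetical helper: total area of the histogram bars for a given bin width
bar_area <- function(x, bw, divide_by_width) {
  # Equal-width breaks that cover the whole range of x
  breaks <- seq(floor(min(x)), ceiling(max(x)) + bw, by = bw)
  counts <- hist(x, breaks = breaks, plot = FALSE)$counts
  height <- counts / sum(counts)              # proportions: heights sum to 1
  if (divide_by_width) height <- height / bw  # densities: areas sum to 1
  sum(height * bw)
}

widths <- c(1, 3, 7)
area_prop <- sapply(widths, function(bw) bar_area(age, bw, FALSE))
area_dens <- sapply(widths, function(bw) bar_area(age, bw, TRUE))
print(rbind(binwidth = widths, proportion = area_prop, density = area_dens))

# Riemann-sum approximation of the area under the kernel density estimate
kde <- density(age)
kde_area <- sum(kde$y) * diff(kde$x[1:2])
print(kde_area)
```

With proportions on the y axis the bar area equals the bin width, so the bars outgrow the curve as the bins widen. After dividing by the bin width the bar area is 1 for every bin width, which matches the area under the kernel density estimate (approximately 1).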
To demonstrate this, let's wrap your plotting code in a function that lets you specify the bin width and also takes the bin width into account when plotting:
draw_it <- function(bw) {
  ggplot(df, aes(AgeFirstKill, fill = cut(AgeFirstKill, 100))) +
    # dividing the proportions by bw keeps the total bar area at 1
    geom_histogram(aes(y = ..count../sum(..count..)/bw), show.legend = FALSE,
                   binwidth = bw) +
    scale_fill_discrete(h = c(200, 10), c = 100, l = 60) +
    theme(axis.title = element_text(size = 22, face = "bold"),
          plot.title = element_text(size = 30, face = "bold"),
          axis.text.x = element_text(face = "bold", size = 14),
          axis.text.y = element_text(face = "bold", size = 14)) +
    labs(title = "Age of First kill", x = "Age of First Kill", y = "Density") +
    geom_density(aes(AgeFirstKill, y = ..density..), alpha = 0.7,
                 fill = "white", lwd = 1, stat = "density")
}
Now we can do:
draw_it(bw = 1)
draw_it(bw = 3)
draw_it(bw = 7)
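As an aside, the histogram stat in ggplot2 also computes a density variable, which is the count divided by the total count and by the bin width. The sketch below assumes ggplot2 3.3.0 or later, where after_stat() is available. It drops the fill and theme settings for brevity, and the name hist_area is only illustrative.

```r
library(ggplot2)

set.seed(3)
df <- data.frame(AgeFirstKill = rgamma(100, 3, 0.2) + 10)

# Map y to the stat's own density variable instead of rescaling counts by hand
p <- ggplot(df, aes(AgeFirstKill)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 3) +
  geom_density(alpha = 0.7, fill = "white", lwd = 1)

# Pull the computed bars out of the plot and add up their areas
bars <- layer_data(p, 1)
hist_area <- sum(bars$density * (bars$xmax - bars$xmin))
print(hist_area)

p
```

Because the density variable already includes the division by bin width, the bars and the curve share the same scale whichever binwidth is passed.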