Density plots for group values with NAs
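I have a df with IDs and values. I would like to generate a density plot for each unique ID and check whether the distribution is normal or skewed. There are also NA values, and I am not sure how to treat them. Should I remove them before creating the density plots? The range of values also differs between IDs.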
| ID | Values |
| -------- | ------- |
| F1 | 45 |
| F1 | 56 |
| F1 | NA |
| F1 | 68 |
| F1 | 55 |
| F2 | 23 |
| F2 | 44 |
| F2 | 34 |
| F2 | NA |
| F2 | NA |
| F2 | 34 |
| F3 | 5055 |
| F3 | 4567 |
| F3 | NA |
| F3 | 4789 |
| F3 | 5567 |
| F3 | 6002 |
| F4 | 9045 |
| F4 | 9500 |
| F4 | 9760 |
| F4 | NA |
| F4 | 9150 |
dput(df)
structure(list(ID = c("F1", "F1", "F1", "F1", "F1", "F1", "F1",
"F2", "F2", "F2", "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3",
"F3", "F3", "F3", "F3", "F3", "F3", "F4", "F4", "F4", "F4", "F4",
"F4", "F4", "F4"), Values = c(9.6, NA, 10.2, 9.8, 9.9, 9.9, 9.9,
1.2, 1.2, 1.8, 1.5, 1.5, 1.6, 1.4, NA, 3266, 3256, 7044, 6868,
NA, 3405, 3410, NA, 5567, 59.4, 56, 52.8, 52.4, 55.5, NA, NA,
53.6)), class = "data.frame", row.names = c(NA, -32L))
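Updated answer 2
If your axes are very different, you can add scales = "free" or scales = "free_x" to the facet_wrap() call for more flexibility. I also just discovered that {ggplot2} has its own qqplot functionality in geom_qq() and geom_qq_line(). As I mention below, this is a more rigorous way to assess the normality of the data.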
library(tidyverse)
# set up data
df <- structure(list(ID = c("F1", "F1", "F1", "F1", "F1", "F1", "F1",
"F2", "F2", "F2", "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3",
"F3", "F3", "F3", "F3", "F3", "F3", "F4", "F4", "F4", "F4", "F4",
"F4", "F4", "F4"), Values = c(9.6, NA, 10.2, 9.8, 9.9, 9.9, 9.9,
1.2, 1.2, 1.8, 1.5, 1.5, 1.6, 1.4, NA, 3266, 3256, 7044, 6868,
NA, 3405, 3410, NA, 5567, 59.4, 56, 52.8, 52.4, 55.5, NA, NA,
53.6)), class = "data.frame", row.names = c(NA, -32L))
# plot density of each series laid out in facets
df %>%
  ggplot(aes(x = Values)) +
  geom_density() +
  facet_wrap(facets = vars(ID), ncol = 2, scales = "free")
#> Warning: Removed 6 rows containing non-finite values (stat_density).
# generate qqplot for each group to assess normality
df %>%
  ggplot(aes(sample = Values)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(facets = vars(ID), ncol = 2, scales = "free")
#> Warning: Removed 6 rows containing non-finite values (stat_qq).
#> Warning: Removed 6 rows containing non-finite values (stat_qq_line).
Created on 2021-08-05 by the reprex package (v2.0.0)
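As an alternative to free scales, the groups can be put on a common axis by standardizing the values within each ID. This is only a sketch of one possible approach and not part of the answer above; the column name z and the object names df_z and p are made up for illustration. Standardizing removes location and scale, so only the shape of each distribution is compared.

```r
library(dplyr)
library(ggplot2)

# same data as in the question's dput() output
df <- structure(list(ID = c("F1", "F1", "F1", "F1", "F1", "F1", "F1",
"F2", "F2", "F2", "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3",
"F3", "F3", "F3", "F3", "F3", "F3", "F4", "F4", "F4", "F4", "F4",
"F4", "F4", "F4"), Values = c(9.6, NA, 10.2, 9.8, 9.9, 9.9, 9.9,
1.2, 1.2, 1.8, 1.5, 1.5, 1.6, 1.4, NA, 3266, 3256, 7044, 6868,
NA, 3405, 3410, NA, 5567, 59.4, 56, 52.8, 52.4, 55.5, NA, NA,
53.6)), class = "data.frame", row.names = c(NA, -32L))

# z-score within each ID; na.rm = TRUE so NAs do not poison mean() and sd()
# (rows with NA in Values simply keep NA in z)
df_z <- df %>%
  group_by(ID) %>%
  mutate(z = (Values - mean(Values, na.rm = TRUE)) / sd(Values, na.rm = TRUE)) %>%
  ungroup()

# overlaid densities now share one comparable x axis;
# na.rm = TRUE drops the NA rows silently instead of with a warning
p <- ggplot(df_z, aes(x = z, color = ID, fill = ID)) +
  geom_density(alpha = 0.4, na.rm = TRUE)
```

Print p to draw the plot. With only six or seven observations per group the curves are rough, so treat this as a quick visual check rather than evidence of normality.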
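Updated answer 1
In response to the clarification in the comments:
If you have many groups to compare, you probably do not want to overlay them all on top of one another. Instead, I suggest faceting them into an array of separate plots. Note the use of ncol in the facet_wrap call below; it controls the dimensions of the resulting array of plots.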
library(tidyverse)
# set up data
df <- structure(list(ID = c("F1", "F1", "F1", "F1", "F1", "F2", "F2",
"F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3"),
Values = c(45, 56, NA, 68, 55, 23, 44, 34, NA, NA, 34, 12, 19, NA, 25, 36)),
row.names = c(NA, -16L),
class = "data.frame")
# plot density of each series laid out in facets
df %>%
  ggplot(aes(x = Values)) +
  geom_density() +
  facet_wrap(facets = vars(ID), ncol = 2)
#> Warning: Removed 4 rows containing non-finite values (stat_density).
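Created on 2021-08-03 by the reprex package (v2.0.0)
Original answer
Your code does work in my hands. I have put the data here in a form that is easier to import, so that people can test and troubleshoot for you. However, if you really want to assess whether your values follow a normal distribution, a qqplot may be a better choice. See both below, and note that NA values are dropped by default, so there is no need to remove them explicitly. Also, this may not mean much on such a small dataset, but I assume it is just an example for testing the code.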
library(tidyverse)
# set up data
df <- structure(list(ID = c("F1", "F1", "F1", "F1", "F1", "F2", "F2",
"F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3"),
Values = c(45, 56, NA, 68, 55, 23, 44, 34, NA, NA, 34, 12, 19, NA, 25, 36)),
row.names = c(NA, -16L),
class = "data.frame")
# plot density of each series overlaid
df %>%
  ggplot(aes(x = Values)) +
  geom_density(aes(color = ID, fill = ID), alpha = 0.4)
#> Warning: Removed 4 rows containing non-finite values (stat_density).
# generate qqplot for each group to assess normality
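par(mfrow = c(2, 2))  # 2 x 2 grid for the base-graphics qq plots
df %>%
  group_by(ID) %>%
  group_split(.keep = FALSE) %>%        # one data frame per ID, ID column dropped
  lapply(unlist, use.names = FALSE) %>% # flatten each to a plain numeric vector
  lapply(., qqnorm)                     # qqnorm ignores NA values when plotting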
#> [[1]]
#> [[1]]$x
#> [1] -1.0491314 0.2993069 NA 1.0491314 -0.2993069
#>
#> [[1]]$y
#> [1] 45 56 NA 68 55
#>
#>
#> [[2]]
#> [[2]]$x
#> [1] -1.0491314 1.0491314 -0.2993069 NA NA 0.2993069
#>
#> [[2]]$y
#> [1] 23 44 34 NA NA 34
#>
#>
#> [[3]]
#> [[3]]$x
#> [1] -1.0491314 -0.2993069 NA 0.2993069 1.0491314
#>
#> [[3]]$y
#> [1] 12 19 NA 25 36
Created on 2021-08-03 by the reprex package (v2.0.0)
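If you would rather handle the NA values explicitly than rely on the default behaviour, here is a minimal sketch of two options. The object names na_counts, df_complete and p are made up for illustration. It first counts the missing values per ID, then either filters them out beforehand or passes na.rm = TRUE to geom_density(), which drops them silently instead of with a warning.

```r
library(dplyr)
library(ggplot2)

# same example data as in the original answer
df <- structure(list(ID = c("F1", "F1", "F1", "F1", "F1", "F2", "F2",
"F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3"),
Values = c(45, 56, NA, 68, 55, 23, 44, 34, NA, NA, 34, 12, 19, NA, 25, 36)),
row.names = c(NA, -16L),
class = "data.frame")

# how many observations and how many NAs does each group have?
na_counts <- df %>%
  group_by(ID) %>%
  summarise(n = n(), n_missing = sum(is.na(Values)))

# option 1: drop the NA rows yourself before plotting
df_complete <- df %>%
  filter(!is.na(Values))

# option 2: keep df as is and let the geom drop NAs without a warning
p <- ggplot(df, aes(x = Values)) +
  geom_density(aes(color = ID, fill = ID), alpha = 0.4, na.rm = TRUE)
```

Both options give the same density curves, because a kernel density estimate only uses the observed values. Checking na_counts first is still worthwhile: a group that is mostly NA leaves very few points to estimate a density from.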