Density plots for group values with NAs

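I have a df with IDs and values. I want to generate a density plot for each unique ID and check whether the distribution is normal or skewed. There are also NA values, and I am not sure how to treat them. Should I remove them before creating the density plots? The range of values also differs between IDs.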

| ID       |  Values |
| -------- | ------- |
| F1       | 45      |
| F1       | 56      |
| F1       | NA      |
| F1       | 68      |
| F1       | 55      |
| F2       | 23      |
| F2       | 44      |
| F2       | 34      |
| F2       | NA      |
| F2       | NA      |
| F2       | 34      |
| F3       | 5055    | 
| F3       | 4567    |
| F3       | NA      | 
| F3       | 4789    |
| F3       | 5567    |
| F3       | 6002    |
| F4       | 9045    |
| F4       | 9500    | 
| F4       | 9760    |
| F4       | NA      |
| F4       | 9150    |
dput(df)
structure(list(ID = c("F1", "F1", "F1", "F1", "F1", "F1", "F1", 
"F2", "F2", "F2", "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3", 
"F3", "F3", "F3", "F3", "F3", "F3", "F4", "F4", "F4", "F4", "F4", 
"F4", "F4", "F4"), Values = c(9.6, NA, 10.2, 9.8, 9.9, 9.9, 9.9, 
1.2, 1.2, 1.8, 1.5, 1.5, 1.6, 1.4, NA, 3266, 3256, 7044, 6868, 
NA, 3405, 3410, NA, 5567, 59.4, 56, 52.8, 52.4, 55.5, NA, NA, 
53.6)), class = "data.frame", row.names = c(NA, -32L))

Updated answer 2

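If your axes differ greatly between groups, you can add scales = "free" or scales = "free_x" to the facet_wrap() call for more flexibility. I also just discovered that {ggplot2} has its own qqplot functionality in geom_qq() and geom_qq_line(). As I mention below, a qqplot is a more rigorous way to assess the normality of your data.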

library(tidyverse)

# set up data
df <- structure(list(ID = c("F1", "F1", "F1", "F1", "F1", "F1", "F1", 
                            "F2", "F2", "F2", "F2", "F2", "F2", "F2", "F2", "F3", "F3", "F3", 
                            "F3", "F3", "F3", "F3", "F3", "F3", "F4", "F4", "F4", "F4", "F4", 
                            "F4", "F4", "F4"), Values = c(9.6, NA, 10.2, 9.8, 9.9, 9.9, 9.9, 
                                                          1.2, 1.2, 1.8, 1.5, 1.5, 1.6, 1.4, NA, 3266, 3256, 7044, 6868, 
                                                          NA, 3405, 3410, NA, 5567, 59.4, 56, 52.8, 52.4, 55.5, NA, NA, 
                                                          53.6)), class = "data.frame", row.names = c(NA, -32L))

# plot density of each series laid out in facets
df %>% 
  ggplot(aes(x = Values)) +
  geom_density() +
  facet_wrap(facets = vars(ID), ncol = 2, scales = "free")
#> Warning: Removed 6 rows containing non-finite values (stat_density).

# generate qqplot for each group to assess normality
df %>% 
  ggplot(aes(sample = Values)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(facets = vars(ID), ncol = 2, scales = "free")
#> Warning: Removed 6 rows containing non-finite values (stat_qq).
#> Warning: Removed 6 rows containing non-finite values (stat_qq_line).

Created on 2021-08-05 by the reprex package (v2.0.0)
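
The warnings above report 6 removed rows. As a quick check, the missing values can be tallied per group, and the rows can be dropped explicitly (or `na.rm = TRUE` passed to the geom) if you prefer a plot without warnings. This is a minimal sketch, not part of the original answer; the names `na_per_group` and `df_complete` are made up for illustration.

```r
library(ggplot2)

# Same data as above, written compactly
df <- data.frame(
  ID = rep(c("F1", "F2", "F3", "F4"), times = c(7, 8, 9, 8)),
  Values = c(9.6, NA, 10.2, 9.8, 9.9, 9.9, 9.9,
             1.2, 1.2, 1.8, 1.5, 1.5, 1.6, 1.4, NA,
             3266, 3256, 7044, 6868, NA, 3405, 3410, NA, 5567,
             59.4, 56, 52.8, 52.4, 55.5, NA, NA, 53.6)
)

# Count missing values per group; the total should match the warning
na_per_group <- tapply(is.na(df$Values), df$ID, sum)
print(na_per_group)
print(sum(na_per_group))

# Option 1: drop the NA rows explicitly before plotting
df_complete <- df[!is.na(df$Values), ]
p1 <- ggplot(df_complete, aes(x = Values)) +
  geom_density() +
  facet_wrap(facets = vars(ID), ncol = 2, scales = "free")
print(p1)

# Option 2: keep the NA rows and let the geom drop them silently
p2 <- ggplot(df, aes(x = Values)) +
  geom_density(na.rm = TRUE) +
  facet_wrap(facets = vars(ID), ncol = 2, scales = "free")
print(p2)
```

Both options should draw the same curves, since the density is estimated from the non-missing values either way.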

Updated answer 1

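In response to the clarification in the comments: if you have many groups to compare, you probably don't want to overlay them all on one plot. Instead, I suggest faceting them into an array of many plots. Note the use of ncol in the facet_wrap call below; it controls the dimensions of the resulting array of plots.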

library(tidyverse)

# set up data
df <- structure(list(ID = c("F1", "F1", "F1", "F1", "F1", "F2", "F2", 
                            "F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3"), 
                     Values = c(45, 56, NA, 68, 55, 23, 44, 34, NA, NA, 34, 12, 19, NA, 25, 36)), 
                row.names = c(NA, -16L), 
                class = "data.frame")

# plot density of each series laid out in facets
df %>% 
  ggplot(aes(x = Values)) +
  geom_density() +
  facet_wrap(facets = vars(ID), ncol = 2)
#> Warning: Removed 4 rows containing non-finite values (stat_density).

Created on 2021-08-03 by the reprex package (v2.0.0)
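
With many groups, ncol can be derived from the number of groups rather than hard-coded. The sketch below uses a roughly square grid; that heuristic and the names `n_groups`, `n_col` and `n_row` are assumptions for illustration, not something the answer above prescribes.

```r
library(ggplot2)

# Same example data as in this answer, written compactly
df <- data.frame(
  ID = rep(c("F1", "F2", "F3"), times = c(5, 6, 5)),
  Values = c(45, 56, NA, 68, 55, 23, 44, 34, NA, NA, 34, 12, 19, NA, 25, 36)
)

# Hypothetical heuristic: aim for a roughly square grid of panels
n_groups <- length(unique(df$ID))
n_col <- ceiling(sqrt(n_groups))
# facet_wrap fills as many rows as needed for the chosen ncol
n_row <- ceiling(n_groups / n_col)

p <- ggplot(df, aes(x = Values)) +
  geom_density(na.rm = TRUE) +
  facet_wrap(facets = vars(ID), ncol = n_col)
print(p)
```

For the three groups here this gives a 2 x 2 grid with one empty cell, the same layout as ncol = 2 above.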

Original answer

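In my hands, your code does work. I have put the data here in a form that is easier to import, so that people can test and troubleshoot for you. However, if you really want to assess whether your values follow a normal distribution, a qqplot may be a better choice. See both below, and note that NA values are removed by default, so there is no need to remove them explicitly. Also, the plots may not mean much on such a small dataset, but I assume this is just an example for testing the code.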

library(tidyverse)

# set up data
df <- structure(list(ID = c("F1", "F1", "F1", "F1", "F1", "F2", "F2", 
                            "F2", "F2", "F2", "F2", "F3", "F3", "F3", "F3", "F3"), 
                     Values = c(45, 56, NA, 68, 55, 23, 44, 34, NA, NA, 34, 12, 19, NA, 25, 36)), 
                row.names = c(NA, -16L), 
                class = "data.frame")

# plot density of each series overlaid
df %>% 
  ggplot(aes(x = Values)) +
  geom_density(aes(color = ID, fill = ID), alpha = 0.4)
#> Warning: Removed 4 rows containing non-finite values (stat_density).

# generate qqplot for each group to assess normality
par(mfrow = c(2, 2))
df %>% 
  group_by(ID) %>% 
  group_split(.keep = FALSE) %>% 
  lapply(unlist, use.names = FALSE) %>% 
  lapply(., qqnorm)
#> [[1]]
#> [[1]]$x
#> [1] -1.0491314  0.2993069         NA  1.0491314 -0.2993069
#> 
#> [[1]]$y
#> [1] 45 56 NA 68 55
#> 
#> 
#> [[2]]
#> [[2]]$x
#> [1] -1.0491314  1.0491314 -0.2993069         NA         NA  0.2993069
#> 
#> [[2]]$y
#> [1] 23 44 34 NA NA 34
#> 
#> 
#> [[3]]
#> [[3]]$x
#> [1] -1.0491314 -0.2993069         NA  0.2993069  1.0491314
#> 
#> [[3]]$y
#> [1] 12 19 NA 25 36

Created on 2021-08-03 by the reprex package (v2.0.0)
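
The base R pipeline above prints the list that qqnorm() returns for each group and draws the points without a reference line. A possible variation, sketched here with split() and a for loop, labels each panel with its ID and adds qqline(). The panel titles and the name `groups` are illustrative choices, not from the original code.

```r
# Same example data as in this answer, written compactly
df <- data.frame(
  ID = rep(c("F1", "F2", "F3"), times = c(5, 6, 5)),
  Values = c(45, 56, NA, 68, 55, 23, 44, 34, NA, NA, 34, 12, 19, NA, 25, 36)
)

# One numeric vector per ID
groups <- split(df$Values, df$ID)

# 2 x 2 grid of base graphics panels; restore the old settings afterwards
old_par <- par(mfrow = c(2, 2))
for (id in names(groups)) {
  # qqnorm() and qqline() both ignore NA values
  qqnorm(groups[[id]], main = paste("Normal Q-Q plot:", id))
  qqline(groups[[id]])
}
par(old_par)
```

Because the calls run inside a for loop, nothing is printed to the console, and the NA values do not need to be removed beforehand.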