R: aggregate function apparently causes loss of column levels?

Question

I just ran into a strange situation in RGui. I used the same script as always to get my data.frame into the right shape for ggplot2. My data looks like this:

      time days treatment nucleic_acid habitat  parallel   disturbance     variable  cellcounts      value
1    1    2   control          dna   water        1         none     Proteobacteria       batch     0.000000000
2    2   22   control          dna   water        1         none     Proteobacteria       batch     0.003586543
3    1    2   treated          dna   water        1         none     Proteobacteria       batch     0.000000000
4    2   22   treated          dna   biofilm      1         none     Proteobacteria       NA        0.000000000

'data.frame':   185648 obs. of  10 variables:
 $ time        : int  5 5 5 5 5 5 6 6 6 6 ...
 $ days        : int  62 62 62 62 62 62 69 69 69 69 ...
 $ treatment   : Factor w/ 2 levels "control","treated": 2 2 2 1 1 1 2 2 2 1 ...
 $ parallel    : int  1 2 3 1 2 3 1 2 3 1 ...
 $ nucleic_acid: Factor w/ 2 levels "cdna","dna": 1 1 1 1 1 1 1 1 1 1 ...
 $ habitat     : Factor w/ 2 levels "biofilm","water": 1 1 1 1 1 1 1 1 1 1 ...
 $ cellcounts  : Factor w/ 4 levels "batch","high",..: NA NA NA NA NA NA NA NA NA NA ...
 $ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
 $ variable    : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value       : num  0 0 0 0 0 0 0 0 0 0 ...

I want to use aggregate to compute the mean over the (up to 3) parallels:

df_mean<-aggregate(value~time+days+treatment+nucleic_acid+habitat+disturbance+variable+cellcounts, data = df, mean)

After that, the level "biofilm" is missing from the "habitat" column.

df_mean<-droplevels(df_mean)

str(df_mean)
'data.frame':   44608 obs. of  9 variables:
 $ time        : int  1 2 1 2 1 2 1 2 1 2 ...
 $ days        : int  2 22 2 22 2 22 2 22 2 22 ...
 $ treatment   : Factor w/ 2 levels "control","treated": 1 1 2 2 1 1 2 2 1 1 ...
 $ nucleic_acid: Factor w/ 2 levels "cdna","dna": 2 2 2 2 2 2 2 2 2 2 ...
 $ habitat     : Factor w/ 1 level "water": 1 1 1 1 1 1 1 1 1 1 ...
 $ disturbance : Factor w/ 3 levels "high","low","none": 3 3 3 3 3 3 3 3 3 3 ...
 $ variable    : Factor w/ 656 levels "Proteobacteria",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ cellcounts  : Factor w/ 4 levels "batch","high",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value       : num  0 0.00359 0 0 0 ...

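I spent a lot of time investigating this (in fact I only just noticed it, and several other problems I have now all seem to be related to aggregate). When I removed the "cellcounts" column, it worked. Interestingly, for "biofilm" the columns "cellcounts" and "habitat" always carry the same information and are therefore redundant ("biofilm" always comes with NA). Is that the reason? It always worked before, so I don't understand this. Has something changed in base::aggregate or a related function? Do you have an explanation? I am using R-3.4.0; the other packages in use are reshape, reshape2, and ggplot2.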

Many thanks, a confused crazy Santa Claus

Answer 1

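The problem comes from the NAs: the formula method of aggregate uses na.action = na.omit by default, so every row with an NA in any of the formula variables (here, all the "biofilm" rows, whose cellcounts is NA) is dropped before the means are computed. Perhaps your files were loaded differently in the past, so that these values were stored as strings rather than as NA? Here is a way to fix it by turning them into the string "NA":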

# Add "NA" as an explicit factor level, then recode the missing values to it
levels(df$cellcounts) <- c(levels(df$cellcounts), "NA")
df$cellcounts[is.na(df$cellcounts)] <- "NA"
df_mean <- aggregate(value ~ time + days + treatment + nucleic_acid + habitat + disturbance + variable + cellcounts,
                     data = df, FUN = mean, na.rm = TRUE)
df_mean <- droplevels(df_mean)
str(df_mean)

'data.frame':   4 obs. of  9 variables:
 $ time        : int  1 2 1 2
 $ days        : int  2 22 2 22
 $ treatment   : Factor w/ 2 levels "control","treated": 1 1 2 2
 $ nucleic_acid: Factor w/ 1 level "dna": 1 1 1 1
 $ habitat     : Factor w/ 2 levels "biofilm","water": 2 2 2 1
 $ disturbance : Factor w/ 1 level "none": 1 1 1 1
 $ variable    : Factor w/ 1 level "Proteobacteria": 1 1 1 1
 $ cellcounts  : Factor w/ 2 levels "batch","NA": 1 1 1 2
 $ value       : num  0 0.00359 0 0
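
To see why the rows vanish in the first place, here is a minimal sketch on toy data (the three-column data frame below is made up for illustration and only mirrors the pattern in the question, where every "biofilm" row has a missing cellcounts). It first cross-tabulates habitat against missingness, then compares aggregating with and without cellcounts in the formula:

```r
# Toy data (hypothetical): every "biofilm" row has cellcounts = NA
df <- data.frame(
  habitat    = factor(c("water", "water", "water", "biofilm")),
  cellcounts = factor(c("batch", "batch", "batch", NA)),
  value      = c(0, 0.003586543, 0, 0)
)

# Diagnostic: which habitat levels coincide with missing cellcounts?
tab <- table(habitat = df$habitat, cellcounts_missing = is.na(df$cellcounts))
print(tab)

# With cellcounts in the formula, the default na.action = na.omit removes
# the NA rows before grouping, so "biofilm" never reaches mean()
with_cc <- aggregate(value ~ habitat + cellcounts, data = df, FUN = mean)
print(with_cc)

# Without cellcounts in the formula, no row contains an NA in a formula
# variable, so both habitats survive
without_cc <- aggregate(value ~ habitat, data = df, FUN = mean)
print(without_cc)
```

This matches the observation in the question that removing "cellcounts" made the aggregation work again.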

Data

# stringsAsFactors = TRUE keeps the text columns as factors in R >= 4.0 as well,
# which the levels() call above relies on
df <- read.table(text = "  time days treatment nucleic_acid habitat  parallel   disturbance     variable  cellcounts      value
1    1    2   control          dna   water        1         none     Proteobacteria       batch     0.000000000
2    2   22   control          dna   water        1         none     Proteobacteria       batch     0.003586543
3    1    2   treated          dna   water        1         none     Proteobacteria       batch     0.000000000
4    2   22   treated          dna   biofilm      1         none     Proteobacteria       NA        0.000000000
", header = TRUE, stringsAsFactors = TRUE)

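If several factor columns may contain NAs, the same recoding can be wrapped in a small helper and applied to all of them before aggregating. The following is only a sketch: na_to_level is a hypothetical helper name (not part of base R), and the toy data frame is made up for illustration.

```r
# Toy data (hypothetical): "biofilm" rows have no cellcounts
df <- data.frame(
  habitat    = factor(c("water", "water", "biofilm", "biofilm")),
  cellcounts = factor(c("batch", "batch", NA, NA)),
  value      = c(0, 0.2, 0.4, 0.6)
)

# Hypothetical helper: turn missing values of a factor into an explicit level
na_to_level <- function(f, label = "NA") {
  if (!anyNA(f)) return(f)                 # nothing to recode
  levels(f) <- union(levels(f), label)     # add the label once
  f[is.na(f)] <- label                     # recode missing entries
  f
}

# Apply the helper to every factor column of the data frame
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], na_to_level)

# No formula variable contains NA any more, so na.omit drops nothing
df_mean <- aggregate(value ~ habitat + cellcounts, data = df, FUN = mean)
print(df_mean)
```

Recoding to a string level is a design choice that keeps the default na.action untouched; the label itself ("NA" here, as in the answer) is arbitrary and could be something more descriptive.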