Modifying levels in a factor variable using ifelse

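I ran into an odd situation while trying to modify the levels of a factor variable by combining several levels into one. My new level is created, but all the remaining levels appear to have been shifted up to the next level. Here are my sample data, the code I used, and the output.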

library(tidyverse) 
data <- structure(list(factor1 = structure(c(1L, 1L, 2L, 3L, 1L, 2L, 
        1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 3L, 1L, 1L, 1L, 4L), .Label = c("0", "1", "2", "3", 
        "4", "5", "6", "7"), class = "factor")), row.names = c(NA, -30L
        ), class = c("tbl_df", "tbl", "data.frame"), .Names = "factor1")
data_out <- data %>% mutate(factor1 = ifelse(factor1 %in% c('0', '1'), 
                                             factor1, '>1'))
structure(list(factor1 = c("1", "1", "2", ">1", "1", "2", "1", 
"1", "2", "2", "2", "2", "2", "1", "2", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", ">1", "1", "1", "1", ">1")), .Names = "factor1", 
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -30L))

Is this the intended behavior? It is certainly not what I want. How can it be explained and corrected?

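I guess the problem is related to how factors are constructed. It is still unclear to me how the values "0" and "1" end up as "1" and "2" (alongside ">1") after going through mutate.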

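R factors are really 1-based integer vectors with their levels stored as an attribute. So your "0" level is stored as the integer 1, and your "1" level as the integer 2. At first glance it looks as if mutate saw fit to create a new column with an additional value printed as ">1" while also turning "0" into "1" and "1" into "2". To me this looks like dangerous behavior. I think it should either give you a new factor with levels "0", "1", ">1" or throw an error.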
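To make the storage model concrete, here is a minimal base R sketch. The vector `f` is a small stand-in built for illustration, not the question's actual data; it only reuses the same eight levels.

```r
# Illustrative stand-in for the question's factor (same levels "0" to "7").
f <- factor(c("0", "0", "1", "2", "3"), levels = as.character(0:7))

labels <- levels(f)        # the level labels, stored once as an attribute
codes  <- as.integer(f)    # the stored 1-based codes, one per element
values <- as.character(f)  # the label each element maps to

print(labels)
print(codes)
print(values)
```

The codes are positions in `levels(f)`, which is why the label "0" is stored as 1 and the label "1" as 2.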

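The error actually comes from ifelse rather than mutate. ifelse drops the attributes of its `yes` argument, so it returns the factor's underlying integer codes, and these are then coerced to character once they are combined with '>1'. The new column is therefore a character vector, not a factor. If you coerce data to a plain data frame and use base R, you see the same thing: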

data <- as.data.frame(data)   # drop the tibble classes
data$factor2 <- ifelse(data$factor1 %in% c('0', '1'),
                       data$factor1, '>1')
data
#-------- same issue except
   factor1 factor2
1        0       1
2        0       1
3        1       2
4        2      >1
.... (remaining 26 rows omitted)
> str(data)
'data.frame':   30 obs. of  2 variables:
 $ factor1: Factor w/ 8 levels "0","1","2","3",..: 1 1 2 3 1 2 1 1 2 2 ...
 $ factor2: chr  "1" "1" "2" ">1" ...
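One common way to avoid this, sketched here in base R rather than taken from the answers, is to hand ifelse the labels instead of the factor itself and rebuild the factor afterwards. The vector `f` is again an illustrative stand-in, and the level order "0", "1", ">1" is an assumption about what is wanted.

```r
# Illustrative stand-in for data$factor1 (same levels "0" to "7").
f <- factor(c("0", "0", "1", "2", "0", "3"), levels = as.character(0:7))

# Convert to character first so ifelse() works on labels, not integer codes.
labels_out <- ifelse(f %in% c("0", "1"), as.character(f), ">1")

# Rebuild a factor with the desired level order (assumed: "0", "1", ">1").
f_out <- factor(labels_out, levels = c("0", "1", ">1"))

print(f_out)
```

The same expression can be placed inside mutate, since the problem lies in ifelse and not in dplyr.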

This keeps you within the dplyr package:

recode_factor(data$factor1, `0` = "0", `1` = "1", .default=">1")
 [1] 0  0  1  >1 0  1  0  0  1  1  1  1  1  0  1  0  0  0  0  0  0  0  0  0  0  >1 0  0  0  >1
Levels: 0 1 >1

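In case someone runs into a similar problem in the future and is looking for a simple way to group these levels without reassigning the remaining ones: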

fct_collapse(data$factor1, '>1' = c('2', '3'))
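Note that the call above merges only "2" and "3", so the unused levels "4" to "7" remain on the factor. If every level other than "0" and "1" should become ">1", a base R alternative is to rename the levels in place, because assigning the same label to several levels merges them. This is a sketch on an illustrative stand-in vector `f`, not the question's data.

```r
# Illustrative stand-in for data$factor1 (same levels "0" to "7").
f <- factor(c("0", "0", "1", "2", "0", "3"), levels = as.character(0:7))

collapsed <- f
# Rename every level except "0" and "1"; the duplicated labels are merged.
levels(collapsed)[!levels(collapsed) %in% c("0", "1")] <- ">1"

print(levels(collapsed))
print(collapsed)
```

Elements that were "0" or "1" keep their labels, since only the level names change and no values are recomputed.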