Conditional recoding of factor to factor

Question

I have a tibble, df, with a factor, A, that I wish to:

1) copy as C, and
2) recode based on a second variable, B.

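At the moment I am doing it in this roundabout way. I am quite confused about conditionally recoding a factor. I also looked at dplyr's recode, but could not work out a smarter method.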

library(tibble)
df  <- tibble(
  A = factor(c(NA, "b", "c")), 
  B = c(1,NA,3)
)

My initial tibble

df
#> # A tibble: 3 x 2
#>        A     B
#>   <fctr> <dbl>
#> 1   <NA>     1
#> 2      b    NA
#> 3      c     3

Step 1 in my current solution

df$C <- with(df, ifelse(is.na(B), 'B is NA', A)) 
df
#> # A tibble: 3 x 3
#>        A     B       C
#>   <fctr> <dbl>   <chr>
#> 1   <NA>     1    <NA>
#> 2      b    NA B is NA
#> 3      c     3       2

Step 2 in my current solution

df$C <- dplyr::recode_factor(df$C, '2' = 'c')
df
#> # A tibble: 3 x 3
#>        A     B       C
#>   <fctr> <dbl>  <fctr>
#> 1   <NA>     1    <NA>
#> 2      b    NA B is NA
#> 3      c     3       c

How am I supposed to do this?

Answer 1

Use dplyr::if_else, converting the factor to character first and then converting the result back to a factor:

library(dplyr)

df %>% 
  mutate(C = factor(if_else(is.na(B), "B is NA", as.character(A))))

# # A tibble: 3 x 3
#          A     B       C
#     <fctr> <dbl>  <fctr>
#   1   <NA>     1    <NA>
#   2      b    NA B is NA
#   3      c     3       c

Answer 2

The conversion happens in ifelse. From the documentation:

Value

A vector of the same length and attributes (including dimensions and "class") as test and data values from the values of yes or no. The mode of the answer will be coerced from logical to accommodate first any values taken from yes and then any values taken from no.

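Because yes is "B is NA", a character vector, the output is a character vector. The values taken from A are converted to their integer codes and then to character, which is an odd consequence of the implementation: factors are really integer vectors with modified class and levels attributes. That is why row 3 shows "2" (the code for level "c") rather than "c".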
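The coercion described above can be reproduced in isolation. This is a minimal base R sketch using standalone vectors A and B that mirror the columns in the question; the names codes, via_ifelse and via_labels are made up for illustration.

```r
# Standalone copies of the two columns from the question.
A <- factor(c(NA, "b", "c"))
B <- c(1, NA, 3)

# A factor stores integer codes; the labels live in the "levels" attribute.
codes <- as.integer(A)   # NA, 1, 2
levels(A)                # "b", "c"

# ifelse() does not keep the factor class of A, so the values taken from A
# arrive as integer codes and are then coerced to character.
via_ifelse <- ifelse(is.na(B), "B is NA", A)

# Converting to character first keeps the labels instead of the codes.
via_labels <- ifelse(is.na(B), "B is NA", as.character(A))
```

Here via_ifelse ends up holding the code "2" where the label "c" was intended, while via_labels keeps "c", which is why Answer 1 calls as.character(A) before combining the two branches.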

You could also do this by copying A, adding "B is NA" to the allowed levels, and then replacing a subset.

df$C <- df$A
levels(df$C) <- c(levels(df$C), "B is NA")
df$C[is.na(df$B)] <- "B is NA"
df
# # A tibble: 3 x 3
#        A     B       C
#   <fctr> <dbl>  <fctr>
# 1   <NA>     1    <NA>
# 2      b    NA B is NA
# 3      c     3       c

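Note that if you don't add "B is NA" to the levels, all the replaced values will be NA, with a warning. Factors are constrained to take only specific values. If you want to add a new one, you have to do so explicitly.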
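To see the difference between the two cases, here is a small base R sketch on standalone vectors. The names C_bad, C_good and warned are hypothetical, and the warning is captured with withCallingHandlers only so that the script can record it and continue.

```r
A <- factor(c(NA, "b", "c"))
B <- c(1, NA, 3)

# Case 1: assign a value that is not among the levels.
C_bad <- A
warned <- FALSE
withCallingHandlers(
  C_bad[is.na(B)] <- "B is NA",
  warning = function(w) {
    warned <<- TRUE                 # record that a warning was raised
    invokeRestart("muffleWarning")  # continue without printing it
  }
)
# The replaced element of C_bad is NA because "B is NA" is not a level.

# Case 2: extend the levels first, then assign.
C_good <- A
levels(C_good) <- c(levels(C_good), "B is NA")
C_good[is.na(B)] <- "B is NA"
```

After this runs, warned is TRUE and C_bad[2] is NA, whereas C_good holds "B is NA" in that position with the new level appended after the existing ones.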
