Convert numeric variables to factors with dplyr when the number of levels is below a given threshold

Question

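I want to use dplyr to convert numeric variables to factors when their number of levels (distinct values) is below a given threshold.

This is mostly useful for binary variables coded as numeric 0/1.

Example data: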

threshold<-5

data<-data.frame(binary1=rep(c(0,1), 5), binary_2=sample(c(0,1), 10, replace = TRUE), multilevel=sample(c(1:4), 10, replace=TRUE), numerical=1:10)

> data
   binary1 binary_2 multilevel numerical
1        0        1          2         1
2        1        0          3         2
3        0        1          2         3
4        1        0          1         4
5        0        1          2         5
6        1        1          4         6
7        0        1          1         7
8        1        1          3         8
9        0        1          1         9
10       1        0          4        10

sapply(data, class)
   binary1   binary_2 multilevel  numerical 
 "numeric"  "numeric"  "integer"  "integer"

I can easily convert all the variables to factors using mutate(), across() and where(), like this:

data<-data%>%mutate(across(where(is.numeric), as.factor))

> sapply(data, class)
   binary1   binary_2 multilevel  numerical 
  "factor"   "factor"   "factor"   "factor"

However, I cannot find a way to give where() multiple conditions, including my threshold parameter. I would like this output:

sapply(data, class)
   binary1   binary_2 multilevel  numerical 
 "factor"  "factor"  "factor"  "integer"

I tried the following, but it fails:

data%>%mutate(across(where(is.numeric & length(unique(.x))<threshold), as.factor))

Error message:

Error: Problem with `mutate()` input `..1`.
x object '.x' not found
ℹ Input `..1` is `across(where(!is.factor & length(unique(.x)) < threshold), as.factor)`.
Run `rlang::last_error()` to see where the error occurred.

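I probably don't understand across() and where() well enough. Suggestions are welcome.

Bonus question: why does putting a negation operator (!) in front of is.factor give me an error, while the version without (!) works fine?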

data<-data%>%mutate(across(where(!is.factor), as.factor))

Error: Problem with `mutate()` input `..1`.
x invalid argument type
ℹ Input `..1` is `across(where(!is.factor), as.factor)`.
Run `rlang::last_error()` to see where the error occurred.

Answer 1

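Use an anonymous function or lambda inside where(). where() expects a function that it calls on each column and that returns a single TRUE or FALSE, so .x only exists inside the lambda.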

library(dplyr)

data <- data %>% 
     mutate(across(where(~is.numeric(.) && n_distinct(.) < threshold), factor))

sapply(data, class)

#   binary1   binary_2 multilevel  numerical 
#  "factor"   "factor"   "factor"  "integer"
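
To make the mechanics a bit more concrete, here is a minimal sketch of the same idea with the predicate pulled out into a named function. The name is_low_cardinality is a hypothetical helper, not part of dplyr, and the data is a fixed stand-in for the question's example (the original uses sample(), so its values vary between runs).

```r
library(dplyr)

threshold <- 5

# Deterministic stand-in for the question's example data (assumption)
data <- data.frame(
  binary1    = rep(c(0, 1), 5),
  binary_2   = c(1, 0, 1, 0, 1, 1, 1, 1, 1, 0),
  multilevel = c(2L, 3L, 2L, 1L, 2L, 4L, 1L, 3L, 1L, 4L),
  numerical  = 1:10
)

# Hypothetical helper: where() calls it once per column and
# expects a single TRUE/FALSE back, hence && rather than &
is_low_cardinality <- function(x) is.numeric(x) && n_distinct(x) < threshold

result <- data %>%
  mutate(across(where(is_low_cardinality), factor))

# binary1, binary_2 and multilevel become "factor"; numerical stays "integer"
sapply(result, class)
```

A named predicate like this can be reused across several across() calls, at the cost of relying on threshold being visible where the function is defined.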

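To answer your bonus question: !is.factor is not a function the way is.factor is. It is an expression that tries to apply ! to the function object itself, which fails with "invalid argument type" before where() ever sees it. Wrap it in a lambda, as above.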

data %>% mutate(across(where(~!is.factor(.)), factor))
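
As a possible alternative to the lambda, base R's Negate() builds a new function from a predicate, which gives where() something it can call. The sketch below uses a small made-up data frame (df, not_factor and out are illustrative names) and first shows that evaluating !is.factor on its own already raises an error.

```r
library(dplyr)

# Made-up data: one numeric column, one column that is already a factor
df <- data.frame(a = c(0, 1, 0), b = factor(c("x", "y", "x")))

# Negating the function object itself is an error, independent of dplyr;
# tryCatch() captures the message instead of stopping the script
err <- tryCatch(!is.factor, error = function(e) conditionMessage(e))

# Negate() returns a function that flips the predicate's result
not_factor <- Negate(is.factor)

out <- df %>%
  mutate(across(where(not_factor), factor))

# Both columns are now factors
sapply(out, class)
```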

Answer 2

Using data.table:

library(data.table)
data1 <- setDT(data)[, lapply(.SD, function(x) 
        if(is.numeric(x) && uniqueN(x) < threshold) factor(x) else x)]
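
Note that setDT() converts data to a data.table by reference. If an in-place update is acceptable, a variant that touches only the qualifying columns might look like the sketch below. It assumes at least one column qualifies, and dt and cols are illustrative names with made-up data.

```r
library(data.table)

threshold <- 5

# Made-up data mirroring the question's column types
dt <- data.table(
  binary1    = rep(c(0, 1), 5),
  multilevel = c(2L, 3L, 2L, 1L, 2L, 4L, 1L, 3L, 1L, 4L),
  numerical  = 1:10
)

# Names of the numeric columns with fewer distinct values than the threshold
cols <- names(dt)[sapply(dt, function(x) is.numeric(x) && uniqueN(x) < threshold)]

# Convert only those columns, modifying dt by reference
dt[, (cols) := lapply(.SD, factor), .SDcols = cols]

# binary1 and multilevel are now "factor"; numerical stays "integer"
sapply(dt, class)
```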
