Automatic refactoring based on levels beginning with a certain character?

Question

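I am looking for a way to automatically recode the levels of a factor variable based on patterns in those levels. I intend to apply the solution across a larger dataset.

The larger dataset contains multiple instances of the example shown below. The levels tend to follow this pattern:

The main categories are 1, 2, 3, and 4. Levels 11, 12, 13, and 14 are subcategories of level 1, and I would like to streamline the grouping. I have successfully performed the recoding with fct_recode, but I want to extend the process to other variables that follow a similar coding pattern.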

library(tidyverse)

dat <- tribble(
  ~Ethnicity, 
  "1",
  "2",
  "3",
  "4",
  "11",
  "12",
  "13",
  "14",
  "11",
  "13",
  "12",
  "12",
  "11",
  "13")

dat <- mutate_at(dat, vars(Ethnicity), factor)

count(dat, Ethnicity)
#> # A tibble: 8 x 2
#>   Ethnicity     n
#>   <fct>     <int>
#> 1 1             1
#> 2 11            3
#> 3 12            3
#> 4 13            3
#> 5 14            1
#> 6 2             1
#> 7 3             1
#> 8 4             1

dat %>% 
  mutate(Ethnicity = fct_recode(Ethnicity,
                                "1" = "1",
                                "1" = "11",
                                "1" = "12",
                                "1" = "13",
                                "1" = "14"
                                )) %>% 
  count(Ethnicity)
#> # A tibble: 4 x 2
#>   Ethnicity     n
#>   <fct>     <int>
#> 1 1            11
#> 2 2             1
#> 3 3             1
#> 4 4             1

Created on 2019-05-31 by the reprex package (v0.2.1)

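As expected, this approach successfully groups the subcategories 11, 12, 13, and 14 into 1. Is there a way to do this without manually changing the level for each subcategory? And what would be a general way to extend this process to multiple variables with the same pattern? Thanks.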

Answer 1

One option is to create a named vector and splice it into the call with the unquote-splice operator (!!!):

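library(dplyr)
library(forcats)
# Levels whose first character is "1" (compare against a string, not a number)
lvls <- levels(dat$Ethnicity)[substr(levels(dat$Ethnicity), 1, 1) == "1"]
# Named vector of the form c("1" = "1", "1" = "11", ...), i.e. new = old
nm1 <- setNames(lvls, rep("1", length(lvls)))
dat %>% 
     mutate(Ethnicity = fct_recode(Ethnicity, !!!nm1)) %>% 
     count(Ethnicity)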
# A tibble: 4 x 2
#  Ethnicity     n
#  <fct>     <int>
#1 1            11
#2 2             1
#3 3             1
#4 4             1
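
To make the splicing step reusable for other prefixes, the same idea can be wrapped in a small function. This is only a sketch: `recode_prefix` and its arguments are hypothetical names, not part of forcats, and it assumes every level starting with `prefix` should be folded into the level named `prefix`.

```r
library(forcats)

# Hypothetical helper: fold every level that starts with `prefix`
# into a single level named `prefix`.
recode_prefix <- function(f, prefix) {
  lv  <- levels(f)
  old <- lv[startsWith(lv, prefix)]
  # Build c(prefix = old1, prefix = old2, ...) and splice it into fct_recode()
  fct_recode(f, !!!setNames(old, rep(prefix, length(old))))
}

# Small illustrative factor (made-up data)
f <- factor(c("1", "11", "12", "2", "21"))
recoded <- recode_prefix(f, "1")
table(recoded)
```

Levels that do not start with the prefix (here "2" and "21") are left untouched, so the function can be called once per main category.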

Another option is to set the levels directly, based on the substring:

levels(dat$Ethnicity)[substr(levels(dat$Ethnicity), 1, 1) == "1"] <- "1"
dat %>% 
   count(Ethnicity)

For multiple columns, use mutate_at and specify the variables of interest:

dat %>% 
    mutate_at(vars(colsOfInterest), list(~ fct_recode(., !!! nm1)))
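
If every column uses the same scheme, where the first character of a level identifies its main category, the levels can be collapsed for all prefixes at once rather than only for "1". The sketch below makes that assumption explicit; `collapse_to_prefix`, `dat2`, and the `Region` column are invented for illustration, and `across()` requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical helper: replace each level by its first `n` characters.
# Assigning duplicated labels to levels() merges the corresponding levels.
collapse_to_prefix <- function(f, n = 1) {
  f <- as.factor(f)
  levels(f) <- substr(levels(f), 1, n)
  f
}

# Toy data: `Region` is a made-up second column with the same coding pattern
dat2 <- tibble(
  Ethnicity = factor(c("1", "2", "3", "4", "11", "12", "13", "14")),
  Region    = factor(c("2", "21", "22", "3", "31", "1", "4", "41"))
)

collapsed <- dat2 %>%
  mutate(across(c(Ethnicity, Region), collapse_to_prefix))

count(collapsed, Ethnicity)
```

Unlike the named-vector approach, this does not depend on the levels of any one column, so it avoids "unknown levels" warnings when the columns contain different subcategories. It would be the wrong tool if some two-digit codes are genuine main categories.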

Answer 2

You can use fct_collapse together with grep and a regular expression, adjusting the pattern as needed:

dat %>%
  mutate(Ethnicity = fct_collapse(Ethnicity, 
                                  "1" = unique(grep("^1", Ethnicity, value = TRUE)))) %>%
  count(Ethnicity)

# A tibble: 4 x 2
  Ethnicity     n
  <fct>     <int>
1 1            11
2 2             1
3 3             1
4 4             1

Alternatively, and this feels a bit hacky, you can always use ifelse or case_when:

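# Convert to character first: passing the factor itself to ifelse() would
# return its integer codes (e.g. 6 for level "2") rather than its labels.
dat %>%
  mutate(Ethnicity = as.character(Ethnicity),
         Ethnicity = factor(ifelse(startsWith(Ethnicity, "1"), "1", Ethnicity))) %>%
  count(Ethnicity)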

# A tibble: 4 x 2
  Ethnicity     n
  <fct>     <int>
1 1            11
2 2             1
3 3             1
4 4             1
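
The case_when variant mentioned above is not shown in the answer; a minimal sketch of what it might look like follows. It rebuilds the question's data so that it runs on its own, and it converts the factor to character before matching so that both branches return the same type.

```r
library(dplyr)

# Same values as the example data in the question
dat <- tibble(
  Ethnicity = factor(c("1", "2", "3", "4", "11", "12", "13", "14",
                       "11", "13", "12", "12", "11", "13"))
)

recoded <- dat %>%
  mutate(
    Ethnicity = as.character(Ethnicity),
    Ethnicity = factor(case_when(
      startsWith(Ethnicity, "1") ~ "1",  # 1, 11, 12, 13, 14 -> "1"
      TRUE ~ Ethnicity                   # everything else unchanged
    ))
  )

count(recoded, Ethnicity)
```

Further main categories can be handled by adding one `startsWith()` condition per prefix ahead of the fallback line.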
