Edit (recode, collapse, order) factor levels with regex matching
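I find manipulating factor variables in R unduly complicated. Things I frequently want to do when cleaning factors include:
Resorting levels: not just to set a reference category, but to put all levels in a logical (non-alphabetical) order for summary tables.
x <- factor(x, levels = new.order)
Recoding/renaming factor levels: to simplify names and/or collapse multiple categories into one group. For one-to-one recoding, use levels(x) <- new.levels(x) or plyr::revalue; see here or here for examples. car::recode can do multiple one-to-many matches in a single statement, but doesn't support regex matching.
Dropping levels: not just dropping unused levels, but setting some levels to missing (e.g. those with error codes).
x <- factor(as.character(x), exclude = drop.levels)
Adding levels: to show categories with zero counts.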
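For concreteness, here is a small base R sketch of the four operations listed above, done one at a time. The data and level names are invented for illustration only.

```r
# Toy data; the values and level names are illustrative only.
x <- factor(c("low", "high", "mid", "error", "low"))

# 1. Reorder: put levels in a logical rather than alphabetical order.
x1 <- factor(x, levels = c("low", "mid", "high", "error"))

# 2. Recode/collapse: assigning a named list to levels() maps old levels
#    (the list values) onto new levels (the list names).
x2 <- x1
levels(x2) <- list(Low = "low", Other = c("mid", "high"), Error = "error")

# 3. Drop: turn a chosen level into NA (not just drop unused levels).
x3 <- factor(as.character(x), exclude = "error")

# 4. Add: declare a level with no observations so it appears in table().
x4 <- factor(x3, levels = c(levels(x3), "very high"))
table(x4)
```

Each step needs a different idiom, which is the inconsistency the question is about.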
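It would be great to have a single function that can do all of the above at once, allows fuzzy (regex) matching for recoding and dropping levels, can be used within other functions (e.g. lapply or dplyr::mutate), and has a simple, consistent syntax.
I've posted my best attempt at this as an answer below, but please let me know if I've missed a function that already exists or if the code can be improved.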
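To make the wish list concrete, here is a minimal base R sketch of regex-based recoding. The name recode_regex and its arguments are hypothetical, made up for this illustration; this is not the code from the answer, just one way the idea could look.

```r
# Hypothetical helper (not from any package): recode factor levels by regex.
# `patterns` is a named character vector: names are the new levels, values
# are regular expressions matched against the old levels.
# `exclude` is an optional regex; matching old levels become NA.
recode_regex <- function(x, patterns, exclude = NULL) {
  x <- as.factor(x)
  old <- levels(x)
  new <- old
  matched <- rep(FALSE, length(old))
  for (i in seq_along(patterns)) {
    hit <- grepl(patterns[[i]], old) & !matched  # first matching pattern wins
    new[hit] <- names(patterns)[i]
    matched <- matched | hit
  }
  if (!is.null(exclude)) {
    new[grepl(exclude, old)] <- NA               # set these levels to missing
  }
  # New levels first, in the order given, then untouched old levels.
  lev_order <- unique(c(names(patterns), new[!matched & !is.na(new)]))
  factor(new[as.integer(x)], levels = lev_order)
}

animals <- c("dogfish", "rabbit", "catfish", "mouse", "dirt")
y <- recode_regex(animals, c(Sea = "fish", Land = "rab|mou"))
z <- recode_regex(animals, c(Sea = "fish", Land = "rab|mou"), exclude = "di")
```

Because it takes a vector and returns a factor, a function of this shape can be dropped into lapply or dplyr::mutate without extra wrapping.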
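EDIT
I've since become aware of the forcats package, subtitled "Tools for Working with Categorical Variables (Factors)". It has many options for reordering levels ('fct_infreq', 'fct_reorder', 'fct_relevel', ...), recoding/grouping levels ('fct_recode', 'fct_lump', 'fct_collapse'), dropping levels ('fct_recode'), and adding levels ('fct_expand'). However, there are not yet plans to support regex matching (https://github.com/tidyverse/forcats/issues/214).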
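In the meantime, one possible workaround is to do the regex step yourself with grep() on the levels and pass the result to forcats. This is a sketch that assumes forcats is installed; the example data is made up.

```r
library(forcats)

f <- factor(c("dogfish", "rabbit", "catfish", "mouse", "dirt"))

# Resolve the regexes against the existing levels first ...
sea  <- grep("fish", levels(f), value = TRUE)
land <- grep("rab|mou", levels(f), value = TRUE)

# ... then hand the matched level names to fct_collapse().
g <- fct_collapse(f, Sea = sea, Land = land)

# Add a level with no observations so it shows up in tables.
h <- fct_expand(g, "Air")
table(h)
```

This works, but the matching and the recoding are still two separate steps, which is what a regex-aware function would fold into one.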
Edit: a few years later, I've put an xfactor function on GitHub that does the above. It's still a work in progress, so please let me know if you find any bugs.
devtools::install_github("jwilliman/xfactor")
library(xfactor)
# Create example factor
x <- xfactor(c("dogfish", "rabbit","catfish", "mouse", "dirt"))
levels(x)
#> [1] "catfish" "dirt" "dogfish" "mouse" "rabbit"
# Factor levels can be reordered by passing an unnamed vector to the levels
# argument. Levels not included in that vector get moved to the end,
# or dropped if exclude = TRUE.
xfactor(x, levels = c("mouse", "rabbit"))
#> [1] dogfish rabbit catfish mouse dirt
#> Levels: mouse rabbit catfish dirt dogfish
xfactor(x, levels = c("mouse", "rabbit"), exclude = TRUE)
#> [1] <NA> rabbit <NA> mouse <NA>
#> Levels: mouse rabbit
# Factor levels can be recoded, collapsed, and ordered by passing a named
# vector to the levels argument. The vector names are the new factor
# levels and the vector values are regular expressions for the old levels.
# Duplicated new levels will be collapsed.
xfactor(x, levels = c("Sea" = "fish", "Land" = "rab|mou"))
#> [1] Sea Land Sea Land dirt
#> Levels: Sea Land dirt
# Factor levels can be dropped by passing a regular expression (or vector)
# to the exclude argument
xfactor(x, exclude = "fish")
#> [1] <NA> rabbit <NA> mouse dirt
#> Levels: dirt mouse rabbit
# The function will work within other functions
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(n = 1:5, x)
df %>%
  mutate(y = xfactor(x, levels = c("Sea" = "fish", "Land" = "rab|mou", "Air"), exclude = "di"))
#> n x y
#> 1 1 dogfish Sea
#> 2 2 rabbit Land
#> 3 3 catfish Sea
#> 4 4 mouse Land
#> 5 5 dirt <NA>
Created on 2020-04-16 by the reprex package (v0.3.0)