dplyr::filter used with a function on string representation of factor

Question

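I have a data frame with about 20 columns and about 10^7 rows. One of the columns is an id column, which is a factor. I want to filter the rows by a property of the string representation of the factor levels. The code below does this, but it strikes me as really inelegant. In particular, I have to build a vector of the relevant ids, which I feel should not be necessary.

Any suggestions for simplifying this?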

library(dplyr)
library(tidyr)
library(gdata)

dat <- data.frame(id=factor(c("xxx-nld", "xxx-jap", "yyy-aus", "zzz-ita")))

europ.id <- function(id) {
  ctry.code <- substring(id, nchar(id)-2)
  ctry.code %in% c("nld", "ita")
}

ids <- levels(dat$id)
europ.ids <- subset(ids, europ.id(ids))

datx <- dat %>% filter(id %in% europ.ids) %>% drop.levels

Answer 1

Docento Discimus gave the correct answer in the comments. First, an explanation of the error I kept running into in my various attempts:

> dat %>% filter(europ.id(id))
Error in nchar(id) : 'nchar()' requires a character vector
Calls: %>% ... filter_impl -> .Call -> europ.id -> substring -> nchar
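The error above can be reproduced without dplyr, which makes it easier to see that the factor input is the cause rather than filter itself. The following is a minimal sketch using the ids from the question; the variable names res and n are only illustrative.

```r
id <- factor(c("xxx-nld", "xxx-jap", "yyy-aus", "zzz-ita"))

# nchar() refuses a factor; capture the error message instead of stopping.
res <- tryCatch(nchar(id), error = function(e) conditionMessage(e))
print(res)

# After an explicit conversion to character, nchar() works as expected.
n <- nchar(as.character(id))
print(n)

# The last three characters of each id are the country code.
ctry.code <- substring(as.character(id), n - 2)
print(ctry.code)
```

So the only step that needs a character vector here is the call to nchar.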

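Note also that his solution works because grepl applies as.character to its argument when needed (from the man page: "a character vector where matches are sought, or an object which can be coerced by as.character to a character vector"). The same implicit as.character conversion happens when you use %in%. Since this solution also performs perfectly well, we can do the following: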

dat %>% filter(europ.id(as.character(id))) %>% droplevels

Or, to make it more readable, update the function to

europ.id <- function(id) {
  ids <- as.character(id)
  ctry.code <- substring(ids, nchar(ids)-2)
  ctry.code %in% c("nld", "ita")
}

and use

dat %>% filter(europ.id(id)) %>% droplevels

which is exactly what I was looking for.
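For comparison, here is a sketch of two other ways to write the same filter. The grepl pattern is an assumption, since the exact expression from the comment is not reproduced in this answer. The second variant, which evaluates the predicate once per factor level and then indexes by the integer codes, is not from the comment either; it is shown only as one possible way to avoid converting all 10^7 values to character.

```r
library(dplyr)

dat <- data.frame(id = factor(c("xxx-nld", "xxx-jap", "yyy-aus", "zzz-ita")))

europ.id <- function(id) {
  ids <- as.character(id)
  ctry.code <- substring(ids, nchar(ids) - 2)
  ctry.code %in% c("nld", "ita")
}

# Reference result: the updated function applied directly to the factor.
by.fun <- dat %>% filter(europ.id(id)) %>% droplevels()

# Variant 1: grepl() coerces the factor itself (the pattern is an assumption).
by.grepl <- dat %>% filter(grepl("-(nld|ita)$", id)) %>% droplevels()

# Variant 2: test each level once, then look up the result by integer code.
by.level <- dat %>%
  filter(europ.id(levels(id))[as.integer(id)]) %>%
  droplevels()

# All three approaches should keep the same ids.
print(as.character(by.fun$id))
print(as.character(by.grepl$id))
print(as.character(by.level$id))
```

Whether the per-level variant is actually faster on a large data frame would need to be measured; it mainly helps when the number of levels is much smaller than the number of rows.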

