Using gsub in combination with filter() and grepl()

Question

I am trying to replace every Normal1, Normal2, and Normal3 with Normal.

df=data.frame(col1=1:4, col2=c("Normal", "Normal2", "Normal3", "Normal"))

When I try this:

df %>% filter(grepl("^Nor", col2)) %>% gsub("Normal.*", "Normal", df$col2)

I get the following warning:

Warning message: In gsub(., "Normal.*", "Normal", df$col2) : argument 'pattern' has length > 1 and only the first element will be used

How can I fix this? Thanks.

Answer 1

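There are two concepts at play here:

First, when you pipe data, %>% tells the next function to use the data produced by filter(grepl("^Nor", col2)) as its first argument. The argument order of gsub differs from that of tidyverse functions: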

grep, grepl, regexpr, gregexpr and regexec search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results. sub and gsub perform replacement of the first and all matches respectively.

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

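So the x argument, in third position, is where the vector whose "Normal" values you want to replace belongs. gsub has no way of knowing that you are trying to put the data somewhere other than the first argument. In your call, the piped data frame lands in pattern and every other argument shifts one position to the right, which is why the warning complains that pattern has length > 1.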
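To make the argument placement concrete, here is a minimal sketch, assuming dplyr is loaded (it re-exports %>% from magrittr) and reusing the df from the question. The names piped, direct, and cleaned are only for illustration.

```r
library(dplyr)  # provides %>% (re-exported from magrittr)

df <- data.frame(col1 = 1:4,
                 col2 = c("Normal", "Normal2", "Normal3", "Normal"))

# x %>% f(a) is evaluated as f(x, a): the piped value fills the first
# argument slot. For gsub, that slot is `pattern`, not `x`.
piped  <- c("Normal2", "Normal3") %>% paste("suffix")
direct <- paste(c("Normal2", "Normal3"), "suffix")
identical(piped, direct)

# Calling gsub directly, with the vector in the `x` position, behaves
# as intended.
cleaned <- gsub("Normal.*", "Normal", df$col2)
cleaned
```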

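Second, gsub works on a single column of data (a vector), but you are passing it a data frame. As written, your pipeline has:
- Step 1: a data frame
- Step 2: a data frame
- Step 3: a function that expects a vector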
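The difference between the two structures can be checked directly. This is a small sketch, assuming dplyr is loaded; step2 and col2_vec are illustrative names, and pull() is used here only as one way to extract a column as a vector.

```r
library(dplyr)

df <- data.frame(col1 = 1:4,
                 col2 = c("Normal", "Normal2", "Normal3", "Normal"))

# The output of filter() is still a data frame ...
step2 <- df %>% filter(grepl("^Nor", col2))
is.data.frame(step2)

# ... whereas gsub wants a vector, such as a single column.
col2_vec <- step2 %>% pull(col2)
is.data.frame(col2_vec)
length(col2_vec)
```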

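You can work with your data structures so that gsub takes part in the pipe directly (as you have now learned). m-dz's answer covers how to do this. Essentially, you need to tell your code to pass the data from the previous step somewhere other than the first argument of the next function.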
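One way to do that, sketched here under the assumption that dplyr is loaded, is the magrittr dot placeholder: when . appears as a direct argument, the piped value goes there instead of into the first slot. This is not necessarily the exact code from the answer mentioned above, and result is an illustrative name.

```r
library(dplyr)

df <- data.frame(col1 = 1:4,
                 col2 = c("Normal", "Normal2", "Normal3", "Normal"))

result <- df %>%
    filter(grepl("^Nor", col2)) %>%
    # turn the data frame into a plain vector before calling gsub
    pull(col2) %>%
    # the dot marks where the piped vector goes: the `x` argument
    gsub("Normal.*", "Normal", .)

result
```

Note that the result is a character vector, not a data frame, so col1 is no longer attached to it.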

That said, I strongly recommend the approach suggested by G. Grothendieck: put the data cleaning you do with gsub inside a mutate call.

I think this is the better approach for a couple of reasons:

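- It is explicit: many people use dplyr and know what mutate does. By putting your data-cleaning step inside mutate, you are telling other people (including future you): "in this step, I am modifying col2, and here's how I am modifying it."
- It makes it much easier to pass the data into an arbitrary position in gsub. mutate takes the data frame as its first argument and exposes its columns to the expressions that define or modify them, so you can refer to a column anywhere in a function call, not just in the first argument.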

I built a reproducible example from the iris dataset:

library(dplyr)

iris %>%
    # create a fake "col2" to demonstrate Normal1, Normal2, Normal3
    mutate(
        options = runif(nrow(iris)),
        col2 = ifelse(options > 0.333, "Normal2", "Normal1"),
        col2 = ifelse(options > 0.666, "Normal3", col2),
        options = NULL) %>%
    # keep only the virginica rows; columns can be named directly, no .$ needed
    filter(grepl("virginica", Species)) %>%
    # example of how wrapping gsub in mutate can accomplish the goal
    mutate(col2 = gsub("Normal.*", "Normal", col2))
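Applied to the df from the question, the same pattern might look like the sketch below, assuming dplyr is loaded. cleaned is an illustrative name.

```r
library(dplyr)

df <- data.frame(col1 = 1:4,
                 col2 = c("Normal", "Normal2", "Normal3", "Normal"))

cleaned <- df %>%
    # keep rows whose col2 starts with "Nor"
    filter(grepl("^Nor", col2)) %>%
    # gsub receives the col2 vector; mutate writes the result back
    mutate(col2 = gsub("Normal.*", "Normal", col2))

cleaned
```

Because gsub runs inside mutate, the result stays a data frame and col1 is preserved alongside the cleaned col2.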

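An alternative to mutate()

If you are really committed to not using mutate, you can also write your own function that wraps the call to gsub and takes the data frame as its first argument. An example might look like this: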

gsub_dataframe <- function(data, pattern, replacement, column) {
    data[column] <- gsub(pattern, replacement, data[[column]])
    return(data)
}

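I don't recommend this, though, because it adds more custom code to the analysis pipeline, whereas the mutate-based solution does the same thing and is already familiar to other users.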
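For completeness, here is how the wrapper could be dropped into the pipeline from the question. This is a sketch that assumes dplyr is loaded and repeats the gsub_dataframe definition so the snippet runs on its own; result is an illustrative name.

```r
library(dplyr)

# Wrapper from above: data frame first, so it fits the pipe convention.
gsub_dataframe <- function(data, pattern, replacement, column) {
    data[column] <- gsub(pattern, replacement, data[[column]])
    return(data)
}

df <- data.frame(col1 = 1:4,
                 col2 = c("Normal", "Normal2", "Normal3", "Normal"))

result <- df %>%
    filter(grepl("^Nor", col2)) %>%
    # the piped data frame fills `data`; the column is named as a string
    gsub_dataframe("Normal.*", "Normal", "col2")

result
```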
