Trying to understand grepl: why does grepl("of implementation", df$var) return fewer TRUE than grepl("implementation", df$var)?

Question

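I apologize in advance for the lack of reproducible code. I have a data frame called survey. One column of the data frame, survey$Q17, contains multi-part string responses (for example, "High costs of procurement, Lack of relevant technology"). I have been using grepl to create a new column for each possible response, with variations of the grepl(needle, haystack) call.

While trying to find all instances of "High costs of implementation", I came across the following counterintuitive result: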

survey$Q17.hcoi <- (
  grepl("implementation",
        survey$Q17)
  )
table(survey$Q17.hcoi == "TRUE")

This returns 27 TRUE. However, the following code...

survey$Q17.hcoi <- (
  grepl("of implementation",
        survey$Q17)
  )
table(survey$Q17.hcoi == "TRUE")

...returns 26 TRUE. The following code...

survey$Q17.hcoi <- (
  grepl("costs of implementation",
        survey$Q17)
  )
table(survey$Q17.hcoi == "TRUE")

...also returns 26 TRUE. Finally, the following code...

survey$Q17.hcoi <- (
  grepl("High costs of implementation",
        survey$Q17)
  )
table(survey$Q17.hcoi == "TRUE")

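...returns 0 TRUE.

This is puzzling. I assumed that a longer search phrase in grepl (e.g. "High costs of implementation") would do better than a shorter one (e.g. "implementation"). In this case it does not: the longest search phrase returns 0 TRUE, while the shortest returns 27.

Can anyone explain why this happens? Before using grepl, I ran trimws(survey$Q17) to strip extra whitespace, because I thought that might avoid some problems.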

Answer 1

This is a simple misunderstanding of regular expressions and how they work; I suggest reading the ?regex help page.

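Without anchors (such as ^ or $), a regular expression does not have to match the whole string; it matches if the pattern occurs anywhere in it. So 'implementation' means "match any string that contains 'implementation'". The string 'High costs of implementation' contains that substring. If you instead search for 'High costs of implementation', you are looking for strings that contain those words in exactly that order, with exactly that spacing and capitalization. So, for example, it will not match the string 'of implementation', because that string does not have 'High costs ' as a prefix. A longer pattern can therefore only match the same strings as a shorter one or fewer, never more.

If you want to match any string that contains any one of the words, you can use the regex OR operator |.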
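To make the substring behaviour concrete, here is a minimal sketch on a made-up vector. The responses vector is hypothetical and only stands in for survey$Q17; the counts apply to this toy data, not to the asker's survey.

```r
# Hypothetical responses; NOT the asker's real survey data.
responses <- c(
  "High costs of implementation",
  "high costs of implementation",       # lower-case "h"
  "Costs for implementation are high",  # no "of implementation"
  "Lack of relevant technology"
)

# Each pattern must occur as one contiguous, case-sensitive substring.
hits_short  <- grepl("implementation", responses)
hits_medium <- grepl("of implementation", responses)
hits_long   <- grepl("High costs of implementation", responses)

sum(hits_short)   # 3
sum(hits_medium)  # 2
sum(hits_long)    # 1

# Rows found by the short pattern but missed by the longer one
missed <- responses[hits_short & !hits_medium]
missed
```

Running the last line against the real column (with survey$Q17 in place of responses) is one way to see exactly which response accounts for the difference between 27 and 26.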

grepl('High|cost|of|implementation', X)

Replace X with your vector. Note that a space " " is itself a character that has to match, so ' of implementation' is not the same as 'of implementation'!
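As a rough illustration of the points about alternation and spaces, the sketch below uses another made-up vector (responses is hypothetical, not the asker's data). It also shows two possible ways to loosen a long pattern, ignore.case = TRUE and [[:space:]]+; whether capitalization or spacing is what actually causes the 0 matches in the survey is an assumption that would need checking against the real column.

```r
# Hypothetical responses; NOT the asker's real survey data.
responses <- c(
  "High costs of implementation",
  "high costs  of implementation",  # lower-case "h", two spaces after "costs"
  "Lack of relevant technology"
)

# Alternation is very permissive: one alternative anywhere is enough,
# so the third string matches because it contains "of".
any_word <- grepl("High|cost|of|implementation", responses)
any_word  # TRUE TRUE TRUE

# Ignoring case fixes the "h"/"H" mismatch, but the double space still fails.
ci <- grepl("high costs of implementation", responses, ignore.case = TRUE)
ci        # TRUE FALSE FALSE

# Allowing one or more whitespace characters between the words as well.
flex <- grepl("high[[:space:]]+costs[[:space:]]+of[[:space:]]+implementation",
              responses, ignore.case = TRUE)
flex      # TRUE TRUE FALSE
```

Note that trimws() only strips leading and trailing whitespace by default, so repeated spaces inside a response would survive it.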

