How to "unwrap" regexp helpers like [:digit:] in str_count?

Question

Suppose I have the following data and want to count the unique characters in each row.

test <- data.frame(oe = c("A-1", "111", "-", "Sie befassen sich intensiv damit"))

I thought I could use the [:graph:] helper to capture letters, digits, and punctuation. However, it gives the wrong result, see below:

library(tidyverse)
test %>%
  mutate(unique_chars_correct = sapply(tolower(oe), function(x) sum(str_count(x, c(letters, 0:9, "-")) > 0)),
         unique_chars_wrong   = sapply(tolower(oe), function(x) sum(str_count(x, "[:graph:]") > 0)))

This gives:

                                oe unique_chars_correct unique_chars_wrong
1                              A-1                    3                  1
2                              111                    1                  1
3                                -                    1                  1
4 Sie befassen sich intensiv damit                   13                  1

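I assume that [:graph:], used this way, only checks whether any character in the string belongs to [:graph:], whereas what I want is to check against each individual element that belongs to [:graph:].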

Answer 1

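[:graph:] gives the total number of matching characters; it does not distinguish unique characters. (The outputs in this answer appear to cover only the first three rows of test.)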

> str_count(test$oe, "[:graph:]")
[1] 3 3 1

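Inside the sapply call, str_count therefore returns a single number per string, so when we convert it to logical (> 0) and take the sum, it returns just 1.

It also does not distinguish between numbers, letters, and punctuation.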
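
To make the difference concrete, here is a minimal sketch comparing a single class pattern with a vector of single-character patterns. The names x, candidates, n_single, and n_each are only illustrative and do not come from the question.

```r
library(stringr)

x <- "a-1"

# One class pattern: a single total for the whole string.
n_single <- str_count(x, "[:graph:]")

# A vector of one-character patterns: one count per pattern.
candidates <- c(letters, 0:9, "-")
n_each <- str_count(x, candidates)

sum(n_single > 0)  # only one value to test, so at most 1
sum(n_each > 0)    # one TRUE per candidate character that occurs in x
```

In other words, the "correct" column works because str_count is vectorised over the pattern argument, while a character class is always a single pattern.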

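If we need the expected numbers, we can test each class separately and add up the results (note that this counts how many character classes occur in each string, not how many unique characters):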

Reduce(`+`, lapply(c("[:alpha:]", "[:digit:]", "[:punct:]"), 
        function(x) str_count(tolower(test$oe), x) >0) )
[1] 3 1 1

Alternatively, split each string into single characters and then apply [:graph:] to the unique values:

sapply(strsplit(tolower(test$oe), ""), function(x)
      sum(str_count(unique(x), "[:graph:]") > 0))
[1] 3 1 1
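
A related sketch, not part of the original answer: instead of splitting first, str_extract_all can pull out every character that matches the class, which is about as close to "unwrapping" the helper as it gets. The names graph_chars and unique_counts are made up for this example.

```r
library(stringr)

# Same data as in the question.
test <- data.frame(oe = c("A-1", "111", "-", "Sie befassen sich intensiv damit"))

# Extract every character matching the class; one list element per string.
graph_chars <- str_extract_all(tolower(test$oe), "[:graph:]")

# Count the distinct characters per string.
unique_counts <- lengths(lapply(graph_chars, unique))
unique_counts
```

For the question's data this should reproduce the unique_chars_correct column, since spaces are not part of [:graph:] and are therefore never extracted.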

Answer 2

You can use a backreference and a lookaround for this:

Data:

test <- data.frame(oe = c("A-1", "111", "-", "Abaa", "B cbb b"))

EDITED solution (this also takes into account whitespace, which is not counted, and case distinctions, which are ignored):

library(stringr)
str_count(test$oe, "(?i)([^\\s])(?!.*\\1)")
[1] 3 1 1 2 2

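How this works:

(?i): case-insensitive matching
([^\s]): capturing group that matches any character that is not whitespace
(?!: start of a negative lookahead; the current character is not matched, and so not counted by str_count, if it is followed by:
.*: any character, zero or more times, and then
\1: a backreference to whatever the capturing group ([^\s]) just matched; inside the negative lookahead this prevents a character from being matched and counted whenever it occurs again later in the string, so only its last occurrence counts
): end of the negative lookahead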
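
To see which characters the pattern actually keeps, the same regex can be passed to str_extract_all. This is only an illustrative sketch; the variable names pattern and kept are not from the answer.

```r
library(stringr)

# Same regex as above; backslashes are doubled inside an R string.
pattern <- "(?i)([^\\s])(?!.*\\1)"

# Each list element holds the characters that have no later repeat.
kept <- str_extract_all(c("Abaa", "B cbb b"), pattern)
kept
```

For "Abaa" only the "b" and the final "a" survive, and for "B cbb b" only the "c" and the final "b", which is why str_count reports 2 for both strings.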

EDIT:

Alternatively, you can use dplyr:

library(dplyr)
library(stringr)
test %>%
  mutate(
    # set to lower-case and remove whitespace:
    oe = tolower(gsub("\\s", "", oe)),
    # split the strings into separate chars:
    oe_splt = str_split(oe, ""),
    # count unique chars (lapply always returns a list, so lengths() is safe):
    count_unq = lengths(lapply(oe_splt, unique)))
