Count unique string patterns in a row

Question

I have an example:

dat <- read.table(text="index  string
1      'I have first and second'
2      'I have first, first'
3      'I have second and first and thirdeen'", header=TRUE)


toMatch <-  c('first', 'second', 'third')

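library(stringi)

# '\\b' is a regex word boundary; a single '\b' in an R string is a backspace character
dat$count <- stri_count_regex(dat$string, paste0('\\b', toMatch, '\\b', collapse="|"))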

dat

index                               string count
1     1              I have first and second     2
2     2                  I have first, first     2
3     3 I have second and first and thirdeen     2

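I would like to add a count column to the data frame that tells me how many unique words from toMatch each row contains. In this case, the desired output would be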

index                               string count
1     1              I have first and second     2
2     2                  I have first, first     1
3     3 I have second and first and thirdeen     2

Could you advise how to modify the original formula? Many thanks.

Answer 1

Using base R, you can do the following:

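# For each string, count how many of the keywords occur at least once as a whole word
sapply(dat$string, function(x) {
    sum(sapply(toMatch, function(y) grepl(paste0('\\b', y, '\\b'), x)))
}, USE.NAMES = FALSE)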

which returns

[1] 2 1 2
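
For what it's worth, the same idea can be written so that grepl runs once per keyword over the whole column, and the result is assigned straight to dat$count as the question asks. This is only a sketch: the data frame is rebuilt with data.frame() so the snippet is self-contained, and the intermediate name hits is my own, not something from the question.

```r
dat <- data.frame(
  index = 1:3,
  string = c("I have first and second",
             "I have first, first",
             "I have second and first and thirdeen"),
  stringsAsFactors = FALSE
)
toMatch <- c("first", "second", "third")

# One logical column per keyword: TRUE where that keyword occurs as a whole word
hits <- vapply(
  toMatch,
  function(y) grepl(paste0("\\b", y, "\\b"), dat$string),
  logical(nrow(dat))
)

# vapply returns a plain vector when dat has a single row, so force a matrix
hits <- matrix(hits, nrow = nrow(dat))

# Each keyword contributes at most 1 per row, so row sums are the distinct counts
dat$count <- rowSums(hits)
dat$count
```

Because each keyword can only contribute TRUE or FALSE per row, repeated occurrences such as "first, first" are counted once.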

Hope this helps!

Answer 2

We can use stri_match_all instead, which gives us the exact matches, and then count the distinct values with n_distinct from dplyr or, in base R, with length(unique(x)).

library(stringi)
library(dplyr)
sapply(stri_match_all(dat$string, regex = paste0('\\b', toMatch, '\\b',
                    collapse="|")), n_distinct)

#[1] 2 1 2

or similarly in base R

sapply(stri_match_all(dat$string, regex = paste0('\\b', toMatch, '\\b',
         collapse="|")), function(x) length(unique(x)))

#[1] 2 1 2
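
One caveat that may be worth handling: for a string that contains none of the keywords, stri_match_all returns an NA entry, so counting distinct values would give 1 rather than 0. A possible variant, sketched here under the assumption that such rows should count as 0, uses stri_extract_all_regex with omit_no_match = TRUE and writes the result into dat$count. The names pattern and matches are illustrative only, and the data frame is rebuilt so the snippet runs on its own.

```r
library(stringi)

dat <- data.frame(
  index = 1:3,
  string = c("I have first and second",
             "I have first, first",
             "I have second and first and thirdeen"),
  stringsAsFactors = FALSE
)
toMatch <- c("first", "second", "third")

# Alternation of whole-word patterns: \bfirst\b|\bsecond\b|\bthird\b
pattern <- paste0("\\b", toMatch, "\\b", collapse = "|")

# A list with one character vector of matches per string;
# omit_no_match = TRUE yields character(0) instead of NA when nothing matches
matches <- stri_extract_all_regex(dat$string, pattern, omit_no_match = TRUE)

# Number of distinct matched keywords in each row
dat$count <- vapply(matches, function(m) length(unique(m)), integer(1))
dat$count
```

With this version a row such as 'nothing relevant here' would get a count of 0, since its match vector is empty.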

Tags: r, stringi