Count unique string patterns in a row
I have an example:
library(stringi)
dat <- read.table(text="index string
1 'I have first and second'
2 'I have first, first'
3 'I have second and first and thirdeen'", header=TRUE)
toMatch <- c('first', 'second', 'third')
# '\\b' is the regex word boundary; a single '\b' would be a backspace character
dat$count <- stri_count_regex(dat$string, paste0('\\b',toMatch,'\\b', collapse="|"))
dat
index string count
1 1 I have first and second 2
2 2 I have first, first 2
3 3 I have second and first and thirdeen 2
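I want to add a count column to the data frame that tells me how many unique words from toMatch each row contains. In this case the desired output would be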
index string count
1 1 I have first and second 2
2 2 I have first, first 1
3 3 I have second and first and thirdeen 2
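Could you advise how to modify the original formula? Many thanks.
Using base R, you can do the following:
# For each string, test every word separately and count how many are present
sapply(dat$string, function(x)
{sum(sapply(toMatch, function(y) {grepl(paste0('\\b', y, '\\b'), x)}))}, USE.NAMES = FALSE)
which returns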
[1] 2 1 2
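A note on the pattern: in an R string literal the word-boundary anchor has to be written with a doubled backslash, because a single "\b" is parsed as the backspace character before the regex engine ever sees it. The short check below is only an illustration; the variable names are made up for this sketch and are not part of the answer.

```r
# "\b" is one character (backspace); "\\b" is two characters (backslash + b),
# which the regex engine reads as a word boundary.
backspace <- "\b"
boundary  <- "\\b"
nchar(backspace)
nchar(boundary)

# With word boundaries, 'third' does not match inside 'thirdeen'...
whole_word <- grepl(paste0(boundary, "third", boundary), "first and thirdeen")
# ...while a plain substring search does.
substring_hit <- grepl("third", "first and thirdeen")
whole_word
substring_hit
```

This is why row 3 of the example counts only 'second' and 'first'.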
Hope this helps!
We can use stri_match_all instead, which returns the actual matches, and then count the distinct values with n_distinct (from dplyr) or with length(unique(x)) in base R.
library(stringi)
library(dplyr)
sapply(stri_match_all(dat$string, regex = paste0('\\b',toMatch,'\\b',
collapse="|")), n_distinct)
#[1] 2 1 2
Or similarly, using base R's unique in place of n_distinct
sapply(stri_match_all(dat$string, regex = paste0('\\b',toMatch,'\\b',
collapse="|")), function(x) length(unique(x)))
#[1] 2 1 2
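One edge case worth keeping in mind: for a string that contains none of the words, the stringi match functions return NA by default, so counting distinct values would report 1 rather than 0. Below is a minimal sketch that drops NA before counting and writes the result back into the data frame, as the question asked. It swaps in stri_extract_all_regex (which returns plain character vectors) for stri_match_all; the helper name count_unique_matches and the fourth test row are assumptions added for illustration, not part of the original answers.

```r
library(stringi)

dat <- data.frame(
  index = 1:4,
  string = c("I have first and second",
             "I have first, first",
             "I have second and first and thirdeen",
             "nothing to match here"),   # extra row added to show the no-match case
  stringsAsFactors = FALSE
)
toMatch <- c("first", "second", "third")

# Hypothetical helper: number of distinct whole-word matches in each string.
count_unique_matches <- function(strings, words) {
  pattern <- paste0("\\b", words, "\\b", collapse = "|")
  matches <- stri_extract_all_regex(strings, pattern)
  # Strings without any match come back as NA; remove it before counting.
  vapply(matches, function(m) length(unique(m[!is.na(m)])), integer(1))
}

dat$count <- count_unique_matches(dat$string, toMatch)
dat$count
```

On the three original rows this gives the same counts as the answers above, and the added row without any match gets 0.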