Count number of exactly matching words in a string
I have an id column and a text_entry column that captures what people typed.
Goal: compare each person's text_entry with key and count the number of perfectly typed words.
For example, if my input is:
library(tidyverse)  # provides tribble()
df <- tribble(~id, ~text_entry,
1, "It was a Saturday night in December.",
2, " It was a Saturday night",
3, "It wuz a Sturday nite in",
4, "IT WAS A SATURDAY",
5, "was a Saturday"); df
key <- "It was a Saturday night in December."
then what I want is:
df2 <- tribble(~id, ~text_entry, ~words_correct,
1, "It was a Saturday night in December.", 7, # whole string perfect
2, " It was a Saturday night", 5, # first 5 words perfect
3, "It wuz a Sturday nite in", 3, # misspelled "was", "Saturday" and "night"
4, "IT WAS A SATURDAY", 0, # case-sensitive
5, "was a Saturday", 3); df2 # ok to start several words into the key
I'm not at all satisfied with stringr/stringi solutions. tidyverse is always preferred, but I'm desperate for any solution.
Many thanks in advance for your help and insight!
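One approach is to split the strings on whitespace and count the words they have in common with key.
library(tidyverse)
# Split the key into its individual words
keywords <- strsplit(key, '\\s+')[[1]]
df %>%
  # Split each entry on whitespace, then count the tokens found in the key
  mutate(text = str_split(text_entry, '\\s+'),
         words_correct = map_dbl(text, ~sum(.x %in% keywords))) %>%
  select(-text)  # drop the helper list column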
# A tibble: 5 x 3
# id text_entry words_correct
# <dbl> <chr> <dbl>
#1 1 "It was a Saturday night in December." 7
#2 2 " It was a Saturday night" 5
#3 3 "It wuz a Sturday nite in" 3
#4 4 "IT WAS A SATURDAY" 0
#5 5 "was a Saturday" 3
We can also do this in base R:
df$words_correct <- sapply(strsplit(df$text_entry, '\\s+'),
                           function(x) sum(x %in% keywords))
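For reference, the base R approach above can be wrapped in a small self-contained function. This is only a sketch: the name count_key_words and the trimws() call are my own additions, not part of the answer, and the inputs are copied from the question.

```r
# Hypothetical helper (name is illustrative): for each entry, count the
# whitespace-separated tokens that also appear among the words of the key.
count_key_words <- function(entries, key) {
  keywords <- strsplit(key, "\\s+")[[1]]
  # trimws() removes leading/trailing whitespace so no empty token is produced
  tokens <- strsplit(trimws(entries), "\\s+")
  vapply(tokens, function(x) sum(x %in% keywords), integer(1))
}

key <- "It was a Saturday night in December."
entries <- c("It was a Saturday night in December.",
             " It was a Saturday night",
             "It wuz a Sturday nite in",
             "IT WAS A SATURDAY",
             "was a Saturday")

counts <- count_key_words(entries, key)
counts
# 7 5 3 0 3
```

Note that %in% only tests membership: word order is ignored, and a word that is typed several times is counted each time it appears.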
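You can extract the non-space parts and pass them to str_detect().
library(tidyverse)
df %>%
  # Extract runs of non-whitespace, then test whether each one occurs in key
  mutate(words_correct = map_dbl(str_extract_all(text_entry, "[^\\s]+"),
                                 ~ sum(str_detect(key, .))))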
# # A tibble: 5 x 3
# id text_entry words_correct
# <dbl> <chr> <dbl>
# 1 1 "It was a Saturday night in December." 7
# 2 2 " It was a Saturday night" 5
# 3 3 "It wuz a Sturday nite in" 3
# 4 4 "IT WAS A SATURDAY" 0
# 5 5 "was a Saturday" 3
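One caveat with the str_detect() approach, as far as I can tell: each token is used as a regular expression and is matched anywhere inside key, so a fragment of a word, or a token containing regex metacharacters, can count as a hit. The short sketch below illustrates this with made-up tokens ("at" and "Decem.er" do not appear in the question's data) and compares against whole-word matching.

```r
library(stringr)

key <- "It was a Saturday night in December."

# Pattern matching: a substring or a regex wildcard is enough for a hit
substring_hit <- str_detect(key, "at")        # "at" occurs inside "Saturday"
wildcard_hit  <- str_detect(key, "Decem.er")  # "." matches any character

# Whole-token comparison: the token must equal one of the key's words
key_words <- str_split(key, "\\s+")[[1]]
substring_word <- "at" %in% key_words
wildcard_word  <- "Decem.er" %in% key_words

c(substring_hit, wildcard_hit, substring_word, wildcard_word)
# TRUE TRUE FALSE FALSE
```

For the sample data in the question both approaches give the same counts; the difference only shows up for entries like the ones above.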