Count only the first instance of a keyword in a list, with no duplicate counts, in R
I have a list of keywords:
library(stringr)
words <- as.character(c("decomposed", "no diagnosis","decomposition","autolysed","maggots", "poor body", "poor","not suitable", "not possible"))
I want to match these keywords against the text in a data frame column (df$text) and store the number of times each keyword appears in a separate data.frame (matchdf):
matchdf<- data.frame(Keywords=words)
m_match<-sapply(1:length(words), function(x) sum(str_count(tolower(df$text),words[[x]])))
matchdf$matchs<-m_match
However, I noticed that this method counts every occurrence of a keyword in the column. For example, the text
"The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time"
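returns a count of 2. However, I only want to count the first instance of "decomposed" within a field.
I thought there would be a way to count only the first instance using str_count, but there does not seem to be one.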
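For this example, stringr is not strictly necessary; grepl from base R is sufficient. That said, if you prefer the package's functions, use str_detect instead of grepl (as @Chi-Pak pointed out in the comments). Both return TRUE or FALSE depending on whether the keyword is present, so repeated occurrences in the same text are not counted twice.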
library(stringr)
words <- c("decomposed", "no diagnosis","decomposition","autolysed","maggots",
"poor body", "poor","not suitable", "not possible")
df <- data.frame( text = "The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time")
matchdf <- data.frame(Keywords = words, stringsAsFactors = FALSE)
# Base R grepl
matchdf$matches1 <- sapply(1:length(words), function(x) as.numeric(grepl(words[x], tolower(df$text))))
# Stringr function
matchdf$matches2 <- sapply(1:length(words), function(x) as.numeric(str_detect(tolower(df$text),words[[x]])))
matchdf
Result:
       Keywords matches1 matches2
1    decomposed        1        1
2  no diagnosis        0        0
3 decomposition        0        0
4     autolysed        0        0
5       maggots        0        0
6     poor body        0        0
7          poor        0        0
8  not suitable        0        0
9  not possible        0        0
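The example above uses a data frame with a single row. If df$text has several rows, one way to extend the same idea is to count the rows in which each keyword appears at least once. The following is a minimal sketch, not part of the original answer: the three-row df, the shortened words vector, and the helper count_rows are hypothetical and only for illustration, and fixed = TRUE is an assumption that the keywords should be matched as literal strings rather than regular expressions.

```r
# Hypothetical multi-row data; the original example has only one row.
df <- data.frame(
  text = c(
    "The sample was too decomposed to perform an analysis. The decomposed sample was old.",
    "Decomposed carcass with maggots present.",
    "No diagnosis was possible."
  ),
  stringsAsFactors = FALSE
)

# Shortened keyword list, for illustration only.
words <- c("decomposed", "no diagnosis", "maggots", "poor")

# grepl() returns one TRUE/FALSE per row, so repeated occurrences inside a
# single row count once; sum() then counts the rows that contain the keyword.
# fixed = TRUE matches the keyword literally instead of as a regex.
count_rows <- function(keyword, text) {
  sum(grepl(keyword, tolower(text), fixed = TRUE))
}

matchdf <- data.frame(
  Keywords = words,
  matches = vapply(words, count_rows, integer(1), text = df$text, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
matchdf
```

Here "decomposed" occurs three times in total but gets a count of 2, because it appears in two rows. Note that plain substring matching also means a keyword such as "poor" would be detected in any text containing "poor body".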