R subset/keep all rows with at least two specific text strings
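I have a data frame containing different text excerpts.
I would like to subset all observations that contain at least two terms from my small dictionary ("poverty|report|alarming|inflation"), or the same term twice (e.g. report occurring twice in the text).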
texts <- data.frame(text = c("report highlights that poverty is widespread", "there is inflation", "alarming reports", "thanks for listening"), id = 1:4, group = 4:7)
texts[grepl("poverty|report|alarming|inflation", texts$text, ignore.case = TRUE), ]
# I don't want this:
#                                          text id group
#1 report highlights that poverty is widespread  1     4
#2                           there is inflation  2     5
#3                             alarming reports  3     6
But I want this:
#                                          text id group
#1 report highlights that poverty is widespread  1     4
#3                             alarming reports  3     6
Does this work:
> library(stringr)
> library(dplyr)
> texts %>% filter(str_count(text, pattern = "poverty|report|alarming|inflation") > 1)
                                          text id group
1 report highlights that poverty is widespread  1     4
2                             alarming reports  3     6
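One thing to watch: the `grepl()` call in the question uses `ignore.case`, while `str_count()` matches case-sensitively by default. A minimal sketch of a case-insensitive variant, assuming stringr's `regex()` modifier with `ignore_case = TRUE`. The fifth row (`"Report on Inflation"`) is a hypothetical addition to the example data, included only to show the effect.

```r
library(stringr)

# Example data from the question, plus one hypothetical mixed-case row (id 5)
texts <- data.frame(
  text = c("report highlights that poverty is widespread", "there is inflation",
           "alarming reports", "thanks for listening", "Report on Inflation"),
  id = 1:5, group = 4:8, stringsAsFactors = FALSE
)

# Wrap the pattern in regex() so that matching ignores case
pattern <- regex("poverty|report|alarming|inflation", ignore_case = TRUE)

# Count dictionary matches per row and keep rows with at least two
n_hits <- str_count(texts$text, pattern)
result <- texts[n_hits >= 2, ]
print(result)
```

With a plain string pattern, the hypothetical fifth row would get zero matches, because `Report` and `Inflation` are capitalized.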
Try this base R approach:
# Data
texts <- data.frame(text = c("report highlights that poverty is widespread", "there is inflation", "alarming reports", "thanks for listening"), id = 1:4, group = 4:7, stringsAsFactors = FALSE)
# Index: split each text into words and count the words that match the dictionary
Index <- apply(texts[, 1, drop = FALSE], 1, function(x) sum(grepl("poverty|report|alarming|inflation",
                                                              unlist(strsplit(x, split = " ")),
                                                              ignore.case = TRUE)))
# Subset: keep rows with at least two matching words
texts[which(Index >= 2), ]
Output:
                                          text id group
1 report highlights that poverty is widespread  1     4
3                             alarming reports  3     6
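Note that the word-splitting approach counts matching words, so a single word containing two dictionary terms would count only once. If every occurrence should count, a vectorized base R alternative is possible. This is a sketch rather than part of the answer above, and it assumes `gregexpr()` with `regmatches()` and `lengths()`; the names `pattern`, `hits`, `n_hits` and `result` are illustrative.

```r
# Example data from the question
texts <- data.frame(
  text = c("report highlights that poverty is widespread", "there is inflation",
           "alarming reports", "thanks for listening"),
  id = 1:4, group = 4:7, stringsAsFactors = FALSE
)

pattern <- "poverty|report|alarming|inflation"

# gregexpr() locates every match in each string;
# regmatches() extracts them as a list with one character vector per row
hits <- regmatches(texts$text, gregexpr(pattern, texts$text, ignore.case = TRUE))

# lengths() gives the number of matches per row (0 when nothing matched)
n_hits <- lengths(hits)

# Keep rows with at least two matches
result <- texts[n_hits >= 2, ]
print(result)
```

This avoids the row-wise `apply()` call and keeps the original row names (1 and 3) in the result.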