Extract sentences from a character vector that satisfy two conditions in R
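Suppose we have loaded a full-text file into R as a character vector. I am looking for code that extracts all the text between two "." characters, provided that both "and the" and at least one "%" occur between those two periods.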
character <- as.character("Walmart stocks remained the same. Sony reported an increase, and the percent was posted at 1.0%. And the google also remained the same. And the percent of increase for Best Buy was 2.5%.")
Looking at this short example, I would like to get output along the lines of:
[1] Sony reported an increase, and the percent was posted at 1.0%.
[2] And the percent of increase for Best Buy was 2.5%.
A quick solution:
library(magrittr)
"Walmart stocks remained the same. Sony reported an increase, and the percent was posted at 1.0%. And the google also remained the same. And the percent of increase for Best Buy was 2.5%." %>%
  ## split the string at the sentence boundaries:
  ## mark each period followed by whitespace with a tab, then split on the tab
  ## (backslashes must be doubled inside R string literals)
  gsub("\\.\\s", ".\t", .) %>%
  strsplit("\t") %>% unlist() %>%
  ## keep only sentences that contain "and the" (irrespective of case)
  grep("and the", x = ., value = TRUE, ignore.case = TRUE) %>%
  ## keep only the sentences that end with %.
  grep("%\\.$", x = ., value = TRUE) %>%
  ## remove leading white spaces
  gsub("^\\s?", "", x = .)
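The same idea can also be sketched in base R without the pipe. This variant is only an illustration, not part of the original answer: it checks for a "%" anywhere in the sentence instead of only at the end, which is closer to the "at least one %" wording of the question. The names `txt`, `sentences`, `keep`, and `result` are made up for this example.

```r
txt <- "Walmart stocks remained the same. Sony reported an increase, and the percent was posted at 1.0%. And the google also remained the same. And the percent of increase for Best Buy was 2.5%."

# Split on whitespace that follows a period. The lookbehind keeps the period
# attached to its sentence and needs perl = TRUE. "1.0%" is left intact
# because its period is not followed by whitespace.
sentences <- unlist(strsplit(txt, "(?<=\\.)\\s+", perl = TRUE))

# Keep sentences that contain "and the" (any case) AND at least one "%".
keep <- grepl("and the", sentences, ignore.case = TRUE) &
  grepl("%", sentences, fixed = TRUE)

result <- sentences[keep]
print(result)
```

Because both conditions are evaluated as logical vectors and combined with `&`, further conditions can be added as extra `grepl()` terms. Note that splitting on a period followed by whitespace will also break on abbreviations such as "Inc. " in real text, so this is a rough sentence splitter.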