Keyword repeated multiple times in context from string in R

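I have a dataset (z) with long strings in z$txt, and a dictionary of keywords (incd) that I need to identify. In the z$inc.terms column I need every keyword with the 5 characters before and after it, so that I can see the keyword in its context. The same keyword may be repeated n times in the same string, so I need this for every occurrence.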

#CREATE "z" DATASET
z<-data.frame(matrix("",3,3))
names(z)<-c("row","txt","inc.terms")
z$row<-c(1,2,3)
z[1,2]<-"I like the sky when the sky is blu not when the sky is grey"
z[2,2]<-"I like the mountains when the sky is blu not when the mountains are cloudy"
z[3,2]<-"I like the sky when the sky is dark in the mountains"

incd<-c("sky","mountains")                       #inclusion dictionary

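This is what I managed to achieve, but it only returns the first keyword and I need every one. (It does not actually work on this example either, and I do not know why. It does run on my original data, which is more complex and cannot be shared for data protection reasons.)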

for(i in incd){
   for(j in z$row){
     z$inc.terms[z$row==j]<-paste(z$inc.term[z$row==j],paste(stringr::str_sub(stringr::str_split(z$txt[z$row==j],i,simplify=TRUE)[,1],-5,-1),i,stringr::str_sub(stringr::str_split(z$txt[z$row==j],i,simplify=TRUE)[,2],1,5)),sep=" /// ")
 }
}

This is what I have been using, but it returns only the first occurrence of each keyword in each cell rather than every occurrence.

I would like z$inc.terms to end up like this:

z[1,3]  " the sky when" /// " the sky is b" /// " the sky is g"
z[2,3]  " the mountains when" /// " the sky is b" /// " the mountains are "
z[3,3]  " the sky when" /// " the sky is d" /// " the mountains"

If you are using base R, you can try regmatches:
transform(
    z,
    inc.terms = regmatches(
        txt,
        gregexpr(
            sprintf(".{0,5}(%s).{0,5}", paste0(incd, collapse = "|")),
            txt
        )
    )
)

This gives

  row
1   1
2   2
3   3
                                                                         txt
1                I like the sky when the sky is blu not when the sky is grey
2 I like the mountains when the sky is blu not when the mountains are cloudy
3                       I like the sky when the sky is dark in the mountains
                                                inc.terms
1              the sky when,  the sky is b,  the sky is g
2  the mountains when,  the sky is b,  the mountains are
3             the sky when,  the sky is d,  the mountains
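
The inc.terms column above is a list column, while the question asks for one string per row with the matches separated by " /// ". A minimal sketch of that last step, assuming the same example data; the names pattern, hits and inc_terms are illustrative and not part of the original answer:

```r
# Example strings and inclusion dictionary from the question
txt <- c(
  "I like the sky when the sky is blu not when the sky is grey",
  "I like the mountains when the sky is blu not when the mountains are cloudy",
  "I like the sky when the sky is dark in the mountains"
)
incd <- c("sky", "mountains")

# One alternation pattern with up to 5 characters of context on each side
pattern <- sprintf(".{0,5}(%s).{0,5}", paste0(incd, collapse = "|"))

# regmatches() returns one character vector of matches per input string
hits <- regmatches(txt, gregexpr(pattern, txt))

# Collapse each vector into a single " /// "-separated string
inc_terms <- vapply(hits, paste, character(1), collapse = " /// ")
inc_terms
```

The result is a plain character vector, so it can be assigned directly with z$inc.terms <- inc_terms. A row with no match ends up as an empty string.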

Here is a concise solution using dplyr and stringr:

library(dplyr)
library(stringr)

z<-data.frame(matrix("",3,3))
names(z)<-c("row","txt","inc.terms")
z$row<-c(1,2,3)
z[1,2]<-"I like the sky when the sky is blu not when the sky is grey"
z[2,2]<-"I like the mountains when the sky is blu not when the mountains are cloudy"
z[3,2]<-"I like the sky when the sky is dark in the mountains"

incd<-c("sky","mountains")     
words <- paste(incd, collapse="|")

# .{0,5} allows up to 5 context characters, so keywords near the
# start or end of a string are still matched
z <- z %>% 
  mutate(inc.terms = str_extract_all(txt, paste0(".{0,5}(", words, ").{0,5}")))
z
#>   row
#> 1   1
#> 2   2
#> 3   3
#>                                                                          txt
#> 1                I like the sky when the sky is blu not when the sky is grey
#> 2 I like the mountains when the sky is blu not when the mountains are cloudy
#> 3                       I like the sky when the sky is dark in the mountains
#>                                                 inc.terms
#> 1              the sky when,  the sky is b,  the sky is g
#> 2  the mountains when,  the sky is b,  the mountains are 
#> 3             the sky when,  the sky is d,  the mountains
z$inc.terms
#> [[1]]
#> [1] " the sky when" " the sky is b" " the sky is g"
#> 
#> [[2]]
#> [1] " the mountains when" " the sky is b"       " the mountains are "
#> 
#> [[3]]
#> [1] " the sky when"  " the sky is d"  " the mountains"

Created on 2022-05-06 by the reprex package (v2.0.1)
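
One detail about the context width is worth spelling out: `.{5}` demands exactly five characters on each side of the keyword, whereas `.{0,5}` accepts up to five. The difference shows when a keyword sits within five characters of the start or end of a string, as "mountains" does in row 3. A small base R sketch of the comparison (the variable names are illustrative; the same quantifier behaviour should apply to stringr::str_extract_all):

```r
# Row 3 of the example: "mountains" is the last word, with nothing after it
txt <- "I like the sky when the sky is dark in the mountains"
kw  <- "sky|mountains"

# Exactly 5 characters of context required on both sides
exact <- regmatches(txt, gregexpr(paste0(".{5}(", kw, ").{5}"), txt))[[1]]

# Between 0 and 5 characters of context allowed on both sides
bounded <- regmatches(txt, gregexpr(paste0(".{0,5}(", kw, ").{0,5}"), txt))[[1]]

length(exact)    # 2: the trailing "mountains" is dropped
length(bounded)  # 3: " the mountains" is kept
```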