Keyword repeated multiple times in context from string in R
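I have a dataset (z) with long strings in z$txt. I also have a dictionary of keywords to identify (incd). In the z$inc.terms column, I need every occurrence of every keyword (the same keyword may appear n times in one string, so I need each occurrence) together with the 5 characters before and after it, so that I can see the keyword in its context.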
#CREATE "z" DATASET
z<-data.frame(matrix("",3,3))
names(z)<-c("row","txt","inc.terms")
z$row<-c(1,2,3)
z[1,2]<-"I like the sky when the sky is blu not when the sky is grey"
z[2,2]<-"I like the mountains when the sky is blu not when the mountains are cloudy"
z[3,2]<-"I like the sky when the sky is dark in the mountains"
incd<-c("sky","mountains") #inclusion dictionary
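This is what I managed to put together, but it only returns the first occurrence of each keyword, and I need every occurrence. (In fact it does not even work on this example, and I am not sure why, but it does run on my original data, which is more complex and which I cannot share for data-protection reasons.)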
for(i in incd){
for(j in z$row){
z$inc.terms[z$row==j]<-paste(z$inc.term[z$row==j],paste(stringr::str_sub(stringr::str_split(z$txt[z$row==j],i,simplify=TRUE)[,1],-5,-1),i,stringr::str_sub(stringr::str_split(z$txt[z$row==j],i,simplify=TRUE)[,2],1,5)),sep=" /// ")
}
}
This is what I have been using, but it returns only the first occurrence of each keyword in each cell rather than every occurrence.
I would like z$inc.terms to look like this:
z[1,3] " the sky when" /// " the sky is b" /// " the sky is g"
z[2,3] " the mountains when" /// " the sky is b" /// " the mountains are "
z[3,3] " the sky when" /// " the sky is d" /// " the mountains"
If you are using base R, you can try regmatches:
transform(
  z,
  inc.terms = regmatches(
    txt,
    gregexpr(
      sprintf(".{0,5}(%s).{0,5}", paste0(incd, collapse = "|")),
      txt
    )
  )
)
which gives
row
1 1
2 2
3 3
txt
1 I like the sky when the sky is blu not when the sky is grey
2 I like the mountains when the sky is blu not when the mountains are cloudy
3 I like the sky when the sky is dark in the mountains
inc.terms
1 the sky when, the sky is b, the sky is g
2 the mountains when, the sky is b, the mountains are
3 the sky when, the sky is d, the mountains
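If a single string per row in the " /// "-separated form from the question is preferred over a list column, the matches can be collapsed afterwards. Here is a minimal base R sketch, rebuilding z with data.frame() so it runs on its own (the intermediate names pattern and hits are only illustrative):

```r
# Same example data as in the question, built in one call
z <- data.frame(
  row = 1:3,
  txt = c(
    "I like the sky when the sky is blu not when the sky is grey",
    "I like the mountains when the sky is blu not when the mountains are cloudy",
    "I like the sky when the sky is dark in the mountains"
  ),
  stringsAsFactors = FALSE
)
incd <- c("sky", "mountains")  # inclusion dictionary

# Up to 5 characters of context on each side of any keyword
pattern <- sprintf(".{0,5}(%s).{0,5}", paste0(incd, collapse = "|"))

# One character vector of matches per row
hits <- regmatches(z$txt, gregexpr(pattern, z$txt))

# Collapse each row's matches into a single " /// "-separated string
z$inc.terms <- vapply(hits, paste, character(1), collapse = " /// ")
z$inc.terms
```

Two things to keep in mind: the matches found by gregexpr do not overlap, so when two keywords sit within five characters of each other the second one is absorbed into the context of the first instead of being reported separately, and keywords containing regex metacharacters would need escaping before being pasted into the pattern.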
Here is a concise solution using dplyr and stringr:
library(dplyr)
library(stringr)
z<-data.frame(matrix("",3,3))
names(z)<-c("row","txt","inc.terms")
z$row<-c(1,2,3)
z[1,2]<-"I like the sky when the sky is blu not when the sky is grey"
z[2,2]<-"I like the mountains when the sky is blu not when the mountains are cloudy"
z[3,2]<-"I like the sky when the sky is dark in the mountains"
incd<-c("sky","mountains")
words <- paste(incd, collapse="|")
z <- z %>%
  mutate(inc.terms = str_extract_all(txt, paste0(".{5}(", words, ").{5}")))
z
#> row
#> 1 1
#> 2 2
#> 3 3
#> txt
#> 1 I like the sky when the sky is blu not when the sky is grey
#> 2 I like the mountains when the sky is blu not when the mountains are cloudy
#> 3 I like the sky when the sky is dark in the mountains
#> inc.terms
#> 1 the sky when, the sky is b, the sky is g
#> 2 the mountains when, the sky is b, the mountains are
#> 3 the sky when, the sky is d
z$inc.terms
#> [[1]]
#> [1] " the sky when" " the sky is b" " the sky is g"
#>
#> [[2]]
#> [1] " the mountains when" " the sky is b" " the mountains are "
#>
#> [[3]]
#> [1] " the sky when" " the sky is d"
Created on 2022-05-06 by the reprex package (v2.0.1)
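One caveat with this pattern: .{5} demands exactly five characters on each side, so a keyword closer than five characters to either end of the string is skipped. That is why row 3 above has only two matches and " the mountains" is missing. Here is a minimal sketch, assuming stringr is installed, that compares it with the more forgiving .{0,5} (the names strict and lenient are just for illustration):

```r
library(stringr)

txt <- "I like the sky when the sky is dark in the mountains"
words <- paste(c("sky", "mountains"), collapse = "|")

# Exactly 5 characters required on both sides of the keyword
strict <- str_extract_all(txt, paste0(".{5}(", words, ").{5}"))[[1]]

# Between 0 and 5 characters allowed on both sides
lenient <- str_extract_all(txt, paste0(".{0,5}(", words, ").{0,5}"))[[1]]

length(strict)   # 2: the trailing "mountains" has nothing after it
length(lenient)  # 3: " the mountains" is now included
```

Swapping .{5} for .{0,5} inside the mutate() call should therefore give the three matches per row asked for in the question.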