String split and conditional paste
I am working with a data frame like the following:
id Comments
1 The apple fell far from the mango tree
2 I was born under a mango tree and a wandering star
3 Mules are made for packing and Mangoes for eating
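I am interested in the word mango together with its context: 4 words on each side counting mango itself, that is, up to 3 words before it and up to 3 words after it.
The final dataset would look like this: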
id Comments
1 far from the mango tree
2 born under a mango tree and a
3 for packing and Mangoes for eating
Here is a reproducible dataset for testing:
df <- read.table(text="Id,Comment
1,The apple fell far from the mango tree
2,I was born under a mango tree and a wandering star
3,Mules are made for packing and Mangoes for eating", header=TRUE, sep=",")
Any insights on this would be much appreciated.
I used the very nice stringi package and a regular expression:
library(stringi)
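# Backslashes must be doubled inside an R string literal
# Apply the pattern to each row; myrow[2] is the Comment column
apply(df, 1, function(myrow){
  stri_match_all_regex(myrow[2], "(\\p{L}+\\p{Z}){0,3}(mango\\p{L}*|Mango\\p{L}*)(\\p{Z}\\p{L}+){0,3}")[[1]][1,1]
})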
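So I match 0 to 3 words before mango, (\p{L}+\p{Z}){0,3}, then mango or Mango followed by any extra letters, (mango\p{L}*|Mango\p{L}*), and then another 0 to 3 words after it, (\p{Z}\p{L}+){0,3}.
Here \p{Z} is a separator (such as a space) and \p{L} is a letter. Inside an R string literal each backslash has to be written as \\.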
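The same pattern can also be applied to the whole column at once instead of row by row. The following is only a sketch, assuming stringi is installed; it swaps in stri_extract_first_regex, which is vectorised and returns NA where nothing matches, and it shortens the alternation to [mM]ango. The names comments, pattern and context are illustrative, not from the original answer.

```r
library(stringi)

# Same three comments as in the question's test data
comments <- c(
  "The apple fell far from the mango tree",
  "I was born under a mango tree and a wandering star",
  "Mules are made for packing and Mangoes for eating"
)

# Up to 3 words before, the mango/Mango word, up to 3 words after.
# Backslashes are doubled because this is an R string literal.
pattern <- "(\\p{L}+\\p{Z}){0,3}[mM]ango\\p{L}*(\\p{Z}\\p{L}+){0,3}"

# One call handles every element; non-matching elements become NA
context <- stri_extract_first_regex(comments, pattern)
print(context)
```

Because the regex engine takes the leftmost match, the window naturally starts at most 3 words before the first mango in each comment.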
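This seems to work:
sapply(
  strsplit(as.character(df$Comment), " "),
  function(x){
    # index of the first word containing mango or Mango
    w = grep("[mM]ango", x)[1]
    # keep up to 3 words on either side of it
    paste(x[ seq(max(1,w-3), min(length(x),w+3)) ], collapse=" ")
  }
)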
# [1] "far from the mango tree"
# [2] "born under a mango tree and a"
# [3] "for packing and Mangoes for eating"
grep(...)[1] means that only the first mango match in each comment is used.
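One thing to watch with grep(...)[1]: when a comment contains no mango at all, it yields NA, and the seq() call then fails. A minimal sketch of a guarded variant follows; the helper extract_context, its parameter n, and the extra "No fruit here" comment are hypothetical additions for illustration, not part of the original answer.

```r
# Hypothetical helper: first match of `pattern` plus up to n words on each side
extract_context <- function(comment, pattern = "[mM]ango", n = 3) {
  words <- strsplit(comment, " ", fixed = TRUE)[[1]]
  hit <- grep(pattern, words)[1]           # NA when no word matches
  if (is.na(hit)) return(NA_character_)    # guard against the no-match case
  from <- max(1, hit - n)
  to <- min(length(words), hit + n)
  paste(words[from:to], collapse = " ")
}

comments <- c(
  "The apple fell far from the mango tree",
  "I was born under a mango tree and a wandering star",
  "Mules are made for packing and Mangoes for eating",
  "No fruit here"                          # made-up row without a match
)

result <- vapply(comments, extract_context, character(1), USE.NAMES = FALSE)
print(result)
```

Returning NA for rows without a match keeps the result the same length as the input, so it can be assigned straight back to a data frame column.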