How to isolate a word next to a specified word
My data frame has various strings. See the example df:
strings <- c("Average complications and higher payment",
"Average complications and average payment",
"Average complications and lower payment",
"Average mortality and higher payment",
"Better mortality and average payment")
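df <- data.frame(strings, stringsAsFactors = FALSE)
I want to separate out the first word and the second-to-last word of each sentence. The second-to-last word always comes right before the word "payment".
This is what I would like df to look like: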
strings <- c("Average complications and higher payment",
"Average complications and average payment",
"Average complications and lower payment",
"Average mortality and higher payment",
"Better mortality and average payment")
QualityWord <- c("Average","Average","Average","Average","Better")
PaymentWord <- c("Higher","Average","Lower","Higher","Average")
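desireddf <- data.frame(strings, QualityWord, PaymentWord, stringsAsFactors = FALSE)
The resulting strings do not need to be case sensitive.
I was able to write code to get the first word of the sentence (splitting on a space), but I can't figure out how to pull the word to the left (or right, for that matter) of a reference word, which in this case is "payment".
Using the strsplit, head, and tail functions: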
outDF <- do.call(rbind, lapply(df$strings, function(x) {
  # split the string into words on spaces
  strObj <- unlist(strsplit(x, split = " "))
  # one-row data frame: first word and second-to-last word
  data.frame(strings = x,
             QualityWord = head(strObj, 1),
             PaymentWord = head(tail(strObj, 2), 1),
             stringsAsFactors = FALSE)
}))
outDF
#                                    strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average
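The split-and-index idea above can also be written without building one data frame per row. This is only a sketch, assuming every string has at least two space-separated words; the intermediate name `parts` is made up for illustration.

```r
strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

# Split every string on single spaces in one vectorized call;
# the result is a list with one character vector of words per string
parts <- strsplit(df$strings, " ", fixed = TRUE)

# First word of each string
df$QualityWord <- vapply(parts, function(p) p[1], character(1))

# Second-to-last word (assumes at least two words per string)
df$PaymentWord <- vapply(parts, function(p) p[length(p) - 1], character(1))

df
```

Adding the columns directly to df avoids the do.call(rbind, ...) step, which tends to get slow on large inputs.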
Or, using dplyr and a custom function:
library(dplyr)

customFn <- function(x) {
  strObj <- unlist(strsplit(x, split = " "))
  # return a one-row data frame explicitly
  data.frame(strings = x,
             QualityWord = head(strObj, 1),
             PaymentWord = head(tail(strObj, 2), 1),
             stringsAsFactors = FALSE)
}

df %>%
  dplyr::rowwise() %>%
  dplyr::do(customFn(.$strings))
Alternatively, with base R's sub and regular expressions:
df$QualityWord <- sub("(\\w+).*?$", "\\1", df$strings)
df$PaymentWord <- sub(".*?(\\w+) payment$", "\\1", df$strings)
df
#>                                     strings QualityWord PaymentWord
#> 1  Average complications and higher payment     Average      higher
#> 2 Average complications and average payment     Average     average
#> 3   Average complications and lower payment     Average       lower
#> 4      Average mortality and higher payment     Average      higher
#> 5      Better mortality and average payment      Better     average
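Explanation of the regex terms (inside an R string each backslash is doubled, so \w is written "\\w" and \1 is written "\\1"):
(\w+) = match a word character one or more times, captured as a group
.*? = match anything, non-greedy
" payment" = match a space, then the characters payment
$ = match the end of the string
\1 = replace the whole match with the contents of the first group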
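Since the question asks about pulling the word next to an arbitrary reference word, the same capture-group idea can be wrapped in a small helper. This is a rough sketch, not part of the original answer: `word_before` and `capitalize` are hypothetical names, and the helper assumes the reference word contains no regex metacharacters.

```r
strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")

# Hypothetical helper: the word immediately before `ref` in each string,
# or NA when `ref` (preceded by another word) is not found.
# Assumes `ref` is a plain word with no regex metacharacters.
word_before <- function(x, ref) {
  pattern <- paste0("(\\w+)\\s+", ref, "\\b")
  hits <- regmatches(x, regexec(pattern, x, ignore.case = TRUE, perl = TRUE))
  # each hit is c(full match, group 1), or character(0) when nothing matched
  vapply(hits,
         function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1))
}

# Hypothetical helper: upper-case the first letter, as in the desired output
capitalize <- function(w) {
  paste0(toupper(substring(w, 1, 1)), substring(w, 2))
}

payment_word <- word_before(strings, "payment")
capitalize(payment_word)
```

To get the word after the reference word instead, the pattern could be flipped to paste0("\\b", ref, "\\s+(\\w+)").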
We can use extract from tidyr:
library(tidyverse)
df %>%
  extract(strings, into = c("QualityWord", "PaymentWord"),
          "^(\\w+).*\\b(\\w+)\\s+\\w+$", remove = FALSE)
#                                    strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average