How to isolate a word next to a specified word

Question

My data frame contains various strings. Here is an example df:

strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

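I want to pull out the first word and the second-to-last word of each sentence. The second-to-last word always comes right before the word "payment".

This is what I want the df to look like: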

strings <- c("Average complications and higher payment",
        "Average complications and average payment",
        "Average complications and lower payment",
        "Average mortality and higher payment",
        "Better mortality and average payment")
QualityWord <- c("Average","Average","Average","Average","Better")
PaymentWord <- c("Higher","Average","Lower","Higher","Average")
desireddf <- data.frame(strings, QualityWord, PaymentWord, stringsAsFactors = F)

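The case of the resulting words does not matter.

I was able to write code that gets the first word of the sentence (by splitting at the space), but I can't figure out how to pull the word to the left (or right, for that matter) of a reference word, which in this case is "payment".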

Answer 1

Using the strsplit, head and tail functions:

outDF = do.call(rbind, lapply(df$strings, function(x) {

  # split the string into words on spaces
  strObj = unlist(strsplit(x, split = " "))

  # one-row data frame: first word and second-to-last word
  data.frame(strings = x, QualityWord = head(strObj, 1), PaymentWord = head(tail(strObj, 2), 1), stringsAsFactors = FALSE)

}))

outDF
#                                    strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average

Or:

Using dplyr and a custom function:

library(dplyr)

customFn = function(x) {
  strObj = unlist(strsplit(x, split = " "))
  data.frame(strings = x, QualityWord = head(strObj, 1), PaymentWord = head(tail(strObj, 2), 1), stringsAsFactors = FALSE)
}

df %>%
  dplyr::rowwise() %>%
  dplyr::do(customFn(.$strings))
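
The same split-and-index idea can also be written without building one data frame per row. The following is only a minimal base R sketch, assuming the example `df` from the question (rebuilt here so the block runs on its own) and that every string has at least two space-separated words. The helper name `nth_from_end` is hypothetical and not part of the original answer.

```r
# Rebuild the example data from the question so the block is self-contained.
strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

# Split every string on single spaces; the result is a list of character vectors.
words <- strsplit(df$strings, split = " ", fixed = TRUE)

# Hypothetical helper: pick the n-th word counted from the end (n = 1 is the last word).
nth_from_end <- function(w, n) w[length(w) - n + 1]

# First word of each string, and the word just before the last one.
df$QualityWord <- vapply(words, function(w) w[1], character(1))
df$PaymentWord <- vapply(words, nth_from_end, character(1), n = 2)
```

Using `vapply` with `character(1)` keeps the result a plain character vector and fails loudly if any element does not yield exactly one word.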

Answer 2

df$QualityWord = sub("(\\w+).*?$", "\\1", df$strings)
df$PaymentWord = sub(".*?(\\w+) payment$", "\\1", df$strings)

df
#>                                     strings QualityWord PaymentWord
#> 1  Average complications and higher payment     Average      higher
#> 2 Average complications and average payment     Average     average
#> 3   Average complications and lower payment     Average       lower
#> 4      Average mortality and higher payment     Average      higher
#> 5      Better mortality and average payment      Better     average

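Explanation of the regex terms:

(\w+) = match a word character one or more times, captured as a group
.*? = match anything, non-greedy
 payment = match a space, then the characters payment
$ = match the end of the string
\1 = replace the whole match with the contents of the first captured group

Inside an R string literal each backslash is doubled, so \w is typed as "\\w" and \1 as "\\1".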
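
The question also asks about the word to the right of a reference word. One way to cover both directions is to use lookaround assertions with `perl = TRUE`. The sketch below is an illustration rather than part of the original answer: the function name `word_next_to` and its arguments are hypothetical, and it assumes `ref` is a plain word with no regex metacharacters.

```r
# Hypothetical helper: return the word immediately left or right of `ref`.
# Assumes `ref` contains no regex metacharacters.
word_next_to <- function(x, ref, side = c("left", "right")) {
  side <- match.arg(side)
  pattern <- if (side == "left") {
    # a word that is followed by whitespace and then the reference word
    paste0("\\w+(?=\\s+", ref, "\\b)")
  } else {
    # a word that is preceded by the reference word and one whitespace character
    paste0("(?<=\\b", ref, "\\s)\\w+")
  }
  m <- regexpr(pattern, x, perl = TRUE)   # first match per string, -1 if none
  out <- rep(NA_character_, length(x))    # NA where the pattern does not match
  out[m != -1] <- regmatches(x, m)
  out
}

x <- c("Average complications and higher payment",
       "payment due now",
       "no match here")
left  <- word_next_to(x, "payment", "left")   # "higher" NA NA
right <- word_next_to(x, "payment", "right")  # NA "due" NA
```

The lookbehind allows exactly one whitespace character after the reference word because PCRE requires lookbehind patterns to have a fixed length.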

Answer 3

We can use extract from tidyr:

library(tidyverse)
df %>%
   extract(strings, into = c("QualityWord", "PaymentWord"),
           "^(\\w+).*\\b(\\w+)\\s+\\w+$", remove = FALSE)
#                                    strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average

string

r

stringr