How to isolate a word next to a specified word

My data frame contains various strings. See the example df:

strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

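I want to pull out the first word of each sentence and the second-to-last word. The second-to-last word is always the one right before the word "payment".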

This is what I want the df to look like:

strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
QualityWord <- c("Average","Average","Average","Average","Better")
PaymentWord <- c("Higher","Average","Lower","Higher","Average")
desireddf <- data.frame(strings, QualityWord, PaymentWord, stringsAsFactors = FALSE)

The case of the resulting strings does not matter.

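I was able to write code that gets the first word of the sentence (by splitting on the space), but I can't figure out how to pull the word to the left (or right, for that matter) of a reference word, which in this case is "payment".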

Using the strsplit, head, and tail functions:

outDF = do.call(rbind, lapply(df$strings, function(x) {

  # split the string into words
  strObj = unlist(strsplit(x, split = " "))

  # one-row data frame: first word and second-to-last word
  data.frame(strings = x,
             QualityWord = head(strObj, 1),
             PaymentWord = head(tail(strObj, 2), 1),
             stringsAsFactors = FALSE)

}))

outDF
#                                    strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average
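
The head/tail approach works here because "payment" is always the last word. As a minimal sketch for the more general case, where the reference word may sit anywhere in the sentence, one could look up its position and take the neighbor. The helper name `word_before` is hypothetical and not part of any package:

```r
# Hypothetical helper: return the word immediately before `ref` in each string.
word_before <- function(x, ref) {
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    pos <- match(ref, words)          # position of the first occurrence of ref
    if (is.na(pos) || pos == 1L) {
      return(NA_character_)           # ref is missing, or nothing precedes it
    }
    words[pos - 1L]
  }, character(1))
}

strings <- c("Average complications and higher payment",
             "Better mortality and average payment")
word_before(strings, "payment")
#> [1] "higher"  "average"
```

Using `pos + 1L` instead would give the word to the right of the reference word, with a similar bounds check against `length(words)`.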

Or:

Using dplyr and a custom function:

library(dplyr)

customFn = function(x) {
  strObj = unlist(strsplit(x, split = " "))
  data.frame(strings = x,
             QualityWord = head(strObj, 1),
             PaymentWord = head(tail(strObj, 2), 1),
             stringsAsFactors = FALSE)
}

df %>%
  dplyr::rowwise() %>%
  dplyr::do(customFn(.$strings))

Using sub with a regular expression:

df$QualityWord = sub("(\\w+).*?$", "\\1", df$strings)
df$PaymentWord = sub(".*?(\\w+) payment$", "\\1", df$strings)

df
#>                                     strings QualityWord PaymentWord
#> 1  Average complications and higher payment     Average      higher
#> 2 Average complications and average payment     Average     average
#> 3   Average complications and lower payment     Average       lower
#> 4      Average mortality and higher payment     Average      higher
#> 5      Better mortality and average payment      Better     average

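Explanation of the regex terms (in an R string literal each backslash must be doubled, so \w is written "\\w" and \1 is written "\\1"):

  • (\w+) = matches one or more word characters and captures them as a group
  • .*? = matches anything, non-greedily
  • " payment" = matches a space followed by the characters payment
  • $ = matches the end of the string
  • \1 = replaces the whole match with the contents of the first group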
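
One caveat with this approach: sub returns its input unchanged when the pattern does not match, so a string without "payment" would land in PaymentWord whole. A minimal sketch that yields NA instead, using base R's regexec and regmatches; the helper name `capture_before` is an assumption made for illustration:

```r
# Hypothetical helper: capture the word before `ref` at the end of each string,
# or NA when the pattern does not match.
# Assumes `ref` contains no regex metacharacters (it is pasted in unescaped).
capture_before <- function(x, ref) {
  pattern <- paste0("(\\w+) ", ref, "$")      # same idea as "(\\w+) payment$"
  hits <- regmatches(x, regexec(pattern, x))  # per string: full match, then group 1
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1))
}

x <- c("Average complications and higher payment", "No reference word here")
capture_before(x, "payment")
#> [1] "higher" NA
```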

We can use extract from tidyr:

library(tidyverse)
df %>%
   extract(strings, into = c("QualityWord", "PaymentWord"),
           "^(\\w+).*\\b(\\w+)\\s+\\w+$", remove = FALSE)
#                                    strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average
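
If adding a tidyverse dependency is not desirable, a roughly equivalent base R sketch is possible with strcapture from the utils package (available since R 3.4.0). The same pattern is reused here; passing perl = TRUE is a choice made for this sketch so that the greedy `.*` and `\\b` behave as in PCRE:

```r
strings <- c("Average complications and higher payment",
             "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

# proto describes the names and types of the captured columns
words <- strcapture("^(\\w+).*\\b(\\w+)\\s+\\w+$", df$strings,
                    proto = data.frame(QualityWord = character(),
                                       PaymentWord = character(),
                                       stringsAsFactors = FALSE),
                    perl = TRUE)

# bind the captured columns next to the original strings
result <- cbind(df, words)
result
```

Rows where the pattern does not match come back as NA in the captured columns.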