Extract names from a string using a list of names with grepl and a loop and add them to a new column in R

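I have a dataset with one column containing names and one column describing what that person did during the day. I am trying to work out who met whom on that day. I created a vector containing the names in the dataset and used grepl in a loop to find where those names appear in the column detailing the activities of the people in the dataset.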

name <- c("Dupont","Dupuy","Smith") 

activity <- c("On that day, he had lunch with Dupuy in London.", 
              "She had lunch with Dupont and then went to Brighton to meet Smith.", 
              "Smith remembers that he was tired on that day.")

met_with <- c("Dupont","Dupuy","Smith")

df <- data.frame(name, activity, met_with = NA)

for (i in seq_along(met_with)) {
  df$met_with <- ifelse(grepl(met_with[i], df$activity), met_with[i], df$met_with)
}

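However, this solution is unsatisfactory for two reasons. First, when a person met more than one other person (Dupuy in my example), I cannot extract more than one name. Second, I cannot tell R not to return the person's own name when the activity column uses it instead of a pronoun (Smith in my example).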

Ideally, I would like df to look like:

  name         activity                                            met_with                             
  Dupont       On that day, he had lunch with Dupuy in London.     Dupuy
  Dupuy        She had lunch with Dupont and then (...).           Dupont Smith
  Smith        Smith remembers that he was tired on that day.      NA

I am cleaning the strings to build an edge list and a node list for a network analysis later on.

Thanks

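You can use setdiff to exclude the name that belongs to the row itself, and gregexpr together with regmatches to extract the matched names. You might also consider putting \b (a word boundary) around the names so that a name is not matched inside a longer word.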

for(i in seq_len(nrow(df))) {
  df$met_with[i] <- paste(regmatches(df$activity[i],
   gregexpr(paste(setdiff(name, df$name[i]), collapse="|"),
   df$activity[i]))[[1]], collapse = " ")
}

df
#    name                                                           activity     met_with
#1 Dupont                    On that day, he had lunch with Dupuy in London.        Dupuy
#2  Dupuy She had lunch with Dupont and then went to Brighton to meet Smith. Dupont Smith
#3  Smith                     Smith remembers that he was tired on that day.             
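
The \b suggestion can be sketched as follows. This is only an illustration, not a drop-in replacement: the helper variables others, pattern and hits are names made up here, perl = TRUE is an assumption chosen so that \b is handled by the PCRE engine, and rows without a match are left as NA (as in the desired output of the question) instead of an empty string.

```r
name <- c("Dupont", "Dupuy", "Smith")
activity <- c("On that day, he had lunch with Dupuy in London.",
              "She had lunch with Dupont and then went to Brighton to meet Smith.",
              "Smith remembers that he was tired on that day.")
df <- data.frame(name, activity, met_with = NA_character_,
                 stringsAsFactors = FALSE)

for (i in seq_len(nrow(df))) {
  # every known name except the one belonging to this row
  others <- setdiff(name, df$name[i])
  # \\b on both sides so that "Smith" is not matched inside e.g. "Smithson"
  pattern <- paste0("\\b(", paste(others, collapse = "|"), ")\\b")
  hits <- unique(regmatches(df$activity[i],
                            gregexpr(pattern, df$activity[i], perl = TRUE))[[1]])
  # keep NA when nobody else is mentioned in the activity
  if (length(hits) > 0) df$met_with[i] <- paste(hits, collapse = " ")
}

df$met_with
```

With the three example rows this leaves "Dupuy", "Dupont Smith" and NA in met_with.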

Another approach, using Reduce, could be:

df$met_with <- Reduce(function(x, y) {
  i <- grepl(y, df$activity, fixed = TRUE) & y != df$name
  x[i] <- lapply(x[i], `c`, y)
  x
}, unique(name), vector("list", nrow(df)))

df
#    name                                                           activity      met_with
#1 Dupont                    On that day, he had lunch with Dupuy in London.         Dupuy
#2  Dupuy She had lunch with Dupont and then went to Brighton to meet Smith. Dupont, Smith
#3  Smith                     Smith remembers that he was tired on that day.          NULL
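
Since the stated goal is an edge list and a node list, a list column like the one above can be flattened in a couple of lines. This is a rough sketch under assumptions: the column names from and to and the objects edges and nodes are invented here for illustration, and the edges are simply read as "the person in name met the person in met_with".

```r
name <- c("Dupont", "Dupuy", "Smith")
activity <- c("On that day, he had lunch with Dupuy in London.",
              "She had lunch with Dupont and then went to Brighton to meet Smith.",
              "Smith remembers that he was tired on that day.")
df <- data.frame(name, activity, stringsAsFactors = FALSE)

# same Reduce step as above: one list element per row, NULL when nobody was met
df$met_with <- Reduce(function(x, y) {
  i <- grepl(y, df$activity, fixed = TRUE) & y != df$name
  x[i] <- lapply(x[i], `c`, y)
  x
}, unique(name), vector("list", nrow(df)))

# one row per (person, person met) pair; rows with NULL contribute no edge
edges <- data.frame(from = rep(df$name, lengths(df$met_with)),
                    to   = unlist(df$met_with),
                    stringsAsFactors = FALSE)

# node list: everyone who appears in the data, whether or not they met anyone
nodes <- data.frame(name = unique(c(df$name, edges$to)),
                    stringsAsFactors = FALSE)

edges
nodes
```

Smith has no outgoing edge here but still appears in nodes, which is usually what a network package expects for isolated vertices.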

Same logic as @Gki's answer, but using stringr functions and mapply instead of a loop.

library(stringr)

# the backslash must be doubled: '\b' on its own is a backspace character in R
pat <- str_c('\\b', df$name, '\\b', collapse = '|')
df$met_with <- mapply(function(x, y) str_c(setdiff(x, y), collapse = ' '), 
       str_extract_all(df$activity, pat), df$name)

df

#    name                                                           activity
#1 Dupont                    On that day, he had lunch with Dupuy in London.
#2  Dupuy She had lunch with Dupont and then went to Brighton to meet Smith.
#3  Smith                     Smith remembers that he was tired on that day.

#      met_with
#1        Dupuy
#2 Dupont Smith
#3
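
One detail that is easy to get wrong when building such a pattern: in an R string literal the word boundary has to be written with a doubled backslash. A small base R check, where names_vec, good and bad are illustrative names and perl = TRUE is an assumption made only to pin down the regex engine:

```r
# "\b" is a single backspace character, "\\b" is backslash + b (regex word boundary)
nchar("\b")    # 1
nchar("\\b")   # 2

names_vec <- c("Dupont", "Dupuy", "Smith")
good <- paste0("\\b", names_vec, "\\b", collapse = "|")
bad  <- paste0("\b",  names_vec, "\b",  collapse = "|")

grepl(good, "He met Smith.", perl = TRUE)     # TRUE: whole-word match
grepl(good, "He met Smithson.", perl = TRUE)  # FALSE: "Smith" is only a prefix here
grepl(bad,  "He met Smith.", perl = TRUE)     # FALSE: the text contains no backspace characters
```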