Replace text in one data-frame using a look up to another data-frame

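My task is to search text and replace people's names and nicknames with a generic string.

This is the structure of my data frame of names and corresponding nicknames: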

names <- c("Thomas","Thomas","Abigail","Abigail","Abigail")
nicknames <- c("Tom","Tommy","Abi","Abby","Abbey")
df_name_nick <- data.frame(names,nicknames)

This is the structure of my data frame containing the text:

text_names <- c("Abigail","Thomas","Abigail","Thomas","Colin")
text_comment <- c("Tommy sits next to Abbey","As a footballer Tommy is very good","Abby is a mature young lady","Tom is a handsome man","Tom is friends with Colin and Abi")
df_name_comment <- data.frame(text_names,text_comment)

This gives these data frames:

df_name_nick:

    names nicknames
1  Thomas       Tom
2  Thomas     Tommy
3 Abigail       Abi
4 Abigail      Abby
5 Abigail     Abbey

df_name_comment:

  text_names                       text_comment
1    Abigail           Tommy sits next to Abbey
2     Thomas As a footballer Tommy is very good
3    Abigail        Abby is a mature young lady
4     Thomas              Tom is a handsome man
5      Colin  Tom is friends with Colin and Abi

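I am looking for a routine that goes through each row of df_name_comment, uses df_name_comment$text_names to look up the corresponding nicknames in df_name_nick, and replaces them with XXX. Note that each name can have several nicknames. Note also that in each text comment only the nicknames belonging to that row's name are replaced, so the output would be: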

Abigail "Tommy sits next to XXX"
Thomas  "As a footballer XXX is very good"
Abigail "XXX is a mature young lady"
Thomas  "XXX is a handsome man"
Colin   "Tom is friends with Colin and Abi"

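I think this calls for a clever combination of gsub, match, and the apply functions (mapply, sapply, etc.).

I have searched Stack Overflow for something similar to this request, but I could only find very specific regex solutions based on data frames with unique row elements, rather than something that would work for a generic text lookup and gsub with multiple nicknames.

Can anyone help me out of my predicament? Thanks.

Neville (a novice R programmer since January 2017)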

Data

df_name_nick <- data.frame(names, nicknames, stringsAsFactors = FALSE)
df_name_comment <- data.frame(text_names, text_comment, stringsAsFactors = FALSE)

Solution 2

EDIT: In my initial solution I manually checked with grepl whether a nickname was present, and then gsubbed with one of the matching IDs. I knew the '|' operator worked with grepl, but I did not know it also works with gsub. So credit to Sotos for that idea.

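df <- df_name_comment
for (i in seq_len(nrow(df)))
{
  # nicknames that belong to the name in this row
  matching_nicknames <- df_name_nick$nicknames[df_name_nick$names == df$text_names[i]]
  if (length(matching_nicknames) > 0)
  {
    # "\\b" is a word boundary (a single "\b" in an R string is a backspace)
    pattern <- paste(paste0("\\b", matching_nicknames, "\\b"), collapse = "|")
    df$text_comment[i] <- gsub(pattern, "XXX", df$text_comment[i])
  }
}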

Output

  text_names                      text_comment
1    Abigail            Tommy sits next to XXX
2     Thomas  As a footballer XXX is very good
3    Abigail        XXX is a mature young lady
4     Thomas             XXX is a handsome man
5      Colin Tom is friends with Colin and Abi
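The word boundaries in the pattern matter because some nicknames are prefixes of others ("Tom" and "Tommy"). A minimal sketch of the difference, using a made-up sentence rather than the question's data:

```r
# Made-up sentence for illustration; it is not part of the question's data.
s <- "Tommy and Tom"

# Without word boundaries, "Tom" also matches the start of "Tommy".
no_boundary <- gsub("Tom", "XXX", s)

# With "\\b" on both sides, only the standalone word "Tom" is replaced.
with_boundary <- gsub("\\bTom\\b", "XXX", s)

print(no_boundary)    # "XXXmy and XXX"
print(with_boundary)  # "Tommy and XXX"
```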

Hope this helps!

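Here is an idea based on base R. We basically paste the nicknames for each name together, collapsed by |, so that the result can be passed to gsub as a regular expression, and replace the matched words in each comment with XXX. We use mapply to do this after merging the aggregated nicknames with df_name_comment.
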
d1 <- aggregate(nicknames ~ names, df_name_nick, paste, collapse = '|')
d2 <- merge(df_name_comment, d1, by.x = 'text_names', by.y = 'names', all = TRUE)
d2$nicknames[is.na(d2$nicknames)] <- 0
d2$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y), d2$nicknames, d2$text_comment)
d2$nicknames <- NULL
d2

This gives,

  text_names                      text_comment
1    Abigail            Tommy sits next to XXX
2    Abigail        XXX is a mature young lady
3      Colin Tom is friends with Colin and Abi
4     Thomas  As a footballer XXX is very good
5     Thomas             XXX is a handsome man
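To make the aggregation step easier to follow, here is a small sketch that rebuilds only the lookup table from the question and inspects the collapsed patterns. It assumes character columns (stringsAsFactors = FALSE).

```r
# Lookup table from the question, rebuilt so the sketch runs on its own.
names <- c("Thomas", "Thomas", "Abigail", "Abigail", "Abigail")
nicknames <- c("Tom", "Tommy", "Abi", "Abby", "Abbey")
df_name_nick <- data.frame(names, nicknames, stringsAsFactors = FALSE)

# One row per name; that name's nicknames are collapsed into a single regex.
d1 <- aggregate(nicknames ~ names, df_name_nick, paste, collapse = "|")

print(d1$names)      # "Abigail" "Thomas"
print(d1$nicknames)  # "Abi|Abby|Abbey" "Tom|Tommy"
```

Each entry of d1$nicknames is an alternation, so a single gsub call per comment covers all nicknames of that row's name.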

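Note 1: The NA values in nicknames are replaced with 0 because NA (which merge fills in by default for unmatched elements) would also turn the comment string into NA when passed to gsub.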
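The NA behaviour of gsub can be checked directly. The sketch below also shows one possible alternative to the 0 placeholder; it is a variation that is not part of the answer, and the two comments are made up. Only rows that actually have a pattern are passed to gsub, so a comment that happens to contain the character 0 is left alone.

```r
# An NA pattern makes gsub return NA instead of the original string.
res_na <- gsub(NA_character_, "XXX", "Tom is friends with Colin and Abi")
print(is.na(res_na))  # TRUE

# Hypothetical alternative to the "0" placeholder: skip rows without a pattern.
patterns <- c("Tom|Tommy", NA)   # NA stands for a name with no nicknames
comments <- c("Tom is a handsome man", "Colin scored 10 goals")  # made-up data

has_pattern <- !is.na(patterns)
comments[has_pattern] <- mapply(function(p, txt) gsub(p, "XXX", txt),
                                patterns[has_pattern], comments[has_pattern])

print(comments)  # "XXX is a handsome man" "Colin scored 10 goals"
```

With the 0 placeholder, the second comment would become "Colin scored 1XXX goals", which is why skipping those rows may be the safer choice when comments can contain digits.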

Note 2: Because of merge, the row order changes as well, but you can sort it back in the usual way.
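One way to get the original order back after merge is to carry an index column through it. This is only a sketch: row_id is a hypothetical helper column, d1 is typed in by hand to keep the example short, and all.x = TRUE is used here simply to keep every comment row.

```r
# Comment table from the question, rebuilt so the sketch runs on its own.
text_names <- c("Abigail", "Thomas", "Abigail", "Thomas", "Colin")
text_comment <- c("Tommy sits next to Abbey",
                  "As a footballer Tommy is very good",
                  "Abby is a mature young lady",
                  "Tom is a handsome man",
                  "Tom is friends with Colin and Abi")
df_name_comment <- data.frame(text_names, text_comment, stringsAsFactors = FALSE)

# Aggregated nicknames, written out by hand for brevity.
d1 <- data.frame(names = c("Abigail", "Thomas"),
                 nicknames = c("Abi|Abby|Abbey", "Tom|Tommy"),
                 stringsAsFactors = FALSE)

# row_id (hypothetical helper column) remembers each row's original position.
df_name_comment$row_id <- seq_len(nrow(df_name_comment))
d2 <- merge(df_name_comment, d1, by.x = "text_names", by.y = "names", all.x = TRUE)

# Sort back to the original order and drop the helper column.
d2 <- d2[order(d2$row_id), ]
d2$row_id <- NULL
rownames(d2) <- NULL

print(d2$text_names)  # "Abigail" "Thomas" "Abigail" "Thomas" "Colin"
```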

Note 3: It is better to have the variables as characters rather than factors. So either read the data frames with stringsAsFactors = FALSE or convert them via

df_name_comment[] <- lapply(df_name_comment, as.character)
df_name_nick[] <- lapply(df_name_nick, as.character)
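A quick way to see what the conversion does is to check the column class before and after. The sketch uses a throwaway data frame (df_demo is a made-up name) and forces factors explicitly, because since R 4.0.0 data.frame defaults to stringsAsFactors = FALSE.

```r
# Throwaway data frame with a factor column, forced explicitly.
df_demo <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
class_before <- class(df_demo$x)

# Same conversion idiom as above: df[] keeps the data frame structure.
df_demo[] <- lapply(df_demo, as.character)
class_after <- class(df_demo$x)

print(class_before)  # "factor"
print(class_after)   # "character"
```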

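EDIT

Based on your comment, we can simply match the names of the comments against our aggregated data set, save the result in a vector, and use mapply directly on the original data frame, without merging and then dropping the variable, i.e.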

# d1 as created above
v1 <- d1$nicknames[match(df_name_comment$text_names, d1$names)]
v1[is.na(v1)] <- 0

df_name_comment$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y),
                                       v1, df_name_comment$text_comment)
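For reference, the match-based approach can be run end to end as a self-contained sketch. The function name mask_nicknames and its arguments are hypothetical, and it differs from the code above in two ways that are assumptions on my part: each nickname is wrapped in word boundaries, and rows without nicknames are skipped rather than given a 0 pattern.

```r
# Question's data, rebuilt so the sketch runs on its own.
df_name_nick <- data.frame(
  names = c("Thomas", "Thomas", "Abigail", "Abigail", "Abigail"),
  nicknames = c("Tom", "Tommy", "Abi", "Abby", "Abbey"),
  stringsAsFactors = FALSE
)
df_name_comment <- data.frame(
  text_names = c("Abigail", "Thomas", "Abigail", "Thomas", "Colin"),
  text_comment = c("Tommy sits next to Abbey",
                   "As a footballer Tommy is very good",
                   "Abby is a mature young lady",
                   "Tom is a handsome man",
                   "Tom is friends with Colin and Abi"),
  stringsAsFactors = FALSE
)

# mask_nicknames is a hypothetical helper name, not taken from the answer.
mask_nicknames <- function(comments, lookup, replacement = "XXX") {
  # Wrap every nickname in word boundaries, then build one regex per name.
  lookup$nicknames <- paste0("\\b", lookup$nicknames, "\\b")
  d1 <- aggregate(nicknames ~ names, lookup, paste, collapse = "|")
  # Pattern for each comment row; NA when the name has no nicknames.
  v1 <- d1$nicknames[match(comments$text_names, d1$names)]
  out <- comments$text_comment
  hit <- !is.na(v1)
  out[hit] <- mapply(function(p, txt) gsub(p, replacement, txt),
                     v1[hit], out[hit])
  out
}

masked <- mask_nicknames(df_name_comment, df_name_nick)
print(masked)
```

The result keeps the original row order, since nothing is merged, and the Colin row passes through unchanged.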

Hope this helps!

# For each row, collapse that name's nicknames into one regex and replace them
l <- apply(df_name_comment, 1, function(x) {
  nicks <- df_name_nick[df_name_nick$names == x["text_names"], "nicknames"]
  if (length(nicks) > 0)
    gsub(paste(nicks, collapse = "|"), 'XXX', x["text_comment"])
  else
    x["text_comment"]
})
# store the result as a plain character column
df_name_comment$text_comment <- unname(l)


If it solves your problem, don't forget to let us know :)