Extract names from a string using a list of names with grepl and a loop, and add them to a new column in R
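I have a dataset with one column containing names and one column describing what each person did during the day. I am trying to figure out who in my dataset met with whom on that day. I created a vector containing the names in my dataset and used grepl in a loop to find where those names appear in the column describing each person's activity.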
name <- c("Dupont", "Dupuy", "Smith")
activity <- c("On that day, he had lunch with Dupuy in London.",
              "She had lunch with Dupont and then went to Brighton to meet Smith.",
              "Smith remembers that he was tired on that day.")
met_with <- c("Dupont", "Dupuy", "Smith")
df <- data.frame(name, activity, met_with = NA)
for (i in 1:length(met_with)) {
  df$met_with <- ifelse(grepl(met_with[i], df$activity), met_with[i], df$met_with)
}
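However, this solution is unsatisfactory for two reasons. First, I cannot extract more than one name when a person met more than one other person (Dupuy in my example). Second, I cannot tell R not to return a person's own name when the activity column uses that name instead of a pronoun (Smith, for example).
Ideally, I would like df to look like: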
name    activity                                          met_with
Dupont  On that day, he had lunch with Dupuy in London.   Dupuy
Dupuy   She had lunch with Dupont and then (...).         Dupont Smith
Smith   Smith remembers that he was tired on that day.    NA
I am cleaning the strings in order to build an edge list and a node list for a network analysis later on.
Thanks
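You can use setdiff to exclude the name belonging to the current row, and gregexpr together with regmatches to extract the matched names. You may also want to put \b (word boundary) around the names.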
for (i in seq_len(nrow(df))) {
  df$met_with[i] <- paste(regmatches(df$activity[i],
    gregexpr(paste(setdiff(name, df$name[i]), collapse = "|"),
             df$activity[i]))[[1]], collapse = " ")
}
df
# name activity met_with
#1 Dupont On that day, he had lunch with Dupuy in London. Dupuy
#2 Dupuy She had lunch with Dupont and then went to Brighton to meet Smith. Dupont Smith
#3 Smith Smith remembers that he was tired on that day.
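The \b suggestion above is not shown in the loop itself, so here is a minimal sketch of how it might look. It assumes the sample data from the question; the helper variables others, pattern and hits are names introduced only for this illustration. It also stores NA instead of an empty string when nobody else is mentioned, which is closer to the output the question asks for.

```r
# Sample data copied from the question
name <- c("Dupont", "Dupuy", "Smith")
activity <- c("On that day, he had lunch with Dupuy in London.",
              "She had lunch with Dupont and then went to Brighton to meet Smith.",
              "Smith remembers that he was tired on that day.")
df <- data.frame(name, activity, met_with = NA_character_,
                 stringsAsFactors = FALSE)

for (i in seq_len(nrow(df))) {
  # All names except the one belonging to the current row
  others <- setdiff(name, df$name[i])
  # \\b on both sides so that "Smith" would not match inside "Smithson"
  pattern <- paste0("\\b(", paste(others, collapse = "|"), ")\\b")
  hits <- regmatches(df$activity[i], gregexpr(pattern, df$activity[i]))[[1]]
  # NA when no other name is found, otherwise the names separated by spaces
  df$met_with[i] <- if (length(hits) > 0) {
    paste(unique(hits), collapse = " ")
  } else {
    NA_character_
  }
}
df$met_with
```

In an R string the backslash has to be doubled, so the regex \b is written "\\b".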
Another approach, using Reduce, could be:
df$met_with <- Reduce(function(x, y) {
  i <- grepl(y, df$activity, fixed = TRUE) & y != df$name
  x[i] <- lapply(x[i], `c`, y)
  x
}, unique(name), vector("list", nrow(df)))
df
# name activity met_with
#1 Dupont On that day, he had lunch with Dupuy in London. Dupuy
#2 Dupuy She had lunch with Dupont and then went to Brighton to meet Smith. Dupont, Smith
#3 Smith Smith remembers that he was tired on that day. NULL
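The Reduce approach yields a list column, with NULL where nobody was met. If a plain character column with NA is preferred, as in the desired output of the question, the list could be flattened afterwards. The following is only a sketch under that assumption; met_list is a name introduced here for illustration.

```r
# Sample data copied from the question
name <- c("Dupont", "Dupuy", "Smith")
activity <- c("On that day, he had lunch with Dupuy in London.",
              "She had lunch with Dupont and then went to Brighton to meet Smith.",
              "Smith remembers that he was tired on that day.")
df <- data.frame(name, activity, stringsAsFactors = FALSE)

# Same Reduce call as above, kept in a separate list for clarity
met_list <- Reduce(function(x, y) {
  i <- grepl(y, df$activity, fixed = TRUE) & y != df$name
  x[i] <- lapply(x[i], `c`, y)
  x
}, unique(name), vector("list", nrow(df)))

# Collapse each list element to one string; NULL becomes NA
df$met_with <- vapply(met_list, function(v) {
  if (is.null(v)) NA_character_ else paste(v, collapse = " ")
}, character(1))
df$met_with
```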
Same logic as @Gki, but using stringr functions and mapply instead of a loop.
library(stringr)
# The backslash must be doubled: '\b' alone is a backspace character in R
pat <- str_c('\\b', df$name, '\\b', collapse = '|')
df$met_with <- mapply(function(x, y) str_c(setdiff(x, y), collapse = ' '),
                      str_extract_all(df$activity, pat), df$name)
df
# name activity
#1 Dupont On that day, he had lunch with Dupuy in London.
#2 Dupuy She had lunch with Dupont and then went to Brighton to meet Smith.
#3 Smith Smith remembers that he was tired on that day.
# met_with
#1 Dupuy
#2 Dupont Smith
#3
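Since the question mentions building an edge list and a node list afterwards, here is one possible sketch of that step in base R. It assumes met_with holds space-separated names (as produced above) and that no name contains a space itself; targets, edges and nodes are names introduced for this illustration only.

```r
# Result of the extraction step, written out by hand for a self-contained example
df <- data.frame(
  name = c("Dupont", "Dupuy", "Smith"),
  met_with = c("Dupuy", "Dupont Smith", ""),
  stringsAsFactors = FALSE
)

# One character vector of met names per row (empty for rows without a meeting)
targets <- strsplit(df$met_with, " ", fixed = TRUE)

# Repeat each person once per name they met to get one row per edge
edges <- data.frame(
  from = rep(df$name, lengths(targets)),
  to = unlist(targets),
  stringsAsFactors = FALSE
)

# Node list: every name that appears in either column
nodes <- data.frame(name = unique(c(df$name, edges$to)),
                    stringsAsFactors = FALSE)
edges
```

Rows with an empty met_with simply contribute no edges, so Smith appears in nodes but not in the from column.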