How to compare a data frame to a list, and return the values in the data frame that match the list?
Total newbie R question. I have a data frame df of IDs and notes:
ID Notes
1 dogs are friendly
2 dogs and cats are pets
3 cows live on farms
4 cats and cows start with c
I also have another list of values, "animals":
cats
cows
I'd like to add another column, "Matches", to my data frame that contains all the animals mentioned in the notes, e.g.
ID Notes Matches
1 dogs are friendly
2 dogs and cats are pets cats
3 cows live on farms cows
4 cats and cows start with c cats, cows
So far, the only luck I've had is using grepl to return whether there are any matches at all:
grepl(paste(animals,collapse="|"),df$Notes,ignore.case = T)
How can I return the matched values instead?
Update
Some rows in my data frame contain multiple instances of cats in the notes, for example:
ID Notes Matches
1 dogs are friendly
2 dogs and cats are pets cats
3 cows live on farms cows
4 cats and cats cows start with c cats, cows
I'd only like to return one instance of each match. @LachlanO got me very close with his solution, but I get:
[1] "NA, NA" "cats, NA" "NA, cows" "c(\"cats\", \"cats\"), cows"
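How can I return only the distinct matches?
Edit: added a unique operation to handle duplicate matches.
I can get you started and then point you in the right direction :)
The code below uses stringr::str_extract_all to extract the relevant pieces, but unfortunately it also leaves us with bits we don't need, most obviously when a note has no match at all. The unique call in the middle of our custom function just makes sure we keep unique matches, element by element.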
ID = seq(1,4)
Notes <- c(
"dogs are friendly",
"dogs and cats are pets",
"cows live on farms",
"cats and cows start with c "
)
df <- data.frame(ID, Notes)
animals = c("cats", "cows")
matches <- as.data.frame(sapply(animals, function(x){sapply(stringr::str_extract_all(df$Notes, x), unique)}, simplify = TRUE))
matches[matches == "character(0)"] <- NA
apply(matches, 1, paste, collapse = ", ")
[1] "NA, NA" "cats, NA" "NA, cows" "cats, cows"
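You could set this as your extra column, but it isn't very nice because of those NAs. If there were a paste function that ignored NAs, we'd be set.
Luckily another user has already solved that problem :) Check out this answer here.
Combining the above should give you a suitable solution!
I would do it like this: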
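One way the pieces might fit together, as a minimal sketch rather than the linked answer's code: build one column per animal, then drop the NA entries of each row before pasting. `paste_no_na` is a hypothetical helper name, and the per-animal lookup here swaps in base R's `grepl` in place of `str_extract_all`, which also means repeated mentions cannot produce duplicates.

```r
# Example data from the question's update (note the repeated "cats" in row 4)
df <- data.frame(
  ID = 1:4,
  Notes = c("dogs are friendly",
            "dogs and cats are pets",
            "cows live on farms",
            "cats and cats cows start with c"),
  stringsAsFactors = FALSE
)
animals <- c("cats", "cows")

# One column per animal: the animal's name where it occurs, NA otherwise.
# grepl() only reports presence, so each animal appears at most once per row.
hits <- sapply(animals, function(x) {
  ifelse(grepl(x, df$Notes, fixed = TRUE), x, NA_character_)
})

# Hypothetical helper: paste only the non-NA entries of a character vector
paste_no_na <- function(x, collapse = ", ") {
  paste(x[!is.na(x)], collapse = collapse)
}

# Apply the helper row by row to get one string per note
df$Matches <- apply(hits, 1, paste_no_na)
df$Matches
```

Rows without any match end up as an empty string rather than "NA, NA".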
animals = c("cats", "cows")
reg = paste(animals, collapse = "|")
library(stringr)
matches = str_extract_all(df$Notes, reg)
matches = lapply(matches, unique)
matches = sapply(matches, paste, collapse = ",")
df$matches = matches
df
# ID Notes matches
# 1 1 dogs are friendly
# 2 2 dogs and cats are pets cats
# 3 3 cows live on farms cows
# 4 4 cats and cows start with c cats,cows
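If you want to get fancy, you can paste word boundaries onto the regex, e.g. reg = paste0("\\b", animals, "\\b", collapse = "|"), to avoid extracting matches from the middle of a word.
Using the data provided by LachlanO: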
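To illustrate what the word boundaries change, here is a small sketch. The note text is a made-up example, not data from the question, and it uses base R's `regmatches` and `gregexpr` so it runs without stringr; the same pattern string should also work with `str_extract_all`.

```r
animals <- c("cats", "cows")
note <- "wildcats chase cows"   # hypothetical note, not from the question

# Plain alternation: matches "cats" even inside "wildcats"
plain <- paste(animals, collapse = "|")

# Word boundaries around each animal: only whole words match.
# paste0() is used so that no spaces are inserted into the pattern.
bounded <- paste0("\\b", animals, "\\b", collapse = "|")

plain_hits   <- regmatches(note, gregexpr(plain, note, perl = TRUE))[[1]]
bounded_hits <- regmatches(note, gregexpr(bounded, note, perl = TRUE))[[1]]

plain_hits    # "cats" "cows"
bounded_hits  # "cows"
```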
ID = seq(1,4)
Notes <- c(
"dogs are friendly",
"dogs and cats are pets",
"cows live on farms",
"cats and cows start with c "
)
df <- data.frame(ID, Notes)
You can use gsub to get all the animals at once:
gsub(".*?(cows|cats )|.*","\\1",do.call(paste,df),perl = TRUE)
[1] "" "cats " "cows" "cats cows"
So, to write it in one pass:
transform(df,matches=gsub(".*?(cows|cats )|.*","\\1",do.call(paste,df),perl = TRUE))
ID Notes matches
1 1 dogs are friendly
2 2 dogs and cats are pets cats
3 3 cows live on farms cows
4 4 cats and cows start with c cats cows
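Note that a pattern like this is hard-coded and, as far as I can tell, would keep repeated matches such as the "cats and cats cows" row from the question's update. A base-R sketch that builds the pattern from the `animals` vector and deduplicates each row's matches, assuming the updated data, might look like this:

```r
# Data from the question's update (row 4 mentions "cats" twice)
df <- data.frame(
  ID = 1:4,
  Notes = c("dogs are friendly",
            "dogs and cats are pets",
            "cows live on farms",
            "cats and cats cows start with c"),
  stringsAsFactors = FALSE
)
animals <- c("cats", "cows")

# Build the alternation pattern from the vector instead of typing it out
reg <- paste(animals, collapse = "|")

# gregexpr() locates every match; regmatches() extracts them as a list,
# with character(0) for notes that contain no animal
found <- regmatches(df$Notes, gregexpr(reg, df$Notes))

# Keep each animal once per row, then collapse to a single string
df$matches <- vapply(found,
                     function(m) paste(unique(m), collapse = ", "),
                     character(1))
df$matches
```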