R: Extract matching string in dataframe column

Question

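I have a data frame and a set of keywords. I want to add a new column to the data frame containing the strings that match any of the keywords, and a second column containing the keywords that do not match.

keyword <- c('yellow','blue','red','green','purple')

My data frame: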

colour	id
blue	A234
blue,black	A5
yellow	A6
blue,green,purple	A7

What I would like to get is a data frame like this:

colour	id	match	non-match
blue	A234	blue	yellow,red,green,purple
blue,green	A5	blue,green	yellow,red,purple
yellow	A6	yellow	blue,red,green,purple
blue,green,purple	A7	blue,green,purple	yellow,red

I tried this to get the match column:

df %>% mutate(match = str_extract(paste(keyword,collapse="|"), tolower(colour)))

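But it only works for the first and third rows, not for the second and fourth. Any help with this, and with getting a column of the non-matching strings, would be appreciated.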

Answer 1

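Here is a base R solution. We can use apply in row mode (MARGIN = 1) and split each comma-separated colour string into a vector. Then %in% identifies which keywords are absent from that row, and those become the non-matching colours.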

df$match <- df$colour
df$non_match <- apply(df, 1, function(x) {
    paste(keyword[!keyword %in% strsplit(x[1], ",", fixed=TRUE)[[1]]], collapse=",")
})
df

             colour   id             match               non_match
1              blue A234              blue yellow,red,green,purple
2        blue,green   A5        blue,green       yellow,red,purple
3            yellow   A6            yellow   blue,red,green,purple
4 blue,green,purple   A7 blue,green,purple              yellow,red

Data:

keyword <- c('yellow','blue','red','green','purple')
df <- data.frame(colour=c("blue", "blue,green", "yellow", "blue,green,purple"),
                 id=c("A234", "A5", "A6", "A7"), stringsAsFactors=FALSE)
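
Note that df$match <- df$colour above simply copies the colour column, which agrees with the expected output only when every colour in a row is one of the keywords. A minimal base R sketch, assuming rows may also contain colours outside keyword (such as "black" in the question's data), could derive both columns with intersect and setdiff instead:

```r
keyword <- c("yellow", "blue", "red", "green", "purple")
# Data as given in the question (row 2 contains the non-keyword "black").
df <- data.frame(colour = c("blue", "blue,black", "yellow", "blue,green,purple"),
                 id = c("A234", "A5", "A6", "A7"), stringsAsFactors = FALSE)

# Split each comma-separated string into a vector of colours,
# tolerating optional whitespace after the commas.
parts <- strsplit(tolower(df$colour), ",\\s*")

# intersect() keeps the keywords present in a row; setdiff() keeps the rest.
df$match <- vapply(parts, function(p) paste(intersect(keyword, p), collapse = ","),
                   character(1))
df$non_match <- vapply(parts, function(p) paste(setdiff(keyword, p), collapse = ","),
                       character(1))
df
```

Because keyword is the first argument to intersect and setdiff, both columns list colours in keyword order rather than in the order they appear in each row.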

Answer 2

Use separate_rows to put each comma-separated colour on its own row. Then, for each id, you can find match with intersect and non_match with setdiff.

library(dplyr)
keyword <- c('yellow','blue','red','green','purple')

df %>%
  tidyr::separate_rows(colour, sep = ',\\s*') %>%
  group_by(id) %>%
  summarise(match = toString(intersect(keyword, colour)), 
            non_match = toString(setdiff(keyword, colour)), 
            colour = toString(colour))

#  id    match               non_match                  colour             
#* <chr> <chr>               <chr>                      <chr>              
#1 A234  blue                yellow, red, green, purple blue               
#2 A5    blue                yellow, red, green, purple blue, black        
#3 A6    yellow              blue, red, green, purple   yellow             
#4 A7    blue, green, purple yellow, red                blue, green, purple

Data:

df <- structure(list(colour =c("blue","blue,black", "yellow", "blue,green,purple"
), id = c("A234", "A5", "A6", "A7")),class = "data.frame",row.names = c(NA, -4L))
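
As for why the attempt in the question fails: str_extract takes the string first and the pattern second, so that call searches for each colour value inside the keyword string, and a value such as "blue,black" does not occur there. It also returns at most one match per string. A minimal sketch of a regex-based alternative in base R, using gregexpr and regmatches to collect every keyword found in each row (stringr::str_extract_all plays the same role in stringr); the variable names here are illustrative:

```r
keyword <- c("yellow", "blue", "red", "green", "purple")
colour <- c("blue", "blue,black", "yellow", "blue,green,purple")

# One alternation pattern; \\b stops "red" from matching inside a longer word.
pattern <- paste0("\\b(", paste(keyword, collapse = "|"), ")\\b")

# gregexpr() locates every match per string; regmatches() extracts them as a list.
found <- regmatches(tolower(colour), gregexpr(pattern, tolower(colour), perl = TRUE))

# Collapse each row's matches, and the keywords that were not found, into strings.
matched <- vapply(found, paste, character(1), collapse = ",")
non_matched <- vapply(found, function(m) paste(setdiff(keyword, m), collapse = ","),
                      character(1))
data.frame(colour, match = matched, non_match = non_matched)
```

Here the matched colours keep the order in which they appear in each row, while the non-matching ones follow the order of keyword.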
