Paste only part of a vector according to a T/F column

Question

Suppose I have a data frame of species together with the colors they display.

df<-data.frame(name=paste("spec",1:5),
         ind=c("blue;green","red","green","red;green;blue",""))

Some of these colors (blue and red) are actually meaningful. I can grepl() them to get a T/F column:

df$isredorblue<-grepl("blue|red",df$ind)

But now I want to know which of the meaningful colors appear in the column.

The desired result is:

> df
    name            ind isredorblue searchcolor
1 spec 1     blue;green        TRUE        blue
2 spec 2            red        TRUE         red
3 spec 3          green       FALSE       other
4 spec 4 red;green;blue        TRUE    red;blue
5 spec 5                      FALSE       other

I have tried gsub with [^]+, but that doesn't really work, because it matches individual letters, so "r" or "e" or "d" rather than "red"...

> gsub("[^red]+","",df$ind)
[1] "eree"    "red"     "ree"     "redreee" ""

Now I am thinking of using strsplit... but I can't seem to work out my next step

blabla<-strsplit(df$ind, split=";")
blabla<-blabla[-which(!blabla %in% c("red","blue"))]
> blabla
[[1]]
[1] "red"

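Please keep in mind that this is a reprex; my actual data frame is much larger, and different indicator "colors" matter for different things, so I need to be able to generate these columns in as few steps as possible.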

Answer 1

Here are two approaches.

Using regex -

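This builds a regex pattern from color and uses it to extract matches from the ind column of the data. If nothing is extracted, we replace the blank with 'other'.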

color <- c('red', 'blue')
pat <- paste0(color, collapse = '|')
df$is_color_present <- grepl(pat, df$ind)
df$searchcolor <- sapply(stringr::str_extract_all(df$ind, pat), paste0, collapse = ';')
df$searchcolor[df$searchcolor == ''] <- 'other'
df
#    name            ind is_color_present searchcolor
#1 spec 1     blue;green             TRUE        blue
#2 spec 2            red             TRUE         red
#3 spec 3          green            FALSE       other
#4 spec 4 red;green;blue             TRUE    red;blue
#5 spec 5                           FALSE       other
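
If stringr is not available, roughly the same extraction can be sketched in base R with gregexpr() and regmatches(). This is only an illustrative variant under the same assumptions as above (the color vector and the ';' separator), not part of the original answer:

```r
# Rebuild the example data so the sketch is self-contained
df <- data.frame(name = paste("spec", 1:5),
                 ind = c("blue;green", "red", "green", "red;green;blue", ""),
                 stringsAsFactors = FALSE)

color <- c("red", "blue")
pat <- paste0(color, collapse = "|")

# gregexpr() locates every match per element; regmatches() returns them as a list
hits <- regmatches(df$ind, gregexpr(pat, df$ind))

# Collapse each element's matches with ';' (no matches collapses to "")
df$searchcolor <- vapply(hits, paste0, character(1), collapse = ";")
df$searchcolor[df$searchcolor == ""] <- "other"
df$searchcolor
```

vapply() is used instead of sapply() so the result is guaranteed to be a character vector even when the input is empty.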

Without regex, using tidyverse -

We split the data on ; into long format and keep only those values that are present in color.

library(dplyr)
library(tidyr)

df %>%
  separate_rows(ind, sep = ';') %>%
  group_by(name) %>%
  summarise(is_color_present = any(ind %in% color), 
            searchcolor = paste0(ind[ind %in% color], collapse = ';'), 
            searchcolor = replace(searchcolor, searchcolor == '', 'other'))
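
Since the question mentions needing such columns for several different indicators, the same split-and-filter idea could be wrapped in a small base R helper built on strsplit(). The function name keep_targets and its arguments are hypothetical, chosen here for illustration only:

```r
# Hypothetical helper (name and arguments are illustrative, not from the answer):
# keep only the entries of a delimited string that appear in `targets`.
keep_targets <- function(x, targets, sep = ";", fallback = "other") {
  # Split each string on the literal separator
  parts <- strsplit(as.character(x), split = sep, fixed = TRUE)
  # Keep the pieces found in `targets`, in their original order, and re-join them
  out <- vapply(parts,
                function(p) paste0(p[p %in% targets], collapse = sep),
                character(1))
  # Rows with no matching piece get the fallback label
  out[out == ""] <- fallback
  out
}

df <- data.frame(name = paste("spec", 1:5),
                 ind = c("blue;green", "red", "green", "red;green;blue", ""),
                 stringsAsFactors = FALSE)

# One call per indicator column
df$searchcolor <- keep_targets(df$ind, c("red", "blue"))
df$greenonly   <- keep_targets(df$ind, "green")
df
```

Because this compares whole tokens with %in% rather than using a regex, a value such as "darkred" would not be counted as "red".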

Answer 2

Here is a concise solution:

library(dplyr)
library(stringr)

First define all target colors as a vector:

targets <- c('red', 'blue')

Now use the vector, converted into a regex alternation pattern, to extract the desired colors into a new column:

df %>%
   mutate(colors = str_extract_all(ind, paste0(targets, collapse = "|")))
    name            ind    colors
1 spec 1     blue;green      blue
2 spec 2            red       red
3 spec 3          green          
4 spec 4 red;green;blue red, blue
5 spec 5

If you have many color names, some of which may share letters (e.g. "red" and "darkred"), you may want to wrap the color names in word boundaries:

df %>%
  mutate(colors = str_extract_all(ind, paste0("\\b(", paste0(targets, collapse = "|"), ")\\b")))
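
To see what the word boundaries change, here is a small base R sketch comparing the plain alternation with the bounded pattern. The input vector x and the helper extract_all are made up for this illustration:

```r
# Illustrative input containing "darkred", which embeds the target "red"
x <- c("red", "darkred;blue", "blue;red")
targets <- c("red", "blue")

alt     <- paste0(targets, collapse = "|")   # red|blue
bounded <- paste0("\\b(", alt, ")\\b")       # \b(red|blue)\b

# Hypothetical helper: extract all matches per element and join them with ';'
extract_all <- function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(hits, paste0, character(1), collapse = ";")
}

plain_res   <- extract_all(x, alt)       # "red" inside "darkred" is matched too
bounded_res <- extract_all(x, bounded)   # only whole-word matches survive
plain_res
bounded_res
```

Note that the backslashes must be doubled inside an R string literal; a single "\b" is a backspace character, not a word boundary.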

Here is another dplyr solution (though not the most concise):

df %>%
  mutate(
    blue = ifelse(grepl("blue", ind), "blue","other"),
    red = ifelse(grepl("red", ind), "red","other"),
    target = ifelse(blue=="blue"|red=="red", paste(red, blue), "other"),
    target = sub("^other\\s(?=blue|red)|(?<=blue|red)\\sother$", "", target, perl = TRUE)) %>%
  select(name, ind, target)  # select by name; positional -c(3:5) depends on which columns df already has
    name            ind   target
1 spec 1     blue;green     blue
2 spec 2            red      red
3 spec 3          green    other
4 spec 4 red;green;blue red blue
5 spec 5                   other

Data:

df<-data.frame(name=paste("spec",1:5),
               ind=c("blue;green","red","green","red;green;blue",""))


string

r

gsub