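Creating a new column by splitting a `chr` column, finding unique values, sorting them, removing certain values, and combining them back into one string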

Question

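I'm working in R, using tidyverse and dplyr functions to generate new columns, but I run into trouble when I try to process a string column. A detailed description of the problem follows.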

Setup

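Suppose I have a tibble called df with a chr column called col1 that contains strings. In practice, these strings are lists of values separated by a comma and a space (", "). Here is what df looks like: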

library(tidyverse)
library(dplyr)

df = data.frame(id=c(1,2,3,4,5),
                col1=c("a, b, x, a","b, b","c, b, b, b", "b, x, b, c", "c")) %>%
  as_tibble()

print(df)

# # A tibble: 5 x 2    
#      id col1         
#   <dbl> <chr>        
# 1     1 a, b, x, a      
# 2     2 b, b         
# 3     3 c, b, b, b
# 4     4 b, x, b, c      
# 5     5 c

Problem

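I want to split the values in col1 wherever ", " occurs, remove all duplicate values, sort the unique values, drop the "x" value, and then join what is left back into a single string, using ", " as the separator between multiple unique items.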

More concretely, I want to create a column col2 that looks like this:

# # A tibble: 5 x 3            
#      id col1          col2   
#   <dbl> <chr>         <chr>  
# 1     1 a, b, x, a    a, b   
# 2     2 b, b          b      
# 3     3 c, b, b, b    b, c   
# 4     4 b, x, b, c    b, c
# 5     5 c             c

My attempt so far

If I only had a single string variable, I know I could do all of the processing in a few steps:

x = "b, x, b, c"
x_temp = unique(strsplit(x, ", ")[[1]])
x_simp = paste(sort(x_temp[x_temp != "x"]), collapse=", ")
print(x_simp)
# [1] "b, c"

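However, I'm having trouble translating this process into a mutate call. Every row ends up with the result for the first row: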

newdf = df %>% 
  mutate(col2 = paste(sort(unique(strsplit(col1, ", ")[[1]])[unique(strsplit(col1, ", ")[[1]]) != "x"]), collapse=", "))

# A tibble: 5 x 3
#    id col1              col2 
# <dbl> <chr>             <chr>
#   1   1 a, b, x, a      a, b 
#   2   2 b, b            a, b 
#   3   3 c, b, b, b      a, b 
#   4   4 b, x, b, c      a, b 
#   5   5 c               a, b

Summary

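How can I use tidyverse/dplyr functions to generate a new column that is the result of applying the following processing steps to one of the columns of a tibble/df: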

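split a string/character column on a custom separator,
find the unique values,
sort them,
remove unwanted values,
combine them back into a single string/character value using a custom separator,
put the result in a new column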

Answer 1

We can do this in the tidyverse by splitting with separate_rows and then, after removing the duplicates, doing a group-by paste.

library(dplyr)
library(tidyr)
library(stringr)
df %>%
   mutate(col2 = col1) %>% 
   separate_rows(col2) %>%
   distinct(across(everything())) %>% 
   group_by(id, col1) %>% 
   summarise(col2 = str_c(sort(col2[col2 != "x"]), collapse = ", "),
       .groups = 'drop')

-output

# A tibble: 5 × 3
     id col1       col2 
  <dbl> <chr>      <chr>
1     1 a, b, x, a a, b 
2     2 b, b       b    
3     3 c, b, b, b b, c 
4     4 b, x, b, c b, c 
5     5 c          c
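
One caveat with the pipeline above: separate_rows is called without a sep argument, so it falls back to its default separator pattern, which splits on any run of non-alphanumeric characters. That works for this data, but values that themselves contain spaces or hyphens would be split further. Below is a minimal sketch of the same approach with the separator spelled out. The explicit sep = ", ", the bare distinct(), and the use of base paste instead of str_c are my changes, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
df <- tibble(id = c(1, 2, 3, 4, 5),
             col1 = c("a, b, x, a", "b, b", "c, b, b, b", "b, x, b, c", "c"))

res <- df %>%
  mutate(col2 = col1) %>%                   # keep the original string in col1
  separate_rows(col2, sep = ", ") %>%       # one row per value, split on ", " only
  distinct() %>%                            # drop duplicate (id, col1, col2) rows
  group_by(id, col1) %>%
  summarise(col2 = paste(sort(col2[col2 != "x"]), collapse = ", "),
            .groups = "drop")               # sort, drop "x", glue back together

print(res)
```

A row whose only value is "x" would come out as an empty string here rather than NA, which may or may not be what you want.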

Answer 2

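I'm not sure how efficient this is, but I just found that I can use the mapply function to apply a custom-built function to every row of the input tibble, like this: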

myfunc = function(in_str){
  temp = unique(strsplit(in_str, ", ")[[1]])
  simp = paste(sort(temp[temp != "x"]), collapse=", ")
  return(simp)
}

newdf2 = df %>% 
  mutate(col2 = mapply(myfunc, col1))

print(newdf2)
# # A tibble: 5 x 3
#      id    col1          col2 
#   <dbl>    <chr>         <chr>
# 1     1    a, b, x, a    a, b 
# 2     2    b, b          b    
# 3     3    c, b, b, b    b, c 
# 4     4    b, x, b, c    b, c 
# 5     5    c             c
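
For what it's worth, the mutate attempt in the question returns "a, b" for every row because strsplit(col1, ", ") yields a list with one element per row, and [[1]] then picks only the first row's pieces; the single pasted string is recycled down the column. Applying the function per element, as mapply does here, avoids that. A base-R variant of the same idea is sketched below with vapply, which also guarantees a plain character result with one value per row. The helper name simplify_values and its sep/drop arguments are hypothetical, made up for illustration.

```r
# Hypothetical helper (not from the original answers): takes a whole
# character vector and returns one cleaned-up string per element.
simplify_values <- function(x, sep = ", ", drop = "x") {
  pieces <- strsplit(x, sep, fixed = TRUE)    # list: one character vector per input string
  vapply(pieces, function(p) {
    p <- unique(p)                            # remove duplicates
    paste(sort(p[!p %in% drop]), collapse = sep)  # drop unwanted values, sort, rejoin
  }, character(1))
}

col1 <- c("a, b, x, a", "b, b", "c, b, b, b", "b, x, b, c", "c")
col2 <- simplify_values(col1)
print(col2)
```

Since the helper is vectorised over its input, it should slot straight into the pipeline as df %>% mutate(col2 = simplify_values(col1)), with no rowwise or mapply needed.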
