How to summarize multiple columns at once?

I am trying to create a table with sample IDs, raw counts, and gene names.

In this table, each sample ID gets a new row for each gene name:

Sample ID  Gene A  Gene B
Sample 1   1       -
Sample 1   -       2
Sample 2   3       -
Sample 2   -       4

Instead of having many rows, I would like to collapse them into a single row per sample:

Sample ID  Gene A  Gene B
Sample 1   1       2
Sample 2   3       4

Here is my code so far:

dfwide = data.wide.df %>% group_by(SampleId) %>%
  summarise(`Sample 1` = sum(`Sample 1`, na.rm = TRUE),
            `Sample 2` = sum(`Sample 2`, na.rm = TRUE))

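I have more than 1,000 samples, so I am hoping to find a way to summarize all of the genes at once. Any help would be greatly appreciated!

If you are always guaranteed to have as many Gene A values as Gene B values, then this might work: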

library(dplyr)
dat %>%
  group_by(Sample.ID) %>%
  summarize(across(starts_with("Gene"), ~ .[. != "-"]))
# # A tibble: 2 x 3
#   Sample.ID Gene.A Gene.B
#   <chr>     <chr>  <chr> 
# 1 Sample 1  1      2     
# 2 Sample 2  3      4     

I'm assuming that you have literal "-" strings; if they are NA or empty "" instead, the condition can be modified to account for that.
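As a rough sketch of that modification, assuming the placeholders are NA or empty strings rather than "-": the `dat_na` data frame and the `is_missing()` helper below are hypothetical, made up purely for illustration.

```r
library(dplyr)

# Hypothetical data: same layout as `dat`, but with NA and "" as placeholders
dat_na <- data.frame(
  Sample.ID = c("Sample 1", "Sample 1", "Sample 2", "Sample 2"),
  Gene.A    = c("1", NA, "3", ""),
  Gene.B    = c(NA, "2", "", "4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for anything that should count as "no value"
is_missing <- function(x) is.na(x) | x %in% c("", "-")

# Keep only the real values in each gene column, per sample
res_na <- dat_na %>%
  group_by(Sample.ID) %>%
  summarize(across(starts_with("Gene"), ~ .x[!is_missing(.x)]))
```

Using `%in%` rather than `==` inside the helper avoids NA results when a value is itself NA, so the filter stays a clean TRUE/FALSE vector.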

The risk here is if the gene counts are uneven. For example, if the data were instead

dat2
#   Sample.ID Gene.A Gene.B
# 1  Sample 1      1      -
# 2  Sample 1      -      2
# 5  Sample 1      -      3
# 3  Sample 2      3      -
# 4  Sample 2      -      4

dat2 %>%
  group_by(Sample.ID) %>%
  summarize(across(starts_with("Gene"), ~ .[. != "-"]))
# # A tibble: 3 x 3
# # Groups:   Sample.ID [2]
#   Sample.ID Gene.A Gene.B
#   <chr>     <chr>  <chr> 
# 1 Sample 1  1      2     
# 2 Sample 1  1      3     
# 3 Sample 2  3      4     

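You can see how the 1 is repeated across multiple rows. Because of R's "recycling", there is no error this time: the number of valid strings in Gene.B is a clean multiple of the number of valid strings in Gene.A (here, two against one), so there is no complaint and the value is simply repeated. I suspect recycling is not appropriate here, so this may not be what you need.

If that is the case, storing the data in a "long" format may be more appropriate: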
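One way to catch this before summarizing, offered as a sketch rather than a definitive check, is to count the valid values per sample and flag any sample where a gene column does not hold exactly one. The `counts` and `uneven` names are illustrative, and the "exactly one value per gene" rule is an assumption about what the collapsed table should contain.

```r
library(dplyr)

dat2 <- data.frame(
  Sample.ID = c("Sample 1", "Sample 1", "Sample 1", "Sample 2", "Sample 2"),
  Gene.A    = c("1", "-", "-", "3", "-"),
  Gene.B    = c("-", "2", "3", "-", "4"),
  stringsAsFactors = FALSE
)

# Number of valid (non-"-") values per sample in each gene column
counts <- dat2 %>%
  group_by(Sample.ID) %>%
  summarize(across(starts_with("Gene"), ~ sum(.x != "-")))

# Samples where at least one gene column does not have exactly one value
uneven <- counts %>%
  filter(if_any(starts_with("Gene"), ~ .x != 1L))
```

If `uneven` has zero rows, collapsing to one row per sample should not trigger any recycling; otherwise the listed samples need a decision first (sum, keep the first value, or stay long).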

dat %>%
  tidyr::pivot_longer(-Sample.ID, names_to = "Gene", names_pattern = "Gene\\.(.*)", values_to = "Value") %>%
  filter(Value != "-")
# # A tibble: 4 x 3
#   Sample.ID Gene  Value
#   <chr>     <chr> <chr>
# 1 Sample 1  A     1    
# 2 Sample 1  B     2    
# 3 Sample 2  A     3    
# 4 Sample 2  B     4    
dat2 %>%
  tidyr::pivot_longer(-Sample.ID, names_to = "Gene", names_pattern = "Gene\\.(.*)", values_to = "Value") %>%
  filter(Value != "-")
# # A tibble: 5 x 3
#   Sample.ID Gene  Value
#   <chr>     <chr> <chr>
# 1 Sample 1  A     1    
# 2 Sample 1  B     2    
# 3 Sample 1  B     3    
# 4 Sample 2  A     3    
# 5 Sample 2  B     4    

This may require you to restructure your downstream processing, but at least it is safe.
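As one possible sketch of downstream processing in the long format: this assumes the values are numeric counts and that summing duplicates within a sample and gene is acceptable, which the question hints at with `sum()` but does not confirm. The `long`, `totals`, and `wide` names are illustrative.

```r
library(dplyr)
library(tidyr)

dat2 <- data.frame(
  Sample.ID = c("Sample 1", "Sample 1", "Sample 1", "Sample 2", "Sample 2"),
  Gene.A    = c("1", "-", "-", "3", "-"),
  Gene.B    = c("-", "2", "3", "-", "4"),
  stringsAsFactors = FALSE
)

# Long format, with "-" placeholders dropped and counts converted to numeric
long <- dat2 %>%
  pivot_longer(-Sample.ID, names_to = "Gene",
               names_pattern = "Gene\\.(.*)", values_to = "Value") %>%
  filter(Value != "-") %>%
  mutate(Value = as.numeric(Value))

# Assumption: duplicates within a sample/gene pair are summed
totals <- long %>%
  group_by(Sample.ID, Gene) %>%
  summarize(Total = sum(Value), .groups = "drop")

# Back to one row per sample, now that each cell is a single number
wide <- totals %>%
  pivot_wider(names_from = Gene, values_from = Total, names_prefix = "Gene.")
```

Because the aggregation is explicit, uneven counts are resolved by a rule you choose rather than by recycling.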


Data:

dat <- structure(list(Sample.ID = c("Sample 1", "Sample 1", "Sample 2", "Sample 2"), Gene.A = c("1", "-", "3", "-"), Gene.B = c("-", "2", "-", "4")), class = "data.frame", row.names = c(NA, -4L))
dat2 <- structure(list(Sample.ID = c("Sample 1", "Sample 1", "Sample 1", "Sample 2", "Sample 2"), Gene.A = c("1", "-", "-", "3", "-"), Gene.B = c("-", "2", "3", "-", "4")), row.names = c(1L, 2L, 5L, 3L, 4L), class = "data.frame")