How to summarize multiple columns at once?

Question

I am trying to create a table with sample IDs, raw counts, and gene names.

In this table, each sample ID gets a new row for every gene name:

Sample ID	Gene A	Gene B
Sample 1	1	-
Sample 1	-	2
Sample 2	3	-
Sample 2	-	4

Instead of having many rows, I would like to collapse them into a single row per sample:

Sample ID	Gene A	Gene B
Sample 1	1	2
Sample 2	3	4

Here is my code so far:

dfwide <- data.wide.df %>%
  group_by(SampleId) %>%
  # one sum() per gene column; names containing spaces need backticks
  summarise(`Gene A` = sum(`Gene A`, na.rm = TRUE),
            `Gene B` = sum(`Gene B`, na.rm = TRUE))

I have over 1000 samples, so I am hoping to find a way to summarize all genes at once. Any help would be appreciated!

Answer 1

If you are always guaranteed to have the same number of values for Gene A as for Gene B within each sample, then this might work:

library(dplyr)
dat %>%
  group_by(Sample.ID) %>%
  summarize(across(starts_with("Gene"), ~ .[. != "-"]))
# # A tibble: 2 x 3
#   Sample.ID Gene.A Gene.B
#   <chr>     <chr>  <chr> 
# 1 Sample 1  1      2     
# 2 Sample 2  3      4

I am assuming you have literal "-" strings; if they are NA or empty "" instead, the condition can be modified to account for that.
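As a rough sketch of that modification, the filter below treats "-", "" and NA all as missing. The data frame `dat_na` is a made-up variant of the example data, not something from the question, so adjust the set of placeholder values to whatever your data actually contains.

```r
library(dplyr)

# Hypothetical variant of the example data that mixes NA and "" placeholders
dat_na <- data.frame(
  Sample.ID = c("Sample 1", "Sample 1", "Sample 2", "Sample 2"),
  Gene.A = c("1", NA, "3", ""),
  Gene.B = c("", "2", NA, "4")
)

res_na <- dat_na %>%
  group_by(Sample.ID) %>%
  # keep only entries that are none of "-", "" or NA
  summarize(across(starts_with("Gene"), ~ .x[!(.x %in% c("-", "", NA))]))

res_na
```

As with the version above, this is only expected to behave well when each sample has exactly one valid value per gene.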

The risk here is an uneven number of values per gene. For example, if the data were instead

dat2
#   Sample.ID Gene.A Gene.B
# 1  Sample 1      1      -
# 2  Sample 1      -      2
# 5  Sample 1      -      3
# 3  Sample 2      3      -
# 4  Sample 2      -      4

dat2 %>%
  group_by(Sample.ID) %>%
  summarize(across(starts_with("Gene"), ~ .[. != "-"]))
# # A tibble: 3 x 3
# # Groups:   Sample.ID [2]
#   Sample.ID Gene.A Gene.B
#   <chr>     <chr>  <chr> 
# 1 Sample 1  1      2     
# 2 Sample 1  1      3     
# 3 Sample 2  3      4

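Notice how the 1 is repeated across multiple rows. Because of R's "recycling", there is no error this time: the number of valid strings in Gene.B is an exact multiple of the number of valid strings in Gene.A, so nothing complains and the value is simply repeated. I think recycling is probably inappropriate here, so this may not be what you need.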

If that is the case, storing the data in a "long" format might be more appropriate:

dat %>%
  tidyr::pivot_longer(-Sample.ID, names_to = "Gene", names_pattern = "Gene\\.(.*)", values_to = "Value") %>%
  filter(Value != "-")
# # A tibble: 4 x 3
#   Sample.ID Gene  Value
#   <chr>     <chr> <chr>
# 1 Sample 1  A     1    
# 2 Sample 1  B     2    
# 3 Sample 2  A     3    
# 4 Sample 2  B     4    
dat2 %>%
  tidyr::pivot_longer(-Sample.ID, names_to = "Gene", names_pattern = "Gene\\.(.*)", values_to = "Value") %>%
  filter(Value != "-")
# # A tibble: 5 x 3
#   Sample.ID Gene  Value
#   <chr>     <chr> <chr>
# 1 Sample 1  A     1    
# 2 Sample 1  B     2    
# 3 Sample 1  B     3    
# 4 Sample 2  A     3    
# 5 Sample 2  B     4

This may require you to refactor your downstream processing, but at least it is safe.
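If you still need the wide table at the end, one possible approach is to go through the long format, check that no sample/gene pair occurs more than once, and only then widen again. This is a sketch under that assumption; the names `long`, `dupes` and `wide` are just illustrative, and converting the values to numeric is my own addition.

```r
library(dplyr)
library(tidyr)

# Same example data as `dat` in this answer, repeated so the snippet runs alone
dat <- data.frame(
  Sample.ID = c("Sample 1", "Sample 1", "Sample 2", "Sample 2"),
  Gene.A = c("1", "-", "3", "-"),
  Gene.B = c("-", "2", "-", "4")
)

long <- dat %>%
  pivot_longer(-Sample.ID, names_to = "Gene", values_to = "Value") %>%
  filter(Value != "-")

# Fail loudly if any Sample.ID/Gene pair has more than one valid value,
# instead of letting values be silently repeated
dupes <- long %>% count(Sample.ID, Gene) %>% filter(n > 1)
stopifnot(nrow(dupes) == 0)

wide <- long %>%
  mutate(Value = as.numeric(Value)) %>%
  pivot_wider(names_from = Gene, values_from = Value)

wide
```

With data like dat2, the `stopifnot()` check would stop before widening, which makes the uneven case visible rather than hiding it.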

Data:

dat <- structure(list(Sample.ID = c("Sample 1", "Sample 1", "Sample 2", "Sample 2"), Gene.A = c("1", "-", "3", "-"), Gene.B = c("-", "2", "-", "4")), class = "data.frame", row.names = c(NA, -4L))
dat2 <- structure(list(Sample.ID = c("Sample 1", "Sample 1", "Sample 1", "Sample 2", "Sample 2"), Gene.A = c("1", "-", "-", "3", "-"), Gene.B = c("-", "2", "3", "-", "4")), row.names = c(1L, 2L, 5L, 3L, 4L), class = "data.frame")
