Dplyr: using summarise across to take mean of columns only if row value > 0

Question

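I have a data frame of gene expression scores (cells x genes). The cluster each cell belongs to is also stored as a column.

I want to calculate the mean expression value per cluster for a set of genes (columns), but I only want to include values > 0 in these calculations.

My attempt is as follows: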

test <- gene_scores_df2 %>% 
  select(all_of(gene_list), Clusters) %>%
  group_by(Clusters) %>%
  summarize(across(c(1:13), ~mean(. > 0)))

This produces the following tibble:

# A tibble: 16 x 14
   Clusters SLC17A7  GAD1  GAD2 SLC32A1  GLI3   TNC PROX1  SCGN   LHX6 NXPH1 MEIS2 ZFHX3     C3
   <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
 1 C1         0.611 0.605 0.817   0.850 0.979 0.590 0.725 0.434 0.275  0.728 0.949 0.886 0.332 
 2 C10        0.484 0.401 0.434   0.401 0.791 0.387 0.431 0.362 0.204  0.652 0.715 0.580 0.186 
 3 C11        0.495 0.5   0.538   0.412 0.847 0.437 0.516 0.453 0.187  0.764 0.804 0.640 0.160 
 4 C12        0.807 0.626 0.559   0.703 0.942 0.448 0.644 0.366 0.403  0.702 0.917 0.859 0.228 
 5 C13        0.489 0.578 0.709   0.719 0.796 0.409 0.565 0.371 0.367  0.773 0.716 0.776 0.169 
 6 C14        0.541 0.347 0.330   0.388 0.731 0.281 0.438 0.279 0.198  0.577 0.777 0.633 0.128 
 7 C15        0.152 0.306 0.337   0.198 0.629 0.304 0.331 0.179 0.132  0.496 0.509 0.405 0.0556
 8 C16        0.402 0.422 0.542   0.418 0.813 0.514 0.614 0.287 0.267  0.729 0.574 0.737 0.279 
 9 C2         0.152 0.480 0.458   0.297 0.883 0.423 0.511 0.195 0.152  0.722 0.692 0.598 0.0632
10 C3         0.585 0.679 0.659   0.711 0.996 0.886 0.801 0.297 0.305  0.789 0.992 0.963 0.346 
11 C4         0.567 0.756 0.893   0.940 0.892 0.334 0.797 0.750 0.376  0.686 0.897 0.885 0.240 
12 C5         0.220 0.516 0.560   0.625 0.673 0.250 0.466 0.275 0.358  0.590 0.571 0.641 0.112 
13 C6         0.558 0.908 0.836   0.973 0.725 0.280 0.830 0.642 0.871  0.927 0.830 0.916 0.202 
14 C7         0.380 0.743 0.749   0.772 0.825 0.415 0.480 0.211 0.199  0.614 0.860 0.901 0.135 
15 C8         0.616 0.348 0.312   0.334 0.749 0.271 0.451 0.520 0.129  0.542 0.743 0.735 0.147 
16 C9         0.406 0.381 0.400   0.265 0.679 0.266 0.465 0.233 0.0820 0.648 0.565 0.557 0.119

However, when I check this against (what I assume is) an equivalent procedure on a single column, I get different means.

Here is the code for SLC17A7:

gene_scores_df2 %>% 
  select(SLC17A7, Clusters) %>%
  group_by(Clusters) %>%
  filter(SLC17A7 > 0) %>%
  summarize(SLC17A7 = mean(SLC17A7))

Result:

# A tibble: 16 x 2
   Clusters SLC17A7
   <chr>      <dbl>
 1 C1         0.780
 2 C10        1.42 
 3 C11        1.21 
 4 C12        1.64 
 5 C13        1.09 
 6 C14        1.83 
 7 C15        1.61 
 8 C16        0.968
 9 C2         1.09 
10 C3         0.512
11 C4         0.920
12 C5         1.53 
13 C6         0.814
14 C7         1.22 
15 C8         2.24 
16 C9         1.72

I'm not sure what exactly is going wrong with my first attempt above.

Any suggestions would be greatly appreciated.

Snippet of the original df

# First 20 rows of:
gene_scores_df2 %>% 
       select(all_of(gene_list), Clusters) %>%
       group_by(Clusters)

structure(list(SLC17A7 = c(0.273, 0.722, 0.699, 0.71, 0.333, 
0.674, 0.63, 0.481, 0.274, 0.981, 0.586, 0.401, 0.325, 0.583, 
0, 0.348, 0.287, 0, 0.295, 0.351), GAD1 = c(0.355, 0.392, 0.455, 
0.34, 0.108, 1.169, 0, 0.426, 2.219, 0.099, 1.16, 0.332, 0.404, 
0.284, 0, 5.297, 0.518, 0.027, 1.19, 0.346), GAD2 = c(0.12, 0.562, 
0.337, 0.49, 0.095, 0.958, 0.09, 1.518, 1.464, 0.175, 0.419, 
0.536, 0.501, 1.103, 0.343, 0, 0.247, 0, 0.635, 0.906), SLC32A1 = c(0, 
0.97, 0.067, 0.999, 0.224, 1.04, 0, 2.569, 1.544, 0.059, 2.177, 
3.227, 3.603, 1.229, 0.102, 2.421, 0.055, 0.826, 2.646, 0.228
), GLI3 = c(1.527, 0.487, 0.341, 3.352, 0.346, 0.694, 1.395, 
0.767, 1.334, 1.373, 1.7, 2.216, 0.394, 1.029, 1.235, 0.55, 2.043, 
4.469, 2.901, 4.139), TNC = c(0, 0, 0.448, 0.03, 1.377, 0.045, 
0, 0.169, 0.123, 0, 0.188, 0.075, 0, 1.074, 0, 1.272, 0.124, 
0.505, 0.173, 0.889), PROX1 = c(0, 0.075, 0.167, 0.782, 0.802, 
0.561, 0.098, 0.734, 0.448, 1.645, 0.735, 0.795, 0.102, 0.317, 
0.124, 0.324, 0.352, 0.236, 0.826, 0.308), SCGN = c(0.696, 0.234, 
0, 0.202, 0.059, 0.162, 0, 0.653, 0.383, 0.42, 0.094, 0.779, 
0.228, 0.248, 0.171, 0.089, 0.081, 0.026, 0.159, 0), LHX6 = c(0, 
0, 0.134, 0.1, 0.829, 1.489, 0, 0.38, 0.526, 0.117, 0, 0.205, 
0.299, 2.235, 0, 1.335, 0, 0.115, 0.454, 0.108), NXPH1 = c(0.792, 
0.143, 0.175, 0.658, 0, 1.034, 1.798, 0.219, 0.896, 0.249, 1.336, 
1.507, 0.26, 0.242, 1.235, 2.16, 0.235, 0.349, 1.297, 2.234), 
    MEIS2 = c(4.337, 0.559, 0.978, 1.972, 0.964, 0.657, 0.162, 
    0.827, 0.882, 0.157, 1.494, 1.171, 2.524, 2.458, 0.205, 0.448, 
    2.027, 4.767, 1.514, 2.077), ZFHX3 = c(1.48, 1.38, 2.323, 
    1.039, 1.343, 1.354, 0.238, 1.224, 1.676, 0.811, 0.316, 2.012, 
    2.298, 1.869, 0.201, 0.176, 1.829, 1.081, 0.522, 0.959), 
    C3 = c(0.52, 0.527, 0, 0.073, 0, 0.15, 0.094, 0.315, 0.174, 
    0, 0, 0.17, 0.165, 0, 0.237, 0, 0.091, 0.095, 0, 0.081), 
    Clusters = c("C12", "C5", "C13", "C4", "C12", "C13", "C13", 
    "C4", "C6", "C8", "C4", "C4", "C4", "C12", "C5", "C6", "C1", 
    "C3", "C4", "C3")), row.names = c(NA, -20L), groups = structure(list(
    Clusters = c("C1", "C12", "C13", "C3", "C4", "C5", "C6", 
    "C8"), .rows = structure(list(17L, c(1L, 5L, 14L), c(3L, 
    6L, 7L), c(18L, 20L), c(4L, 8L, 11L, 12L, 13L, 19L), c(2L, 
    15L), c(9L, 16L), 10L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -8L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

Answer 1

What you want is:

library(tidyverse)
df %>%
  group_by(Clusters) %>%
  summarize(across(everything(), ~mean(.[. > 0])))

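~mean(. > 0) checks whether each element is greater than 0, so it returns TRUE/FALSE and then gives you the mean of the underlying 0/1 values, i.e. the proportion of values in each group that are above 0. Instead, you want to filter each column before averaging, which you can do with the usual [] subsetting.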
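To make the difference concrete, here is a minimal sketch on made-up data. The `toy` tibble and its column names `gene1` and `gene2` are hypothetical and not taken from the question; only `Clusters` mirrors the asker's column. It compares the two expressions on a single vector and then applies the subsetting version per cluster.

```r
library(dplyr)

# Hypothetical toy data (not the asker's data): two clusters, two genes
toy <- tibble(
  Clusters = c("A", "A", "A", "B", "B"),
  gene1    = c(0, 2, 4, 0, 0),
  gene2    = c(1, 0, 3, 5, 0)
)

# gene1 values for cluster "A": c(0, 2, 4)
x <- toy$gene1[toy$Clusters == "A"]

# Mean of a logical vector: the share of values above 0
prop_positive <- mean(x > 0)

# Subset first, then average: the mean of the positive values only
mean_positive <- mean(x[x > 0])

# The same subsetting idea applied to every non-grouping column per cluster
result <- toy %>%
  group_by(Clusters) %>%
  summarize(across(everything(), ~ mean(.x[.x > 0])))

print(result)
```

One thing to watch for: if a cluster has no values above 0 for a gene (as with `gene1` in cluster "B" here), the subset is empty and `mean()` returns `NaN` for that cell, so you may want to handle that case explicitly.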
