Sum and average all values in a data frame's rows based on the value of one of its columns (R)

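I have code that creates the desired output; however, it is very slow. I have two input datasets (metaClustering_perCell and data_clean). Each row index of data_clean corresponds to the index position in metaClustering_perCell. Here are samples of the two datasets.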


data_clean <- structure(
  list(
    `NA` = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
    EGFP.A = c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328),
    CD43.PE.A = c(435, 402, 469, 283, 303, 371, 442, 363, 444, 358),
    CD45.PE.Vio770.A = c(399, 385, 379, 438, 384, 331, 402, 392, 354, 430),
    CD235a_41a.APC.A = c(412, 618, 239, 562, 661, 193, 363, 385, 408, 265),
    APC.Vio770.A = c(447, 491, 444, 437, 477, 328, 453, 326, 353, 0)
  ),
  row.names = c(NA, -10L),
  class = "data.frame"
)

NA EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 326 435 399 412 447
2 314 402 385 618 491
3 341 469 379 239 444
4 0 283 438 562 437
5 198 303 384 661 477
6 295 371 331 193 328
7 325 442 402 363 453
8 309 363 392 385 326
9 400 444 354 408 353
10 328 358 430 265 0

metaClustering_perCell <- c(
  "1 Population", "1 Population", "1 Population", "1 Population", "1 Population",
  "1 Population", "1 Population", "1 Population", "1 Population", "9 Population"
)

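Eventually I want to make a heatmap from the marker averages (EGFP.A, CD43.PE.A, ...). However, my dataset will contain nearly 2e8 cells sorted into a predetermined number of populations. The code I wrote is shown here; it creates 2 empty data frames. df_sum stores the running sum of the markers (EGFP.A, CD43.PE.A, ...), while df_count keeps a running count of the total events in each population. Finally, the code takes the average by dividing the data frame by the vector. The code is here.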

# create an empty matrix
df_sum  <- data.frame(matrix(ncol = length(data_clean), nrow = num_clusters))
pops_header <- unique(metaClustering_perCell)
rownames(df_sum) <- pops_header
colnames(df_sum) <- colnames(data_clean)

# creates empty table for storing the count values
df_count <- data.frame(matrix(ncol = num_clusters, nrow = 1))
colnames(df_count) <- pops_header

df_sum[is.na(df_sum)] <- 0
df_count[is.na(df_count)] <- 0

for (i in 1:length(metaClustering_perCell)){

  # only takes one row at a time of original data
  volt_vals <- data_clean[i,]
  # find the column to place it in (population)
  pop <- metaClustering_perCell[i]
  # Tally for each population
  df_count[1,pop] <- df_count[1,pop] + 1
  # adds to the previous value in the dataframe
  for (a in colnames(volt_vals)){
    df_sum[pop, a] <- volt_vals[a] + df_sum[pop, a]
  }
  # creates another dataframe same size as df_sum to overwrite with the averages
  df_aves <- df_sum
  # divide each population's sums by its event count
  for (n in pops_header){
    df_aves[n,] <- mapply('/', df_sum[n,], df_count[n])
  }
}


The output I get looks like this (I trimmed it to make it easier to read):

NA EGFP.A CD43.PE.A CD45.PE.Vio770.A
1 Population 26062897 35936578 32784372.
9 Population 1045468 1591084 1576716.
2 Population 4374137 8673145 6555053.
8 Population 818413 44836 1318176.
5 Population 217605 443341 439357.
6 Population 1056157 1558711 43206.
7 Population 747037 883763 1134664.
3 Population 1561994 2376586 2329772.
4 Population 54940 9346 137085.
10 Population 172735 213079 8043.

1 Population 9 Population 2 Population 8 Population 5 Population
78909 4262 12982 4447 1392

> head(df_aves[1:3], 10)
NA EGFP.A CD43.PE.A CD45.PE.Vio770.A
1 Population 330.2905 455.41799 415.470631
9 Population 245.2999 373.31863 369.947443
2 Population 336.9386 668.09005 504.933986
8 Population 184.0371 10.08230 296.419159
5 Population 156.3254 318.49210 315.630029
6 Population 235.1195 346.99711 9.618433
7 Population 186.1079 220.17015 282.676632
3 Population 256.1906 389.79597 382.117763
4 Population 160.1749 27.24781 399.664723
10 Population 201.5578 248.63361 9.385064

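A data frame of averages for each population, with the markers as column headers, is exactly what I want... however, it is very slow, and I mean brutally slow. This is my first week using R (I taught myself Python from Stack), so please explain in detail. Thanks for your help.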




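Grouping data in R and then summarizing it with an aggregate function is very simple.

Solution 1.1: dplyr

Here is a solution with the dplyr package, whose syntax is intuitive: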


library(dplyr)

data_clean %>%
  # Overwrite the 'NA' column with the cluster labels.
  mutate(`NA` = metaClustering_perCell) %>%
  # Group by cluster labels...
  group_by(`NA`) %>%
  # ...and summarize the average of each marker (column).
  summarize(across(everything(), mean))

Solution 1.2: data.table

Here is a solution with data.table, which offers better performance:


library(data.table)

as.data.table(data_clean)[,
  # Overwrite the 'NA' column with the cluster labels.
  ("NA") := metaClustering_perCell
][,
  # Summarize the average of each marker (column), as grouped by cluster.
  lapply(.SD, mean), by = `NA`
]


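Here the values of data_clean and metaClustering_perCell are the same as in the sample from your question.

Although the first result (1.1) will be a tibble and the second (1.2) will be a data.table, each will contain the following data: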

          NA   EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 278.6667  390.2222         384.8889         426.7778     417.3333
9 Population 328.0000  358.0000         430.0000         265.0000       0.0000
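
For comparison, the sum-and-count bookkeeping from the question can also be expressed in base R without any loop. This is only a sketch and not part of the two solutions above: it rebuilds the question's sample data so that it runs on its own (leaving out the `NA` index column), and it uses `rowsum()` for the per-population sums and `table()` for the per-population counts.

```r
# Sample data from the question (the `NA` index column is omitted here).
data_clean <- data.frame(
  EGFP.A           = c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328),
  CD43.PE.A        = c(435, 402, 469, 283, 303, 371, 442, 363, 444, 358),
  CD45.PE.Vio770.A = c(399, 385, 379, 438, 384, 331, 402, 392, 354, 430),
  CD235a_41a.APC.A = c(412, 618, 239, 562, 661, 193, 363, 385, 408, 265),
  APC.Vio770.A     = c(447, 491, 444, 437, 477, 328, 453, 326, 353, 0)
)
metaClustering_perCell <- c(rep("1 Population", 9), "9 Population")

# Per-population column sums (the role df_sum plays in the question).
df_sum <- rowsum(data_clean, group = metaClustering_perCell)

# Per-population event counts (the role df_count plays in the question).
df_count <- table(metaClustering_perCell)

# Divide each row of sums by the count of the matching population.
df_aves <- df_sum / as.vector(df_count[rownames(df_sum)])
print(df_aves)
```

The row names of `df_aves` are the population labels, so the result has the same layout as the `df_aves` table in the question.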

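Cumulative average for each observation

This interpretation is most consistent with your algorithm, which seems to compute its metrics (averages, etc.) on a running basis for each observation (row).

R also facilitates cumulative averages, sums, and so on. It is far more efficient to use vectorized operations than to compute these metrics iteratively (with loops, the *apply() family, etc.) for every row.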
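
To make the contrast concrete, here is a small sketch that computes a running average both ways for one marker column taken from the question's sample (the first nine EGFP.A values). The variable names are illustrative only; the point is that the vectorized expression reproduces the loop's result in a single line.

```r
# One marker column from the question's sample ("1 Population" rows).
x <- c(326, 314, 341, 0, 198, 295, 325, 309, 400)

# Iterative version: update a running sum and divide by the running count.
running_loop <- numeric(length(x))
total <- 0
for (i in seq_along(x)) {
  total <- total + x[i]
  running_loop[i] <- total / i
}

# Vectorized version: the same arithmetic with no explicit loop.
running_vec <- cumsum(x) / seq_along(x)

# Both approaches agree.
stopifnot(isTRUE(all.equal(running_loop, running_vec)))
print(running_vec)
```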

Solution 2.1: dplyr



library(dplyr)

data_clean %>%
  # Overwrite the 'NA' column with the cluster labels.
  mutate(`NA` = metaClustering_perCell) %>%
  # Group by cluster labels...
  group_by(`NA`) %>%
  # ...and overwrite each marker (column) with its running average.
  mutate(across(everything(), cummean)) %>% ungroup()

Solution 2.2: data.table

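The function

function(x) {
  cumsum(x) / seq_along(x)
}

divides the running sum by the running count to compute the cumulative average along a vector (column). We could also import dplyr and use cummean in place of our function.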


library(data.table)

as.data.table(data_clean)[,
  # Overwrite the 'NA' column with the cluster labels.
  ("NA") := metaClustering_perCell
][,
  # Overwrite each marker (column) with its running average, as grouped by cluster.
  lapply(.SD, function(x) cumsum(x) / seq_along(x)), by = `NA`
]


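Here the values of data_clean and metaClustering_perCell are the same as in the sample from your question.

Although the first result (2.1) will be a tibble and the second (2.2) will be a data.table, each will contain the following data: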

          NA   EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 326.0000  435.0000         399.0000         412.0000     447.0000
1 Population 320.0000  418.5000         392.0000         515.0000     469.0000
1 Population 327.0000  435.3333         387.6667         423.0000     460.6667
1 Population 245.2500  397.2500         400.2500         457.7500     454.7500
1 Population 235.8000  378.4000         397.0000         498.4000     459.2000
1 Population 245.6667  377.1667         386.0000         447.5000     437.3333
1 Population 257.0000  386.4286         388.2857         435.4286     439.5714
1 Population 263.5000  383.5000         388.7500         429.1250     425.3750
1 Population 278.6667  390.2222         384.8889         426.7778     417.3333
9 Population 328.0000  358.0000         430.0000         265.0000       0.0000
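
If neither package is available, a grouped running average can also be sketched in base R with `ave()`, which applies a function within each group and returns the values in the original row order. The example below is an illustration under the same sample data, shown for a single marker; `running_mean` is a hypothetical helper name wrapping the formula from solution 2.2.

```r
# One marker and the cluster labels from the question's sample.
EGFP.A <- c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328)
metaClustering_perCell <- c(rep("1 Population", 9), "9 Population")

# Hypothetical helper: running sum divided by running count.
running_mean <- function(x) cumsum(x) / seq_along(x)

# Apply the helper separately within each population.
EGFP_running <- ave(EGFP.A, metaClustering_perCell, FUN = running_mean)
print(EGFP_running)
```

The values for this marker should match the EGFP.A column of the table above, with the running average restarting at the "9 Population" row.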