Sum and find the average of all the values in a data frame's rows based on one of the data frame's column values in R

Question

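I have code that creates the desired output; however, it is very slow. I have two input datasets (metaClustering_perCell, data_clean). Each row index of data_clean corresponds to the same index position in metaClustering_perCell. Here is a sample of the two datasets.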

dput(head(data_clean[1:5],10))

structure(
  list(
    `NA` = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
    EGFP.A = c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328),
    CD43.PE.A = c(435, 402, 469, 283, 303, 371, 442, 363, 444, 358),
    CD45.PE.Vio770.A = c(399, 385, 379, 438, 384, 331, 402, 392, 354, 430),
    CD235a_41a.APC.A = c(412, 618, 239, 562, 661, 193, 363, 385, 408, 265),
    APC.Vio770.A = c(447, 491, 444, 437, 477, 328, 453, 326, 353, 0)
  ),
  row.names = c(NA, -10L),
  class = "data.frame"
)

NA	EGFP.A	CD43.PE.A	CD45.PE.Vio770.A	CD235a_41a.APC.A	APC.Vio770.A
1	326	435	399	412	447
2	314	402	385	618	491
3	341	469	379	239	444
4	0	283	438	562	437
5	198	303	384	661	477
6	295	371	331	193	328
7	325	442	402	363	453
8	309	363	392	385	326
9	400	444	354	408	353
10	328	358	430	265	0

dput(head(metaClustering_perCell,10))

c("1 Population", "1 Population", "1 Population", "1 Population", "1 Population",
"1 Population", "1 Population", "1 Population", "1 Population", "9 Population")

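I would eventually like to make a heat map from the averages of the markers (EGFP.A, CD43.PE.A, ...). However, my dataset will contain nearly 2e8 cells sorted into a predetermined number of populations. The code I wrote is shown here; it creates two empty data frames. df_sum stores the running sum of the markers (EGFP.A, CD43.PE.A, ...), while df_count keeps a running count of the total events in each population. Finally, the code takes the averages by dividing the data frame by the vector. The code is here.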

# create an empty matrix
df_sum  <- data.frame(matrix(ncol = length(data_clean), nrow = num_clusters))
pops_header <- unique(metaClustering_perCell)
rownames(df_sum) <- pops_header
colnames(df_sum) <- colnames(data_clean)

# creates empty table for storing the count values
df_count <- data.frame(matrix(ncol = num_clusters, nrow = 1))
colnames(df_count) <- pops_header



df_sum[is.na(df_sum)] <- 0
df_count[is.na(df_count)] <- 0



for (i in 1:length(metaClustering_perCell)){

  # only takes one row at a time of original data
  volt_vals <- data_clean[i,]
  
  # find the column to place it in (population)
  pop <- metaClustering_perCell[i]
  
  # Tally for each population
  df_count[1,pop] <- df_count[1,pop] + 1
  
  # adds to the previous value in the dataframe
  for (a in colnames(volt_vals)){
    df_sum[pop, a] <- volt_vals[a] + df_sum[pop, a]
  }
    
  # creates another dataframe same size as df to overwrite with the averages
  df_aves <- df_sum
  
  
  # Divide each population's sums in df_sum by its count in df_count
  for (n in pops_header){
    df_aves[n,] <- mapply('/', df_sum[n,], df_count[n])
  }
}

The output I get looks like this (I trimmed it to make it easier to see):

>head(df_sum[1:3],10)

NA	EGFP.A	CD43.PE.A	CD45.PE.Vio770.A
1 Population	26062897	35936578	32784372.
9 Population	1045468	1591084	1576716.
2 Population	4374137	8673145	6555053.
8 Population	818413	44836	1318176.
5 Population	217605	443341	439357.
6 Population	1056157	1558711	43206.
7 Population	747037	883763	1134664.
3 Population	1561994	2376586	2329772.
4 Population	54940	9346	137085.
10 Population	172735	213079	8043.

>head(df_count[1:5])

Population 9	Population 2	Population 8	Population 5	Population
78909	4262	12982	4447	1392

> head(df_aves[1:3], 10)

NA	EGFP.A	CD43.PE.A	CD45.PE.Vio770.A
1 Population	330.2905	455.41799	415.470631
9 Population	245.2999	373.31863	369.947443
2 Population	336.9386	668.09005	504.933986
8 Population	184.0371	10.08230	296.419159
5 Population	156.3254	318.49210	315.630029
6 Population	235.1195	346.99711	9.618433
7 Population	186.1079	220.17015	282.676632
3 Population	256.1906	389.79597	382.117763
4 Population	160.1749	27.24781	399.664723
10 Population	201.5578	248.63361	9.385064

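The data frame of averages, with one row per population and one column per header (marker), is exactly what I want. However, it is painfully slow, and I mean brutal. This is my first week using R (I taught myself Python from Stack Overflow), so please explain in detail. Thanks for your help.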

Answer 1

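It is unclear exactly what you are trying to achieve, and the sample data are too sparse to help disambiguate, but here are my two guesses:

Average of Each Marker within Each Population

This interpretation is most consistent with your sample output, where each population (cluster) appears only once, as if the data were summarized by population.

In R, it is quite simple to group the data and then summarize it with an aggregate function.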
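As a point of reference, here is a minimal base-R sketch of the same group-then-summarize idea using aggregate(). It rebuilds the sample data from the question and assumes that the first column (named NA) is only a row counter that can be dropped; the names markers and means_by_pop are illustrative and not part of the original code.

```r
# Sample data from the question (first 10 rows, first 6 columns).
data_clean <- data.frame(
  `NA` = 1:10,
  EGFP.A = c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328),
  CD43.PE.A = c(435, 402, 469, 283, 303, 371, 442, 363, 444, 358),
  CD45.PE.Vio770.A = c(399, 385, 379, 438, 384, 331, 402, 392, 354, 430),
  CD235a_41a.APC.A = c(412, 618, 239, 562, 661, 193, 363, 385, 408, 265),
  APC.Vio770.A = c(447, 491, 444, 437, 477, 328, 453, 326, 353, 0),
  check.names = FALSE
)
metaClustering_perCell <- c(rep("1 Population", 9), "9 Population")

# Assumption: the first column is a row counter, not a marker, so drop it.
markers <- data_clean[-1]

# Group the rows by population label and take the mean of every marker column.
means_by_pop <- aggregate(
  markers,
  by = list(Population = metaClustering_perCell),
  FUN = mean
)
print(means_by_pop)
```

The result has one row per population and one column per marker, which is the shape of the df_aves table in the question.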

Solution 1.1: `dplyr`

Here is a solution with the dplyr package, whose syntax is intuitive:

library(dplyr)

data_clean %>%
  # Overwrite the 'NA' column with the cluster labels.
  mutate(`NA` = metaClustering_perCell) %>%
  # Group by cluster labels...
  group_by(`NA`) %>%
  # ...and summarize the average of each marker (column).
  summarize(across(everything(), mean))

Solution 1.2: `data.table`

Here is a solution with data.table, which offers better performance.

library(data.table)

as.data.table(data_clean)[,
  # Overwrite the 'NA' column with the cluster labels.
  ("NA") := metaClustering_perCell
][,
  # Summarize the average of each marker (column), as grouped by cluster.
  lapply(.SD, mean), by = `NA`
]

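Results

Given data_clean and metaClustering_perCell with the same values as the samples in your question, the first result (1.1) will be a tibble and the second (1.2) will be a data.table, but each will contain the following data: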

          NA   EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 278.6667  390.2222         384.8889         426.7778     417.3333
9 Population 328.0000  358.0000         430.0000         265.0000       0.0000
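For comparison with the bookkeeping in the question, the sketch below computes the same three pieces (sums, counts, averages) without a loop, using base R's rowsum() and table(). It is only an illustration under the assumption that the first column is a row counter; pop_sum, pop_count, and pop_mean are hypothetical names that play the roles of df_sum, df_count, and df_aves.

```r
# Sample data from the question (first 10 rows, first 6 columns).
data_clean <- data.frame(
  `NA` = 1:10,
  EGFP.A = c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328),
  CD43.PE.A = c(435, 402, 469, 283, 303, 371, 442, 363, 444, 358),
  CD45.PE.Vio770.A = c(399, 385, 379, 438, 384, 331, 402, 392, 354, 430),
  CD235a_41a.APC.A = c(412, 618, 239, 562, 661, 193, 363, 385, 408, 265),
  APC.Vio770.A = c(447, 491, 444, 437, 477, 328, 453, 326, 353, 0),
  check.names = FALSE
)
metaClustering_perCell <- c(rep("1 Population", 9), "9 Population")

# Assumption: drop the row-counter column and work on a numeric matrix.
marker_mat <- as.matrix(data_clean[-1])

# Per-population column sums (the role df_sum plays in the question).
pop_sum <- rowsum(marker_mat, group = metaClustering_perCell)

# Per-population event counts (the role df_count plays in the question).
pop_count <- table(metaClustering_perCell)

# Divide each row of sums by the count for that population.
# Indexing by rownames keeps sums and counts aligned by label.
pop_mean <- pop_sum / as.vector(pop_count[rownames(pop_sum)])
print(pop_mean)
```

Because pop_mean is a plain numeric matrix with populations as row names, it can be handed to a plotting function that expects a matrix.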

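Cumulative Average for Each Observation

This interpretation is most consistent with your algorithm, which seems to calculate its metrics (averages, etc.) for each observation (row) on a running basis.

R also facilitates cumulative averages, sums, and so on. It is far more efficient to use vectorized operations than to compute these metrics iteratively (with loops, the *apply() family, etc.) for every row.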
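To make the contrast concrete, here is a small sketch that computes a running average of one marker both ways: once with an explicit loop and once with vectorized cumsum(). The vector x holds the EGFP.A values of the nine "1 Population" rows from the sample; running_loop and running_vec are illustrative names.

```r
# EGFP.A values for the nine "1 Population" rows in the question's sample.
x <- c(326, 314, 341, 0, 198, 295, 325, 309, 400)

# Iterative version: recompute the mean of the first i values for every i.
running_loop <- numeric(length(x))
for (i in seq_along(x)) {
  running_loop[i] <- mean(x[1:i])
}

# Vectorized version: running sum divided by running count.
running_vec <- cumsum(x) / seq_along(x)

# Both approaches agree up to floating-point tolerance.
print(all.equal(running_loop, running_vec))
```

The loop redoes the summation for every prefix, so its work grows quadratically with the number of rows, whereas the vectorized form makes a single pass over the data.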

Solution 2.1: `dplyr`

As it happens, dplyr already has its own cummean() function.

library(dplyr)

data_clean %>%
  # Overwrite the 'NA' column with the cluster labels.
  mutate(`NA` = metaClustering_perCell) %>%
  # Group by cluster labels...
  group_by(`NA`) %>%
  # ...and overwrite each marker (column) with its running average.
  mutate(across(everything(), cummean)) %>% ungroup()

Solution 2.2: `data.table`

With data.table, we can improvise our own (anonymous) function

function(x) {
  cumsum(x) / seq_along(x)
}

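which divides the running sum by the running count, to compute the cumulative average along a vector (column). Alternatively, we could load dplyr and use cummean() in place of our function.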

library(data.table)

as.data.table(data_clean)[,
  # Overwrite the 'NA' column with the cluster labels.
  ("NA") := metaClustering_perCell
][,
  # Overwrite each marker (column) with its running average, as grouped by cluster.
  lapply(.SD, function(x)cumsum(x)/seq_along(x)), by = `NA`
]

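Results

Given data_clean and metaClustering_perCell with the same values as the samples in your question, the first result (2.1) will be a tibble and the second (2.2) will be a data.table, but each will contain the following data: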

          NA   EGFP.A CD43.PE.A CD45.PE.Vio770.A CD235a_41a.APC.A APC.Vio770.A
1 Population 326.0000  435.0000         399.0000         412.0000     447.0000
1 Population 320.0000  418.5000         392.0000         515.0000     469.0000
1 Population 327.0000  435.3333         387.6667         423.0000     460.6667
1 Population 245.2500  397.2500         400.2500         457.7500     454.7500
1 Population 235.8000  378.4000         397.0000         498.4000     459.2000
1 Population 245.6667  377.1667         386.0000         447.5000     437.3333
1 Population 257.0000  386.4286         388.2857         435.4286     439.5714
1 Population 263.5000  383.5000         388.7500         429.1250     425.3750
1 Population 278.6667  390.2222         384.8889         426.7778     417.3333
9 Population 328.0000  358.0000         430.0000         265.0000       0.0000
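If neither package is available, a grouped running average can also be sketched in base R with ave(), which applies a function within each group and returns the values in the original row order. To keep the example short, only two of the sample's markers are reproduced; markers, pops, running_mean, and running are illustrative names.

```r
# Two marker columns from the question's sample data.
markers <- data.frame(
  EGFP.A = c(326, 314, 341, 0, 198, 295, 325, 309, 400, 328),
  CD43.PE.A = c(435, 402, 469, 283, 303, 371, 442, 363, 444, 358)
)
pops <- c(rep("1 Population", 9), "9 Population")

# Running average along a vector: running sum over running count.
running_mean <- function(x) cumsum(x) / seq_along(x)

# Apply the running average to every marker, restarting within each population.
running <- as.data.frame(
  lapply(markers, function(col) ave(col, pops, FUN = running_mean))
)
print(cbind(Population = pops, running))
```

The printed values for these two columns should match the corresponding columns of the table above.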
