Use tapply to generate variance for subsets of data
I have a list of genes. Each gene has 1-3 probes, and each probe has an intensity value. For example:
GENE_ID Probes Intensity
GENE:JGI_V11_100009 GENE:JGI_V11_1000090102 253.479375
GENE:JGI_V11_100009 GENE:JGI_V11_1000090202 712.235625
GENE:JGI_V11_100036 GENE:JGI_V11_1000360103 449.065625
GENE:JGI_V11_100036 GENE:JGI_V11_1000360203 641.341875
GENE:JGI_V11_100036 GENE:JGI_V11_1000360303 1237.07125
GENE:JGI_V11_100044 GENE:JGI_V11_1000440101 456.133125
GENE:JGI_V11_100045 GENE:JGI_V11_1000450101 369.790625
GENE:JGI_V11_100062 GENE:JGI_V11_1000620102 2839.97375
GENE:JGI_V11_100062 GENE:JGI_V11_1000620202 6384.55125
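I want to determine the variance between the probes of each individual gene (so that I end up with one variance value per gene).
I know I should use the tapply() function, but I don't know how to set it up beyond: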
tapply( , , var)
You can do this with data.table or dplyr. This is a classic group_by case:
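# dplyr: add the per-gene variance as a new column (assign the result to keep it)
library(dplyr)
df %>%
  group_by(GENE_ID) %>%
  mutate(new_var = var(Intensity))
# data.table: same computation, with the column added to df by reference
library(data.table)
setDT(df)
df[, new_var := var(Intensity), .(GENE_ID)]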
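In both cases the output is (genes with a single probe get NA, because the variance of one value is undefined):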
GENE_ID Probes Intensity new_var
1: GENE:JGI_V11_100009 GENE:JGI_V11_1000090102 253.4794 105228.6
2: GENE:JGI_V11_100009 GENE:JGI_V11_1000090202 712.2356 105228.6
3: GENE:JGI_V11_100036 GENE:JGI_V11_1000360103 449.0656 168802.8
4: GENE:JGI_V11_100036 GENE:JGI_V11_1000360203 641.3419 168802.8
5: GENE:JGI_V11_100036 GENE:JGI_V11_1000360303 1237.0712 168802.8
6: GENE:JGI_V11_100044 GENE:JGI_V11_1000440101 456.1331 NA
7: GENE:JGI_V11_100045 GENE:JGI_V11_1000450101 369.7906 NA
8: GENE:JGI_V11_100062 GENE:JGI_V11_1000620102 2839.9738 6282014.8
9: GENE:JGI_V11_100062 GENE:JGI_V11_1000620202 6384.5513 6282014.8
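If one row per gene is wanted instead of a repeated column, summarise() can stand in for mutate(). The following is only a sketch: it rebuilds the example data under the name df, and the column name new_var is carried over from the code above.

```r
library(dplyr)

# Example data from the question, rebuilt here so the snippet runs on its own
df <- data.frame(
  GENE_ID = c("GENE:JGI_V11_100009", "GENE:JGI_V11_100009",
              "GENE:JGI_V11_100036", "GENE:JGI_V11_100036",
              "GENE:JGI_V11_100036", "GENE:JGI_V11_100044",
              "GENE:JGI_V11_100045", "GENE:JGI_V11_100062",
              "GENE:JGI_V11_100062"),
  Intensity = c(253.479375, 712.235625, 449.065625, 641.341875, 1237.07125,
                456.133125, 369.790625, 2839.97375, 6384.55125)
)

# summarise() collapses each group to a single row holding its variance
gene_var <- df %>%
  group_by(GENE_ID) %>%
  summarise(new_var = var(Intensity))

gene_var
```

Genes with a single probe still come out as NA here, for the same reason as above.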
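This is a classic case for base R's ave. tapply returns a vector whose length equals the number of unique values of the grouping factor, whereas ave returns the grouped mean (or other aggregate) as a vector of the same length as the data frame/matrix column, with the value repeated within each group: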
gene_df$Probes_var <- ave(gene_df$Intensity, gene_df$GENE_ID, FUN=var)
gene_df
# GENE_ID Probes Intensity Probes_var
# 1 GENE:JGI_V11_100009 GENE:JGI_V11_1000090102 253.4794 105228.6
# 2 GENE:JGI_V11_100009 GENE:JGI_V11_1000090202 712.2356 105228.6
# 3 GENE:JGI_V11_100036 GENE:JGI_V11_1000360103 449.0656 168802.8
# 4 GENE:JGI_V11_100036 GENE:JGI_V11_1000360203 641.3419 168802.8
# 5 GENE:JGI_V11_100036 GENE:JGI_V11_1000360303 1237.0712 168802.8
# 6 GENE:JGI_V11_100044 GENE:JGI_V11_1000440101 456.1331 NA
# 7 GENE:JGI_V11_100045 GENE:JGI_V11_1000450101 369.7906 NA
# 8 GENE:JGI_V11_100062 GENE:JGI_V11_1000620102 2839.9738 6282014.8
# 9 GENE:JGI_V11_100062 GENE:JGI_V11_1000620202 6384.5513 6282014.8
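For the one-value-per-gene result the question originally asked about, the tapply( , , var) skeleton could be filled in roughly as follows. This is a sketch that assumes the data sit in a data frame named gene_df with the columns shown above; the data are rebuilt here so the snippet runs on its own.

```r
# Example data from the question, rebuilt so the snippet is self-contained
gene_df <- data.frame(
  GENE_ID = c("GENE:JGI_V11_100009", "GENE:JGI_V11_100009",
              "GENE:JGI_V11_100036", "GENE:JGI_V11_100036",
              "GENE:JGI_V11_100036", "GENE:JGI_V11_100044",
              "GENE:JGI_V11_100045", "GENE:JGI_V11_100062",
              "GENE:JGI_V11_100062"),
  Intensity = c(253.479375, 712.235625, 449.065625, 641.341875, 1237.07125,
                456.133125, 369.790625, 2839.97375, 6384.55125)
)

# tapply(values, grouping, function): apply var to the intensities of each gene
gene_var <- tapply(gene_df$Intensity, gene_df$GENE_ID, var)

# One entry per gene, named by GENE_ID; single-probe genes give NA
gene_var
```

The first argument is the vector to summarise, the second is the grouping variable, and the third is the function applied to each group.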