Calculate rowMeans on a range of columns (variable number)
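I want to calculate rowMeans over a range of columns, but I cannot hard-code the column names (e.g. c(C1,C3)) or a range (e.g. C1:C3), because both the names and the range vary. My df looks like this: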
> df
chr name age MGW.1 MGW.2 MGW.3 HEL.1 HEL.2 HEL.3
1 123 abc 12 10.00 19 18.00 12 13.00 -14
2 234 bvf 24 -13.29 13 -3.02 12 -0.12 24
3 376 bxc 17 -6.95 10 -18.00 15 4.00 -4
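This is only a sample; in reality my columns run MGW.1 ... MGW.196 and so on. I don't want to give the exact colnames or the exact range. Instead, I want to pass the leading letters (the prefix) of the colnames and get the mean of all columns that start with that prefix. Something like: MGW=rowMeans(df[,MGW.*]), HEL=rowMeans(df[,HEL.*])
So my final output should look like this: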
> df
chr name age MGW Hel
1 123 abc 12 10.00 19
2 234 bvf 24 13.29 13
3 376 bxc 17 -6.95 10
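I know these values are not correct; this is just to give you an idea. Secondly, I want to remove from the data frame all rows that contain only NA across the whole row, apart from the first 3 values.
Here is the dput of the sample input: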
> dput(df)
structure(list(chr = c(123L, 234L, 376L), name = structure(1:3, .Label = c("abc",
"bvf", "bxc"), class = "factor"), age = c(12L, 24L, 17L), MGW.1 = c(10,
-13.29, -6.95), MGW.2 = c(19L, 13L, 10L), MGW.3 = c(18, -3.02,
-18), HEL.1 = c(12L, 12L, 15L), HEL.2 = c(13, -0.12, 4), HEL.3 = c(-14L,
24L, -4L)), .Names = c("chr", "name", "age", "MGW.1", "MGW.2",
"MGW.3", "HEL.1", "HEL.2", "HEL.3"), class = "data.frame", row.names = c(NA,
-3L))
First
I think you are looking for this to get the row means:
df$mean.Hel <- rowMeans(df[, grep("^HEL\\.", names(df))])
Then remove the original columns:
df[, grep("^HEL\\.", names(df))] <- NULL
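Since the question asks for this to work without typing each prefix by hand, here is a minimal base R sketch that derives the prefixes from the column names. It assumes the first three columns are identifiers, that every other column is named `PREFIX.number`, and that the prefixes contain no regex metacharacters. The names `id_cols`, `prefixes`, and `result` are only illustrative.

```r
# Sample data from the question
df <- data.frame(
  chr   = c(123L, 234L, 376L),
  name  = c("abc", "bvf", "bxc"),
  age   = c(12L, 24L, 17L),
  MGW.1 = c(10, -13.29, -6.95),
  MGW.2 = c(19, 13, 10),
  MGW.3 = c(18, -3.02, -18),
  HEL.1 = c(12, 12, 15),
  HEL.2 = c(13, -0.12, 4),
  HEL.3 = c(-14, 24, -4)
)

id_cols <- 1:3  # assumption: the first three columns are identifiers

# Strip the trailing ".<number>" to recover the distinct prefixes ("MGW", "HEL")
prefixes <- unique(sub("\\.[0-9]+$", "", names(df)[-id_cols]))

# One vector of row means per prefix; drop = FALSE keeps a data frame
# even if a prefix matches a single column
means <- lapply(setNames(prefixes, prefixes), function(p) {
  cols <- grep(paste0("^", p, "\\.[0-9]+$"), names(df))
  rowMeans(df[, cols, drop = FALSE], na.rm = TRUE)
})

result <- cbind(df[id_cols], as.data.frame(means))
print(result)
```

With 196 `MGW.*` columns the same code should apply unchanged, because the column positions are looked up by pattern rather than written out.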
Second
Remove the rows that contain only NA after the first three columns.
rows.delete <- which(rowSums(!is.na(df)[,4:ncol(df)]) == 0)
df <- df[!(1:nrow(df) %in% rows.delete),]
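The same filter can also be written with a logical vector instead of `which()`. This is only a sketch on a trimmed-down, illustrative version of the data (three rows, three value columns), again assuming the first three columns are identifiers. A logical index avoids the classic trap where `df[-which(cond), ]` returns zero rows when nothing matches.

```r
# Illustrative subset: row 2 is all NA in the value columns, row 3 is mixed
df <- data.frame(
  chr   = c(123L, 999L, 888L),
  name  = c("abc", "zzz", "aaa"),
  age   = c(12L, 21L, 12L),
  MGW.1 = c(10, NA, 10),
  MGW.2 = c(19, NA, NA),
  HEL.1 = c(12, NA, NA)
)

# TRUE for rows with at least one non-NA value outside the identifier columns
keep <- rowSums(!is.na(df[, -(1:3), drop = FALSE])) > 0

cleaned <- df[keep, ]
print(cleaned)  # rows for chr 123 and 888 remain; chr 999 is dropped
```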
Here is an idea that produces the desired output without hard-coding the variable names:
library(dplyr)
library(tidyr)
df %>%
# remove rows where all values are NA except the first 3 columns
filter(rowSums(is.na(.[4:length(.)])) != length(.) - 3) %>%
# gather the data in a tidy format
gather(key, value, -(chr:age)) %>%
# separate the key column into label and num allowing
# to regroup by variables without hardcoding them
separate(key, into = c("label", "num")) %>%
group_by(chr, name, age, label) %>%
# calculate the mean
summarise(mean = mean(value, na.rm = TRUE)) %>%
spread(label, mean)
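I took the liberty of modifying your initial data to show how the logic handles special cases. For example, here we have a row (#4) where all values except the first 3 columns are NAs (per your requirement, this row should be removed) and a row with a mix of NAs and values (#5). In that case, I assumed we would still want a result for MGW, since there is a value in MGW.1: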
# chr name age MGW.1 MGW.2 MGW.3 HEL.1 HEL.2 HEL.3
#1 123 abc 12 10.00 19 18.00 12 13.00 -14
#2 234 bvf 24 -13.29 13 -3.02 12 -0.12 24
#3 376 bxc 17 -6.95 10 -18.00 15 4.00 -4
#4 999 zzz 21 NA NA NA NA NA NA
#5 888 aaa 12 10.00 NA NA NA NA NA
Which gives:
#Source: local data frame [4 x 5]
#Groups: chr, name, age [4]
#
# chr name age HEL MGW
#* <int> <fctr> <int> <dbl> <dbl>
#1 123 abc 12 3.666667 15.666667
#2 234 bvf 24 11.960000 -1.103333
#3 376 bxc 17 5.000000 -4.983333
#4 888 aaa 12 NaN 10.000000
Data
df <- structure(list(chr = c(123L, 234L, 376L, 999L, 888L), name = structure(c(2L,
3L, 4L, 5L, 1L), .Label = c("aaa", "abc", "bvf", "bxc", "zzz"
), class = "factor"), age = c(12L, 24L, 17L, 21L, 12L), MGW.1 = c(10,
-13.29, -6.95, NA, 10), MGW.2 = c(19L, 13L, 10L, NA, NA), MGW.3 = c(18,
-3.02, -18, NA, NA), HEL.1 = c(12L, 12L, 15L, NA, NA), HEL.2 = c(13,
-0.12, 4, NA, NA), HEL.3 = c(-14L, 24L, -4L, NA, NA)), .Names = c("chr",
"name", "age", "MGW.1", "MGW.2", "MGW.3", "HEL.1", "HEL.2", "HEL.3"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
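For readers on a recent tidyr, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. The following is a hedged sketch of the same pipeline using those functions, assuming dplyr >= 1.0.0 and tidyr >= 1.0.0 are installed. The NA-row filter is done in base R before the pipe to keep the example simple, and `has_value` and `out` are illustrative names.

```r
library(dplyr)
library(tidyr)

# Same five rows as the modified data above (value columns written as doubles)
df <- data.frame(
  chr   = c(123L, 234L, 376L, 999L, 888L),
  name  = c("abc", "bvf", "bxc", "zzz", "aaa"),
  age   = c(12L, 24L, 17L, 21L, 12L),
  MGW.1 = c(10, -13.29, -6.95, NA, 10),
  MGW.2 = c(19, 13, 10, NA, NA),
  MGW.3 = c(18, -3.02, -18, NA, NA),
  HEL.1 = c(12, 12, 15, NA, NA),
  HEL.2 = c(13, -0.12, 4, NA, NA),
  HEL.3 = c(-14, 24, -4, NA, NA)
)

# Keep rows that have at least one non-NA value after the first 3 columns
has_value <- rowSums(!is.na(df[, -(1:3), drop = FALSE])) > 0

out <- df[has_value, ] %>%
  # split "MGW.1" into label = "MGW" and num = "1" while reshaping to long form
  pivot_longer(cols = -(chr:age),
               names_to = c("label", "num"),
               names_sep = "\\.",
               values_to = "value") %>%
  group_by(chr, name, age, label) %>%
  summarise(mean = mean(value, na.rm = TRUE), .groups = "drop") %>%
  # back to one column per label
  pivot_wider(names_from = label, values_from = mean)

print(out)
```

As in the output shown earlier, the row for chr 888 keeps `NaN` for HEL because all of its HEL values are NA and `na.rm = TRUE` leaves nothing to average.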