How do I extract the 100 columns with the highest variance?
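I have a list object with 4983 rows and 369 columns. Each column is a different sample, and each row is a value for that sample.
Now I need to extract the 100 samples with the highest variance across their rows, but I don't know how to do it.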
Here is an example with only 20 rows and 5 columns that returns the two columns with the highest variability:
# some example data:
dat <- data.frame(var1 = rnorm(n = 20, mean = 1, sd = 4),
                  var2 = rnorm(n = 20, mean = 1, sd = 3),
                  var3 = rnorm(n = 20, mean = 1, sd = 2),
                  var4 = rnorm(n = 20, mean = 1, sd = 8),
                  var5 = rnorm(n = 20, mean = 1, sd = 6))
head(dat)
# calculate variance per column
variances <- apply(X = dat, MARGIN = 2, FUN = var)
# sort variance, grab index of the first 2
sorted <- sort(variances, decreasing = TRUE, index.return = TRUE)$ix[1:2] # replace 2 with 100 ...
# use that to subset the original data
dat.highvariance <- dat[, sorted]
dat.highvariance
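The same steps carry over to data of the size described in the question. The following is a minimal sketch, not the answerer's code: the helper name `top_var_cols` and the simulated `big` data frame are hypothetical, and it assumes the object can be coerced to a data frame of numeric columns. It uses `order()` instead of `sort(..., index.return = TRUE)`, which gives the same indices.

```r
# Hypothetical helper: keep the n columns with the largest variance.
top_var_cols <- function(dat, n) {
  dat <- as.data.frame(dat)                   # accepts a matrix or a list of equal-length vectors
  n <- min(n, ncol(dat))                      # guard against n > number of columns
  variances <- vapply(dat, var, numeric(1))   # one variance per column
  keep <- order(variances, decreasing = TRUE)[seq_len(n)]
  dat[, keep, drop = FALSE]                   # drop = FALSE keeps a data frame even when n = 1
}

# Simulated stand-in for the 4983 x 369 data in the question (assumption).
set.seed(1)
big <- as.data.frame(matrix(rnorm(4983 * 369), nrow = 4983, ncol = 369))
top100 <- top_var_cols(big, 100)
dim(top100)
```

The columns of `top100` come back ordered from highest to lowest variance, so `top100[, 1]` is the most variable sample.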
My code does exactly the same as the answer by "Where's my towel", but is faster because it uses the Rfast package:
# some example data:
dat <- data.frame(var1 = rnorm(n = 20, mean = 1, sd = 4),
                  var2 = rnorm(n = 20, mean = 1, sd = 3),
                  var3 = rnorm(n = 20, mean = 1, sd = 2),
                  var4 = rnorm(n = 20, mean = 1, sd = 8),
                  var5 = rnorm(n = 20, mean = 1, sd = 6))

# base R version, for comparison
MaxVars_R <- function(dat, n) {
  # calculate variance per column
  variances <- apply(X = dat, MARGIN = 2, FUN = var)
  # sort variances, grab the indices of the n largest
  sorted <- sort(variances, decreasing = TRUE, index.return = TRUE)$ix[1:n]
  # use that to subset the original data
  dat[, sorted]
}

# Rfast version
MaxVars <- function(dat, n, parallel = FALSE) {
  x <- Rfast::data.frame.to_matrix(dat)
  variances <- Rfast::colVars(x, parallel = parallel)
  indices <- Rfast::Order(variances, descending = TRUE, partial = n)[1:n]
  dat[, indices]
}

# both versions should select the same columns
all.equal(MaxVars(dat, 2), MaxVars_R(dat, 2))
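To check the speed claim on data of the size in the question, something like the following could be used. This is only a sketch: the names `max_vars_base`, `max_vars_rfast` and `big` are hypothetical, the data are simulated, and the Rfast calls are the same ones used in `MaxVars` above. The Rfast branch is skipped if the package is not installed, and the timings will vary by machine.

```r
# Condensed base R version (hypothetical name).
max_vars_base <- function(dat, n) {
  variances <- apply(dat, 2, var)
  dat[, order(variances, decreasing = TRUE)[seq_len(n)], drop = FALSE]
}

# Condensed Rfast version, using the same calls as MaxVars above.
max_vars_rfast <- function(dat, n) {
  x <- Rfast::data.frame.to_matrix(dat)
  variances <- Rfast::colVars(x)
  idx <- Rfast::Order(variances, descending = TRUE, partial = n)[seq_len(n)]
  dat[, idx, drop = FALSE]
}

# Simulated data with the dimensions from the question (assumption).
set.seed(1)
big <- as.data.frame(matrix(rnorm(4983 * 369), nrow = 4983))

t_base <- system.time(res_base <- max_vars_base(big, 100))

if (requireNamespace("Rfast", quietly = TRUE)) {
  t_rfast <- system.time(res_rfast <- max_vars_rfast(big, 100))
  # elapsed seconds for each version
  print(c(base = t_base[["elapsed"]], Rfast = t_rfast[["elapsed"]]))
  # compare the two selections
  print(all.equal(res_base, res_rfast))
} else {
  print(t_base[["elapsed"]])
}
```

A single `system.time()` call is a rough measure; repeating the runs several times gives a more reliable comparison.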