How to keep data frame format when applying sapply in R?
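I wrote a function that splits my data frame into groups of 3 columns each (representing sample replicates) and applies another function to each group. The latter replaces all values in a row of the group with NA unless at least two of the three replicates are above a specific threshold, in this case 16.
Here is the example code: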
# Install and load packages
if (!require(plyr)) install.packages('plyr')
library(plyr)
if (!require(dplyr)) install.packages('dplyr')
library(dplyr)
# Create example data frame
df <- data.frame(ID = c('data1', 'data2', 'data3'),
                 sample1 = c(2, 18, 3),
                 sample2 = c(4, 17, 16),
                 sample3 = c(3, 11, 2),
                 sample4 = c(22, 11, 35),
                 sample5 = c(10, 8, 22),
                 sample6 = c(17, 9, 11))
# Function for threshold settings
setThreshold <- function(df) {
  thresholded_replicates <- data.frame(
    sapply(split.default(df[2:ncol(df)],
                         rep(seq_along(df),
                             each = 3,
                             length.out = ncol(df)-1)
    ), function(df) {
      df <- df %>%
        mutate(rowsum = apply(df, 1, function(x) sum(x > 16))) %>%
        mutate_at(1:ncol(df), funs(ifelse(rowsum < 2, NA, .))) %>%
        select(-rowsum)
      return(df)
    }
    ))
  return(thresholded_replicates)
}
df_th <- setThreshold(df)
The input data frame looks like this:
> df
     ID sample1 sample2 sample3 sample4 sample5 sample6
1 data1       2       4       3      22      10      17
2 data2      18      17      11      11       8       9
3 data3       3      16       2      35      22      11
Below is the data frame after applying the function:
> df_th
                X1         X2
sample1 NA, 18, NA 22, NA, 35
sample2 NA, 17, NA 10, NA, 22
sample3 NA, 11, NA 17, NA, 11
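The thresholding itself works: within each replicate group, every row that does not contain at least two values greater than 16 has all of its values replaced with NA. However, the format of the data frame gets messed up. The resulting data frame should look like this: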
  sample1 sample2 sample3 sample4 sample5 sample6
1      NA      NA      NA      22      10      17
2      18      17      11      NA      NA      NA
3      NA      NA      NA      35      22      11
How can I achieve this?
Here is a complete base R version, in which we use lapply and rowSums to set the rows to NA.
do.call(cbind, lapply(split.default(df[2:ncol(df)],
                                    rep(seq_along(df), each = 3,
                                        length.out = ncol(df)-1)),
                      function(x) {
                        # Set rows with fewer than two values above 16 to NA
                        x[rowSums(x > 16) < 2, ] <- NA
                        x
                      }))
#  1.sample1 1.sample2 1.sample3 2.sample4 2.sample5 2.sample6
#1        NA        NA        NA        22        10        17
#2        18        17        11        NA        NA        NA
#3        NA        NA        NA        35        22        11
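As for why the original attempt loses its shape: sapply tries to simplify its result, and because each returned data frame is a list of three columns, the pieces are folded into a 3 x 2 list matrix whose cells each hold a whole column vector. lapply skips that simplification, so the pieces stay data frames and can be bound back together. The one-liner above could be wrapped up roughly as follows. This is only a sketch: the function name set_threshold and the parameters threshold, min_hits and group_size are hypothetical names introduced here, and it assumes the first column is the ID column. Calling unname() on the list before cbind is one way to avoid the "1." / "2." prefixes in the column names.

```r
# Example data from the question
df <- data.frame(ID = c("data1", "data2", "data3"),
                 sample1 = c(2, 18, 3),
                 sample2 = c(4, 17, 16),
                 sample3 = c(3, 11, 2),
                 sample4 = c(22, 11, 35),
                 sample5 = c(10, 8, 22),
                 sample6 = c(17, 9, 11))

# Hypothetical helper: the name and the parameters `threshold`, `min_hits`
# and `group_size` are illustrative, not part of the original answer.
set_threshold <- function(df, threshold = 16, min_hits = 2, group_size = 3) {
  values <- df[-1]  # assumes the first column is the ID column
  # Group index per column: 1 1 1 2 2 2 ... for group_size = 3
  groups <- rep(seq_len(ceiling(ncol(values) / group_size)),
                each = group_size, length.out = ncol(values))
  pieces <- lapply(split.default(values, groups), function(x) {
    # Set rows with fewer than `min_hits` values above `threshold` to NA
    x[rowSums(x > threshold) < min_hits, ] <- NA
    x
  })
  # unname() drops the list names so cbind does not prefix the column names
  do.call(cbind, unname(pieces))
}

df_th <- set_threshold(df)
```

With the example data, df_th should keep the original column names sample1 to sample6. If the ID column is needed, it can be added back with cbind(df["ID"], df_th).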