R: count the number of times per column a condition is met and the row name appears in a list

I have a data frame with count information (df1)

rownames sample1 sample2 sample3
m1 0 5 1
m2 1 7 5
m3 6 2 0
m4 3 1 0

and a second one with sample information (df2)

rownames batch totalcount
sample1 a 10
sample2 b 15
sample3 a 6

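I also have two lists containing information about the m values (they could easily be converted into another data frame if necessary, but I would rather not add them to the count information because it is very large). There is no pattern (such as odd and even); I am just using a very simple example.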

x <- c("m1", "m3")
y <- c("m2", "m4")

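What I would like to do is add two more columns to the sample information: for each sample, the number of m values that are greater than or equal to 5 and appear in list x or in list y.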
rownames batch totalcount x y
sample1 a 10 1 0
sample2 b 15 1 1
sample3 a 6 0 1

My current strategy is to build a list of values for x and for y and then append them to df2. These are my attempts so far:

numX <- colSums(df1[sum(rownames(df1)>10 %in% x),])

returns a list of 0s.

numX <- colSums(df1[rownames(df1)>10 %in% x,])

returns a list with the sum of the count values that meet the condition in each column.

numX <- length(df1[rownames(df1)>10 %in% novel,])

returns the number of times the condition is met (2L in this example).

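I'm not really sure how to approach this, so I have just been trying things. I have looked for answers, but maybe I'm just struggling to find the right wording.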

How about using dplyr and reshape2::melt?

library(dplyr)
library(reshape2)

df3 <- df1 %>%
  melt(id.vars = "rownames") %>%   # long format: rownames, variable (sample), value
  filter(value >= 5) %>%
  mutate(x = as.numeric(rownames %in% c("m1", "m3")),
         y = as.numeric(rownames %in% c("m2", "m4"))) %>%
  select(-rownames, -value) %>%
  group_by(variable) %>%
  summarise(x = sum(x), y = sum(y))

df2 %>% left_join(df3, by = c("rownames" = "variable"))

  rownames batch total_count x y
1  sample1     a          10 1 0
2  sample2     b          15 1 1
3  sample3     a           6 0 1

You can create a named list of the vectors and, for each value of rownames in df2, count how many of the x and y values in the corresponding sample are >= 5.

Base R option -

list_vec <- list(x = x, y = y)

# for each sample (smp), count the values >= 5 among the rows named in each vector (ids)
cbind(df2, do.call(rbind, lapply(df2$rownames, function(smp) 
  sapply(list_vec, function(ids) {
    sum(df1[[smp]][df1$rownames %in% ids] >= 5)
  }))))

#  rownames batch total.count x y
#1  sample1     a          10 1 0
#2  sample2     b          15 1 1
#3  sample3     a           6 0 1
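
The colSums idea from the question can also be made to work if the comparison is applied to the counts rather than to the row names. The following is only a minimal sketch, assuming the sample data from the question and that every value in df2$rownames is a column name of df1; the helper name hits is made up for illustration.

```r
# Sample data, as given in the question
df1 <- data.frame(rownames = c("m1", "m2", "m3", "m4"),
                  sample1 = c(0L, 1L, 6L, 3L),
                  sample2 = c(5L, 7L, 2L, 1L),
                  sample3 = c(1L, 5L, 0L, 0L))
df2 <- data.frame(rownames = c("sample1", "sample2", "sample3"),
                  batch = c("a", "b", "a"),
                  totalcount = c(10L, 15L, 6L))
x <- c("m1", "m3")
y <- c("m2", "m4")

# Logical matrix: TRUE where a count is >= 5.
# Columns are selected in the order of df2$rownames so they line up with df2's rows.
hits <- as.matrix(df1[df2$rownames]) >= 5

# Keep only the rows named in each vector, then count the TRUEs per column
df2$x <- unname(colSums(hits[df1$rownames %in% x, , drop = FALSE]))
df2$y <- unname(colSums(hits[df1$rownames %in% y, , drop = FALSE]))
df2
```

drop = FALSE keeps the subset a matrix even when a vector matches a single row, so colSums still receives a two-dimensional object.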

Using tidyverse -

library(dplyr)
library(purrr)

list_vec <- lst(x, y)

df2 %>%
  bind_cols(map_df(df2$rownames, function(x) 
    map(list_vec, ~sum(df1[[x]][df1$rownames %in% .x] >= 5))))

We can use rowwise

library(dplyr)
df2 %>% 
   rowwise %>%
    # count the values >= 5 (the comparison goes inside sum(), not around it)
    mutate(x = sum(df1[[rownames]][df1$rownames %in% x] >= 5), 
           y = sum(df1[[rownames]][df1$rownames %in% y] >= 5)) %>%
    ungroup

-output

# A tibble: 3 × 5
  rownames batch totalcount     x     y
  <chr>    <chr>      <int> <int> <int>
1 sample1  a             10     1     0
2 sample2  b             15     1     1
3 sample3  a              6     0     1

Or, depending on the data, a base R option would be

# count the values >= 5 per group (x or y) and sample
out <- aggregate(. ~ grp, FUN = function(v) sum(v >= 5), 
     transform(df1,  grp = c('x', 'y')[1 + (rownames %in% y)] )[-1])
df2[out$grp] <- t(out[-1])

-output

> df2
  rownames batch totalcount x y
1  sample1     a         10 1 0
2  sample2     b         15 1 1
3  sample3     a          6 0 1
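
One thing that may be worth checking with any of these approaches: counting the values that are >= 5 is not the same as testing whether the sum of the values is >= 5. The two happen to agree on the sample data but not in general. A small sketch with made-up counts (vals is hypothetical and not part of the question's data):

```r
# Hypothetical counts for one sample: two x-listed values, both >= 5
vals <- c(m1 = 6L, m3 = 7L)

count_ge5 <- sum(vals >= 5)      # number of values that are >= 5
flag_sum  <- +(sum(vals) >= 5)   # 1 if the total is >= 5, otherwise 0

count_ge5
flag_sum
```

Here count_ge5 is 2 while flag_sum is 1, so the position of the comparison relative to sum() decides which question is being answered.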

data

df1 <- structure(list(rownames = c("m1", "m2", "m3", "m4"), sample1 = c(0L, 
1L, 6L, 3L), sample2 = c(5L, 7L, 2L, 1L), sample3 = c(1L, 5L, 
0L, 0L)), class = "data.frame", row.names = c(NA, -4L))

df2 <- structure(list(rownames = c("sample1", "sample2", "sample3"), 
    batch = c("a", "b", "a"), totalcount = c(10L, 15L, 6L)), 
class = "data.frame", row.names = c(NA, 
-3L))