Calculate frequency for group-rows within data frame

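I'm looking for a way to transform my data frame into the format shown in out below. Basically, for each species I want to show how frequently it occurs in each group (Freq), where the frequency is the proportion of samples in that group with a count > 0. If I have 10 samples and 3 of them have a count > 0, the ratio is 0.3. In addition, I would like a column containing the absolute number of samples with a count > 0.

I tried dplyr::mutate, which I think should be useful here.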

> df
         sample1 sample2 sample3 sample4
Species1       2      12      52     221
Species2       0      13       0       0
Species3       5       0       0      25
Species4       0       0       0       0
Group        Gr1     Gr1     Gr2     Gr2


> dput(df)
structure(list(sample1 = c("2", "0", "5", "0", "Gr1"), sample2 = c("12", 
"13", "0", "0", "Gr1"), sample3 = c("52", "0", "0", "0", "Gr2"
), sample4 = c("221", "0", "25", "0", "Gr2")), class = "data.frame", row.names = c("Species1", 
"Species2", "Species3", "Species4", "Group"))





 out

Species Group Freq Absolute
Species1 Gr1 1 2
Species1 Gr2 1 2
Species2 Gr1 0.5 1
Species2 Gr2 0 0
Species3 Gr1 0.5 1
Species3 Gr2 0.5 1
Species4 Gr1 0 0
Species4 Gr2 0 0
 

I'm not sure I understand your question, but here is some code that reshapes the data into a format that is easier to work with.

library(tidyverse)
df %>%
  rownames_to_column("Species") %>%
  pivot_longer(-Species) %>%
  group_by(name) %>%
  mutate(Group = last(value)) %>%
  filter(Species != "Group") %>%
  mutate(value = as.numeric(value)) %>%
  ungroup()

This produces:

# A tibble: 16 x 4
   Species  name    value Group
   <chr>    <chr>   <dbl> <chr>
 1 Species1 sample1     2 Gr1  
 2 Species1 sample2    12 Gr1  
 3 Species1 sample3    52 Gr2  
 4 Species1 sample4   221 Gr2  
 5 Species2 sample1     0 Gr1  
 6 Species2 sample2    13 Gr1  
 7 Species2 sample3     0 Gr2  
 8 Species2 sample4     0 Gr2  
 9 Species3 sample1     5 Gr1  
10 Species3 sample2     0 Gr1  
11 Species3 sample3     0 Gr2  
12 Species3 sample4    25 Gr2  
13 Species4 sample1     0 Gr1  
14 Species4 sample2     0 Gr1  
15 Species4 sample3     0 Gr2  
16 Species4 sample4     0 Gr2  

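Could you describe the logic you want to apply in more detail? I can get the output you describe with the code below, but possibly only by coincidence, and I'm not sure it matches how you want it to work. Should the "/2" change depending on the number of Groups in the data?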

[code above] %>% 
  count(Species, Group, wt = value > 0, name = "Absolute") %>%
  mutate(Freq = Absolute / 2)


# A tibble: 8 x 4
  Species  Group Absolute  Freq
  <chr>    <chr>    <int> <dbl>
1 Species1 Gr1          2   1  
2 Species1 Gr2          2   1  
3 Species2 Gr1          1   0.5
4 Species2 Gr2          0   0  
5 Species3 Gr1          1   0.5
6 Species3 Gr2          1   0.5
7 Species4 Gr1          0   0  
8 Species4 Gr2          0   0  
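
On the "/2" question: below is a minimal sketch that avoids hardcoding the divisor, assuming Freq is meant to be the share of samples within each Species/Group combination that have a count above zero. The object names long and res are illustrative and not part of the question. Taking mean() of a logical vector gives that share directly, so the divisor follows the number of samples in each group.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Data exactly as given by dput(df) in the question
df <- structure(list(sample1 = c("2", "0", "5", "0", "Gr1"),
                     sample2 = c("12", "13", "0", "0", "Gr1"),
                     sample3 = c("52", "0", "0", "0", "Gr2"),
                     sample4 = c("221", "0", "25", "0", "Gr2")),
                class = "data.frame",
                row.names = c("Species1", "Species2", "Species3",
                              "Species4", "Group"))

# Same reshaping as above: one row per Species/sample pair,
# with the group label copied from the "Group" row of each sample
long <- df %>%
  rownames_to_column("Species") %>%
  pivot_longer(-Species) %>%
  group_by(name) %>%
  mutate(Group = last(value)) %>%
  ungroup() %>%
  filter(Species != "Group") %>%
  mutate(value = as.numeric(value))

# Assumption: Freq = share of samples in the group with count > 0
res <- long %>%
  group_by(Species, Group) %>%
  summarise(Freq = mean(value > 0),      # proportion of positive counts
            Absolute = sum(value > 0),   # number of positive counts
            .groups = "drop")

res
```

With two samples per group, as in the example data, this should give the same numbers as dividing by 2, but it would also adapt if the groups had different sizes.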

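The problem here is that, although df is technically a data frame, it is not well structured. A data frame should have one column per variable and one row per observation. It makes more sense to transpose your data first: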

library(tibble)
library(dplyr)

df <- rownames_to_column(as.data.frame(t(df)), "sample")

df[2:5] <- lapply(df[2:5], as.numeric)

df

#>    sample Species1 Species2 Species3 Species4 Group
#> 1 sample1        2        0        5        0   Gr1
#> 2 sample2       12       13        0        0   Gr1
#> 3 sample3       52        0        0        0   Gr2
#> 4 sample4      221        0       25        0   Gr2

Now we can pivot the Species columns into a single column of their own, and the required calculation becomes straightforward:

tidyr::pivot_longer(df, 2:5) %>%
  group_by(name, Group) %>%
  summarise(absolute = sum(value > 0),
            Freq = absolute / length(name))

#> # A tibble: 8 x 4
#> # Groups:   name [4]
#>   name     Group absolute  Freq
#>   <chr>    <chr>    <int> <dbl>
#> 1 Species1 Gr1          2   1  
#> 2 Species1 Gr2          2   1  
#> 3 Species2 Gr1          1   0.5
#> 4 Species2 Gr2          0   0  
#> 5 Species3 Gr1          1   0.5
#> 6 Species3 Gr2          1   0.5
#> 7 Species4 Gr1          0   0  
#> 8 Species4 Gr2          0   0  

An option with data.table:

library(data.table)
melt(type.convert(data.table::transpose(setDT(df, 
   keep.rownames = TRUE), make.names = 'rn'), as.is = TRUE),
    id.var = 'Group', variable.name = 'Species')[, 
   .(Absolute = sum(value > 0)), .(Group, Species)][, Freq := Absolute/2][]

   Group  Species Absolute Freq
1:   Gr1 Species1        2  1.0
2:   Gr2 Species1        2  1.0
3:   Gr1 Species2        1  0.5
4:   Gr2 Species2        0  0.0
5:   Gr1 Species3        1  0.5
6:   Gr2 Species3        1  0.5
7:   Gr1 Species4        0  0.0
8:   Gr2 Species4        0  0.0
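
The same idea can be spelled out step by step in data.table without the hardcoded 2. This is a sketch under the assumption that Freq is the proportion of samples per Species/Group with a count above zero. The names dt, long, res and the value column name count are illustrative choices, not taken from the answer above.

```r
library(data.table)

# Data exactly as given by dput(df) in the question
df <- structure(list(sample1 = c("2", "0", "5", "0", "Gr1"),
                     sample2 = c("12", "13", "0", "0", "Gr1"),
                     sample3 = c("52", "0", "0", "0", "Gr2"),
                     sample4 = c("221", "0", "25", "0", "Gr2")),
                class = "data.frame",
                row.names = c("Species1", "Species2", "Species3",
                              "Species4", "Group"))

# Transpose so that each sample is a row; sample names land in column "rn"
dt <- as.data.table(t(df), keep.rownames = TRUE)

# Long format: one row per sample/Species pair
long <- melt(dt, id.vars = c("rn", "Group"),
             variable.name = "Species", value.name = "count")
long[, count := as.numeric(count)]

# Assumption: Freq = share of samples in the group with count > 0
res <- long[, .(Absolute = sum(count > 0),
                Freq = mean(count > 0)),
            by = .(Species, Group)]

res
```

Unlike the one-liner above, this version copies the transposed data rather than calling setDT() on df, so the original data frame is left unchanged.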