How to order & count a dataframe by three variables in R?

I have a data frame, dftime, that contains many variables; a snapshot of the data looks like this:

| gene  | country | case_month | case_year |
| ----- | ------- | ---------- | --------- |
| gene1 | Senegal | February   | 2020      |
| gene2 | Botswana| January    | 2021      |
| gene3 | Congo   | March      | 2021      |
| gene4 | Guinea  | September  | 2020      |

Here is a reproducible example:

dftime <- structure(list(
    gene = c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6"),
    date = structure(c(18319, 18328, 18320, 18323, 18325, 18324), class = "Date"),
    country = c("Nigeria", "South Africa", "Senegal", "Senegal", "Senegal", "Senegal"),
    case_month = c("February", "March", "February", "March", "March", "March"),
    case_year = c("2020", "2020", "2020", "2020", "2020", "2020")),
    row.names = c(1L, 3L, 22L, 23L, 24L, 25L), class = "data.frame")

I left the date variable in, in case it helps! I derived case_month and case_year from date.
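The question does not show how case_month and case_year were derived, so the following is only a minimal sketch of one way to do it in base R. It uses the dates from the sample data; the name case_month_f is hypothetical and introduced here for illustration.

```r
# The Date values from the question's sample data
date <- structure(c(18319, 18328, 18320, 18323, 18325, 18324), class = "Date")

# Month number -> English month name, independent of the session locale
case_month <- month.name[as.integer(format(date, "%m"))]

# Four-digit year as a character string, matching the question's columns
case_year <- format(date, "%Y")

# Hypothetical extra: a factor with calendar-ordered levels, so that
# sorting or plotting follows January..December rather than the alphabet
case_month_f <- factor(case_month, levels = month.name)
```

Keeping the month as a character column works for counting, but character months sort alphabetically, which is why a factor with month.name levels may be the more convenient form for plotting.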

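There are 38 countries in total, all 12 months are represented, and the only two years are 2020 and 2021. I am trying to arrange this data so that I can get the number of genes for Senegal in January 2020, for Senegal in February 2020, and so on. In other words, I want a count n of all genes for each country in each month of the two years. I would like output like this: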

| country | case_month | case_year | n |
| ------- | ---------- | --------- |---|
| Senegal | January    | 2020      | 4 |
| Senegal | February   | 2020      | 6 |
| Senegal | March      | 2020      | 5 |
| Botswana| January    | 2021      | 1 |
| Congo   | March      | 2021      | 2 |

and so on...

The goal is to use this count to produce a stacked bar chart like this, where n is the new count variable:

dftime_stacked <- ggplot(dftime_ord, aes(fill=country, y= n, x=case_month)) + 
  geom_bar(position="stack", stat="identity")

dftime_stacked + facet_wrap(~ case_year)

I tried to order the data using dplyr and mutate:

dftime_ord <- mutate(dftime, country = reorder(country, -n, sum),
                     case_month = reorder(case_month, -n, sum))

But this throws two errors. The first concerns -n:

Error in -n : invalid argument to unary operator

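The second appears when I remove -n (ordering from largest to smallest is not the most important thing here, since my countries are in alphabetical order anyway):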

Error in tapply(X = X, INDEX = x, FUN = FUN, ...) : 
  arguments must have same length

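All of my variables are character. Is there a reason they cannot be ordered this way in dplyr? Any idea why these errors are thrown? Thanks very much for your help!

You can control the ordering with a data.table solution: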

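# Read the pipe-delimited snapshot from the question
df <- read.table(textConnection(' gene  | country | case_month | case_year 
 gene1 | Senegal | February   | 2020      
 gene2 | Botswana| January    | 2021      
 gene3 | Congo   | March      | 2021      
 gene4 | Guinea  | September  | 2020      '),sep='|',header=TRUE)

library(data.table)

# Convert to a data.table by reference
setDT(df)

# Count rows (.N) per country, year and month
df <- df[,.(n=.N),by=c('country','case_year','case_month')]

# Sort by country, then case_month, both descending (-1)
setorderv(df,c('country','case_month'),c(-1,-1))

Output: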

  country     case_year case_month         n
  <fct>           <dbl> <fct>          <int>
1 " Senegal "      2020 " February   "     1
2 " Guinea  "      2020 " September  "     1
3 " Congo   "      2021 " March      "     1
4 " Botswana"      2021 " January    "     1
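The quoted, space-padded values such as " Senegal " in the output appear because the pipe-delimited text keeps the blanks around each `|`. If trimmed values are wanted, a minimal base R sketch (reading the same four rows, with `strip.white` as the only substantive change) could look like this:

```r
# Same snapshot as above, read with whitespace stripped from each field
df <- read.table(
  text = " gene  | country | case_month | case_year
 gene1 | Senegal | February   | 2020
 gene2 | Botswana| January    | 2021
 gene3 | Congo   | March      | 2021
 gene4 | Guinea  | September  | 2020",
  sep = "|", header = TRUE,
  strip.white = TRUE,       # drop leading/trailing blanks in each field
  stringsAsFactors = FALSE  # keep text columns as character
)

# Country names are now clean strings without padding
print(df$country)
```

With clean strings, later filtering such as `df$country == "Senegal"` matches as expected, which it would not with the padded values.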

Maybe you are looking for this?

library(dplyr)
library(ggplot2)

df %>%
  count(country, case_month, case_year) %>%
  mutate(country = reorder(country, -n, sum)) %>%
  ggplot(aes(fill=country, y= n, x=case_month)) + 
  geom_bar(position="stack", stat="identity")
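As a fuller sketch of the same idea, applied to the question's sample data: the likely cause of both errors (my reading, not confirmed in the thread) is that dftime has no n column until the rows are counted. Inside mutate, `n` then resolves to dplyr's `n` function, so `-n` fails as a unary operation on a function, and without the minus sign tapply receives an object whose length does not match country. Counting first avoids both. The conversion of case_month to a factor with month.name levels is an addition of mine, assumed to be wanted so the months appear in calendar order.

```r
library(dplyr)
library(ggplot2)

# Sample data from the question (the date column is omitted for brevity)
dftime <- data.frame(
  gene       = paste0("gene", 1:6),
  country    = c("Nigeria", "South Africa", "Senegal", "Senegal", "Senegal", "Senegal"),
  case_month = c("February", "March", "February", "March", "March", "March"),
  case_year  = c("2020", "2020", "2020", "2020", "2020", "2020"),
  stringsAsFactors = FALSE
)

# Count first: n only exists as a column after this step
dftime_ord <- dftime %>%
  count(country, case_month, case_year) %>%
  mutate(
    # Calendar order instead of alphabetical order on the x axis
    case_month = factor(case_month, levels = month.name),
    # Countries with the largest total count come first
    country = reorder(country, -n, FUN = sum)
  ) %>%
  arrange(case_year, case_month, country)

# Stacked bars of n per month, one panel per year
dftime_stacked <- ggplot(dftime_ord, aes(x = case_month, y = n, fill = country)) +
  geom_col(position = "stack") +
  facet_wrap(~ case_year)
```

Printing `dftime_stacked` draws the chart. `geom_col()` is used here as the equivalent of `geom_bar(stat = "identity")`.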