How to order and count a data frame by three variables in R?
I have a data frame, dftime, which contains many variables, but a snapshot of the data looks like this:
| gene | country | case_month | case_year |
| ----- | ------- | ---------- | --------- |
| gene1 | Senegal | February | 2020 |
| gene2 | Botswana| January | 2021 |
| gene3 | Congo | March | 2021 |
| gene4 | Guinea | September | 2020 |
Here is a reproducible sample:
structure(list(gene = c("gene1", "gene2",
"gene3", "gene4", "gene5",
"gene6"), date = structure(c(18319, 18328,
18320, 18323, 18325, 18324), class = "Date"), country = c("Nigeria",
"South Africa", "Senegal", "Senegal", "Senegal", "Senegal"),
case_month = c("February", "March", "February", "March",
"March", "March"), case_year = c("2020", "2020", "2020",
"2020", "2020", "2020")), row.names = c(1L, 3L, 22L, 23L,
24L, 25L), class = "data.frame")
I left the date variable in, in case it helps! I derived case_month and case_year from date.
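For reference, here is one way such columns could be derived from date in base R. This is only a sketch and not necessarily how the columns above were produced; indexing into the built-in `month.name` constant is an assumption made here to avoid the locale-dependent output of `format(date, "%B")`.

```r
# Rebuild the gene and date columns from the reproducible sample
dftime <- data.frame(
  gene = paste0("gene", 1:6),
  date = as.Date(c(18319, 18328, 18320, 18323, 18325, 18324),
                 origin = "1970-01-01")
)

# "%m" gives the month number as text; month.name maps it to an English name
dftime$case_month <- month.name[as.integer(format(dftime$date, "%m"))]

# "%Y" gives the four-digit year as a character string
dftime$case_year <- format(dftime$date, "%Y")

print(dftime)
```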
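There are 38 countries in total, all 12 months are represented, and the only two years are 2020 and 2021. I'm trying to order the data so that I can get the number of genes for Senegal in January 2020, Senegal in February 2020, and so on, giving me a count n of all genes for each country in each month of both years. I'd like output like this: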
| country | case_month | case_year | n |
| ------- | ---------- | --------- |---|
| Senegal | January | 2020 | 4 |
| Senegal | February | 2020 | 6 |
| Senegal | March | 2020 | 5 |
| Botswana| January | 2021 | 1 |
| Congo | March | 2021 | 2 |
And so on...
The goal is to use this count to produce a stacked bar chart like this, where n is the new count variable:
dftime_stacked <- ggplot(dftime_ord, aes(fill = country, y = n, x = case_month)) +
  geom_bar(position = "stack", stat = "identity")
dftime_stacked + facet_wrap(~ case_year)
I tried to order the data using dplyr and mutate:
dftime_ord <- mutate(dftime, country = reorder(country, -n, sum),
case_month = reorder(case_month, -n, sum))
But this throws two errors. The first is on -n, and says:
Error in -n : invalid argument to unary operator
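The second appears when I take the minus sign out (ordering from largest to smallest isn't essential here, since my countries are in alphabetical order anyway):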
Error in tapply(X = X, INDEX = x, FUN = FUN, ...) :
arguments must have same length
All of my variables are character. Is there a reason they can't be ordered this way in dplyr? Any idea why this throws these errors? Many thanks for any help!
You can manipulate the order with a data.table solution:
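# Rebuild the snapshot from the question; field values keep their surrounding
# spaces (see the output) because strip.white is not set
df <- read.table(textConnection(' gene | country | case_month | case_year
gene1 | Senegal | February | 2020
gene2 | Botswana| January | 2021
gene3 | Congo | March | 2021
gene4 | Guinea | September | 2020 '), sep = '|', header = TRUE)
library(data.table)
setDT(df)  # convert to a data.table by reference
# Count rows (genes) per country, year, and month
df <- df[, .(n = .N), by = c('country', 'case_year', 'case_month')]
# Sort by country, then month, both in descending order
setorderv(df, c('country', 'case_month'), c(-1, -1))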
Output:
country case_year case_month n
<fct> <dbl> <fct> <int>
1 " Senegal " 2020 " February " 1
2 " Guinea " 2020 " September " 1
3 " Congo " 2021 " March " 1
4 " Botswana" 2021 " January " 1
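Note that sorting a character case_month puts the months in alphabetical rather than calendar order. A minimal sketch of one way around that follows. It assumes calendar order is wanted and that the month names are in English, and it uses the reproducible data from the question; the name `counts` is illustrative.

```r
library(data.table)

# Reproducible sample from the question (date column omitted for brevity)
dftime <- data.table(
  gene = paste0("gene", 1:6),
  country = c("Nigeria", "South Africa", "Senegal", "Senegal", "Senegal", "Senegal"),
  case_month = c("February", "March", "February", "March", "March", "March"),
  case_year = rep("2020", 6)
)

# month.name is a base R constant holding the English month names in
# calendar order, so the factor levels run January, February, ...
dftime[, case_month := factor(case_month, levels = month.name)]

# Count genes per country, year, and month
counts <- dftime[, .(n = .N), by = .(country, case_year, case_month)]

# Factor columns sort by level order, so months now come out chronologically
setorder(counts, country, case_year, case_month)
print(counts)
```

Because case_month is now a factor, ggplot2 should also place the months along the x axis in level order rather than alphabetically.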
Maybe you are looking for this?
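library(dplyr)
library(ggplot2)
df %>%
  # one row per country/month/year combination, with the row count in n
  count(country, case_month, case_year) %>%
  # order countries from largest to smallest total count
  mutate(country = reorder(country, -n, sum)) %>%
  ggplot(aes(fill = country, y = n, x = case_month)) +
  geom_bar(position = "stack", stat = "identity")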
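The count() step is what makes reorder() work here. In the question's original call, dftime has no column named n, so inside mutate() the name n most likely resolves to the function dplyr::n. Negating a function gives the "invalid argument to unary operator" error. Passing it to reorder() unnegated hands tapply() a single object alongside a full-length country vector, hence "arguments must have same length". A minimal sketch using the reproducible data from the question (the names `err`, `counts` and `counts_ord` are illustrative):

```r
library(dplyr)

# Reproducible sample from the question (date column omitted for brevity)
dftime <- data.frame(
  gene = paste0("gene", 1:6),
  country = c("Nigeria", "South Africa", "Senegal", "Senegal", "Senegal", "Senegal"),
  case_month = c("February", "March", "February", "March", "March", "March"),
  case_year = rep("2020", 6),
  stringsAsFactors = FALSE
)

# Before counting there is no n column, so this call is expected to fail;
# tryCatch() captures the error message instead of stopping the script
err <- tryCatch(
  mutate(dftime, country = reorder(country, -n, sum)),
  error = function(e) conditionMessage(e)
)

# count() adds the n column: one row per country/month/year combination
counts <- count(dftime, country, case_month, case_year)

# Now n is a real column, so reorder() can rank countries by total count
counts_ord <- mutate(counts, country = reorder(country, -n, sum))

print(counts_ord)
levels(counts_ord$country)
```

With the real data, `counts_ord` can then be passed to the ggplot() call from the question, including the facet_wrap(~ case_year) step.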