How to aggregate data from years to decades and plot them?
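This is the chart I want to reproduce:
To do that I have to change the Year column, because the x-axis of the chart is in decades. How can I do this?
This is what I have done so far to extract the data from the site (https://ourworldindata.org/famines):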
library(rvest)
library(dplyr)
library(tidyr)
library(ggplot2)
col_link <- "https://ourworldindata.org/famines#famines-by-world-region-since-1860"
col_page <- read_html(col_link)
col_table <- col_page %>%
  html_nodes("table#tablepress-73") %>%
  html_table() %>%
  .[[1]]
data1 <- col_table %>%
  select(Year, `Excess Mortality midpoint`)
   Year      `Excess Mortality midpoint`
   <chr>     <chr>
 1 1846–52   1,000,000
 2 1860-1    2,000,000
 3 1863-67   30,000
 4 1866-7    961,043
 5 1868      100,000
 6 1868-70   1,500,000
 7 1870–1871 1,000,000
 8 1876–79   750,000
 9 1876–79   7,176,346
10 1877–79   11,000,000
# ... with 67 more rows
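First, to convert the periods into decades, you need to extract one year for each period and base the calculation on it. From your comment above, it seems you want the end year of each period. Given the data, the code below does this with regular expressions (using the packages dplyr and stringr).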
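col_table <- col_table %>%
  mutate(Year = case_when(
    # single year, e.g. "1868": keep as is
    grepl("^\\d{4}$", Year) ~ Year,
    # four-digit end year, e.g. "1878-1880": take the last four characters
    grepl("\\d{4}[–-]\\d{4}", Year) ~ str_sub(Year, start = -4),
    # two-digit end year, e.g. "1846–52": first two digits + last two digits
    grepl("\\d{4}[–-]\\d{2}$", Year) ~ paste0(str_sub(Year, 1, 2), str_sub(Year, -2)),
    # one-digit end year, e.g. "1860-1": first three digits + last digit
    grepl("\\d{4}[–-]\\d{1}$", Year) ~ paste0(str_sub(Year, 1, 3), str_sub(Year, -1))))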
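This part of the code detects the different cases and extracts the correct year. Below is one example of each case present in the dataset, together with the result this code produces.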
- 1868 -> 1868
- 1878-1880 -> 1880
- 1846–52 -> 1852
- 1860-1 -> 1861
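As a cross-check on those four cases, here is a minimal base R sketch of the same idea. The helper name `end_year()` is made up for illustration and is not part of the answer's code; it assumes every period is either a single year or two parts separated by a hyphen or an en dash.

```r
# Hypothetical helper: return the end year of a period string.
# Assumes the period is "YYYY" or "YYYY<dash>Y...", where <dash> is "-" or an en dash.
end_year <- function(x) {
  parts <- strsplit(x, "[\u{2013}-]")
  vapply(parts, function(p) {
    start <- p[1]
    end <- p[length(p)]
    # Borrow the missing leading digits of a shortened end year from the start year
    prefix <- substr(start, 1, nchar(start) - nchar(end))
    as.integer(paste0(prefix, end))
  }, integer(1))
}

end_year(c("1868", "1878-1880", "1846\u{2013}52", "1860-1"))
# 1868 1880 1852 1861
```

Unlike the `case_when()` version, this does not check the pattern first, so malformed entries would need to be handled separately.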
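Now that we have the year, the next step is to extract the decade. For that, we need to make sure the Year column is numeric and then subtract the remainder of dividing the year by 10: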
col_table <- col_table %>%
mutate(Decade = as.numeric(Year) - as.numeric(Year) %% 10)
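To see what the modulo step does, here is a small worked example on the four end years from the list above (the vector names are only for illustration):

```r
years <- c(1852, 1861, 1868, 1880)

# year %% 10 is the last digit; subtracting it rounds down to the decade
decades <- years - years %% 10
decades
# 1850 1860 1860 1880
```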
To reproduce the plot, we need to group by decade and make sure the Excess Mortality midpoint column is numeric, so that we can sum the victims per decade.
col_table <- col_table %>%
mutate(`Excess Mortality midpoint` = as.numeric(gsub(",", "", `Excess Mortality midpoint`))) %>%
group_by(Decade) %>%
summarize(val = sum(`Excess Mortality midpoint`)) %>%
ungroup()
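A quick way to check this step is to run it on the first five rows of the table shown in the question, with the decades derived from their end years. The sketch below uses base R's `aggregate()` instead of dplyr only to stay self-contained, and `midpoint` is a shortened stand-in for the real column name.

```r
# First five rows of the question's table, decades derived from the end years
toy <- data.frame(
  Decade   = c(1850, 1860, 1860, 1860, 1860),
  midpoint = c("1,000,000", "2,000,000", "30,000", "961,043", "100,000"),
  stringsAsFactors = FALSE
)

# Strip the thousands separators before converting to numeric
toy$midpoint <- as.numeric(gsub(",", "", toy$midpoint))

# Sum of victims per decade
totals <- aggregate(midpoint ~ Decade, data = toy, FUN = sum)
totals
#   Decade midpoint
# 1   1850  1000000
# 2   1860  3091043
```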
For the plot itself, use ggplot2:
ylab <- c(5, 10, 15, 20, 25)
options(scipen=999)
p <- ggplot(data = col_table, aes(x = factor(Decade), y = val)) +
  geom_bar(stat = "identity", fill = "navy") +
  scale_x_discrete(labels = col_table %>% distinct(Decade) %>% mutate(Decade = paste0(Decade, "s")) %>% pull()) +
  geom_text(aes(label = format(val, big.mark = ",")), size = 2, vjust = -0.3) +
  scale_y_continuous(labels = paste(ylab, "millions"), breaks = 10^6 * ylab) +
  ggtitle('Famine victims worldwide') +
  theme(panel.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(size = 0.05, linetype = 'solid',
                                          colour = "black"),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())
p
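The label and break vectors passed to the scales can be inspected on their own, without drawing anything. This sketch uses two illustrative decade totals and `formatC()` rather than `format()`, because `formatC(..., format = "d")` does not depend on the `scipen` option:

```r
decades <- c(1850, 1860)
totals  <- c(1000000, 3091043)   # illustrative totals, not the full dataset

x_labels   <- paste0(decades, "s")                           # x-axis tick labels
bar_labels <- formatC(totals, format = "d", big.mark = ",")  # text above the bars

ylab     <- c(5, 10, 15, 20, 25)
y_breaks <- 10^6 * ylab              # tick positions in raw counts
y_labels <- paste(ylab, "millions")  # what the ticks display

x_labels
# "1850s" "1860s"
bar_labels
# "1,000,000" "3,091,043"
```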
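So, putting everything together, the code below should give you one column with the year and one with the corresponding decade, which are then used to create the plot you want: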
library(rvest)
library(dplyr)
library(stringr)
library(ggplot2)
col_link <- "https://ourworldindata.org/famines#famines-by-world-region-since-1860"
col_page <- read_html(col_link)
col_table <- col_page %>% html_nodes("table#tablepress-73") %>% html_table() %>% .[[1]]
col_table <- col_table %>%
  mutate(Year = case_when(
    grepl("^\\d{4}$", Year) ~ Year,
    grepl("\\d{4}[–-]\\d{4}", Year) ~ str_sub(Year, start = -4),
    grepl("\\d{4}[–-]\\d{2}$", Year) ~ paste0(str_sub(Year, 1, 2), str_sub(Year, -2)),
    grepl("\\d{4}[–-]\\d{1}$", Year) ~ paste0(str_sub(Year, 1, 3), str_sub(Year, -1)))) %>%
  mutate(Decade = as.numeric(Year) - as.numeric(Year) %% 10) %>%
  mutate(`Excess Mortality midpoint` = as.numeric(gsub(",", "", `Excess Mortality midpoint`))) %>%
  group_by(Decade) %>%
  summarize(val = sum(`Excess Mortality midpoint`)) %>%
  ungroup()
ylab <- c(5, 10, 15, 20, 25)
options(scipen=999)
p <- ggplot(data = col_table, aes(x = factor(Decade), y = val)) +
  geom_bar(stat = "identity", fill = "navy") +
  scale_x_discrete(labels = col_table %>% distinct(Decade) %>% mutate(Decade = paste0(Decade, "s")) %>% pull()) +
  geom_text(aes(label = format(val, big.mark = ",")), size = 2, vjust = -0.3) +
  scale_y_continuous(labels = paste(ylab, "millions"), breaks = 10^6 * ylab) +
  ggtitle('Famine victims worldwide') +
  theme(panel.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(size = 0.05, linetype = 'solid',
                                          colour = "black"),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())
p
The reproduced plot looks like this:
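First, strsplit the periods, build a proper matrix of years, divide each famine's victims by its number of years, and reshape into long format (lines 1:6). Next, aggregate the sums by decade and barplot them.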
r <- strsplit(data1$Year, '-|–|, ') |>
  rapply(\(y) unlist(lapply(y, \(x) f(max(as.numeric(y)), x))), how='r') |>
  {\(.) t(sapply(., \(x) `length<-`(x, max(lengths(.)))))}() |>
  {\(.) cbind(`colnames<-`(., paste0('year.', seq_len(dim(.)[2]))),
              n=dim(.)[2] - rowSums(is.na(.)))}() |>
  {\(.) data.frame(., f=as.numeric(gsub('\\D', '',
                                        data1$`Excess Mortality midpoint`))/
                     .[, 'n'])}() |>
  reshape(1:3, direction='long') |>
  stats:::aggregate.formula(formula=f ~ as.integer(substr(year, 1, 3)),
                            FUN=sum) |>
  t()
## plot
op <- par(mar=c(5, 5, 4, 2)+.1)  ## set/store old pars
b <- barplot(r, axes=FALSE, ylim=c(0, max(r[2, ])*1.05),
             main='Famine victims')
asq <- seq(0, max(axTicks(2)), 2e6)  ## y-axis ticks, needed by abline() below
abline(h=asq, col='lightgrey', lty=3)
barplot(r, names.arg=paste0(r[1, ], '0s'), col='#20254c',
        cex.names=.8, axes=FALSE, add=TRUE)
axis(2, asq, labels=FALSE)
mtext(paste(asq/1e6, 'Million'), 2, 1, at=asq, las=2)
text(b, r[2, ] + 5e5, labels=formatC(r[2, ], format='d', big.mark=','), cex=.7)
box()
par(op)  ## restore old pars
In line 2, I use this helper function f() to fill in the pseudo-years; it has to be defined before running the pipeline above:
f <- \(x1, x2, n1=nchar(x1)) {
  u <- lapply(list(x1, x2), as.character)
  s <- c(n1 - nchar(u[[2]]) + 1L, n1)
  as.integer(`substr<-`(u[[1]], s[1], s[2], u[[2]]))
}
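To make the split-then-aggregate idea easier to follow, here is a simplified sketch on three periods taken from the question's table. It is not the pipeline above: it only handles hyphen or en-dash separators, the variable names are illustrative, and it spreads each famine's victims evenly over the years listed in its period (start and end), not over every year in between.

```r
periods <- c("1846\u{2013}52", "1868", "1868-70")
deaths  <- c(1000000, 100000, 1500000)

rows <- do.call(rbind, lapply(seq_along(periods), function(i) {
  p <- strsplit(periods[i], "[\u{2013}-]")[[1]]
  start <- p[1]
  # Complete shortened years ("52" -> 1852) using the digits of the start year
  years <- vapply(p, function(y) {
    as.integer(paste0(substr(start, 1, nchar(start) - nchar(y)), y))
  }, integer(1), USE.NAMES = FALSE)
  # Divide the victims evenly over the listed years
  data.frame(year = years, deaths = deaths[i] / length(years))
}))

rows$decade <- rows$year - rows$year %% 10
by_decade <- aggregate(deaths ~ decade, data = rows, FUN = sum)
by_decade
#   decade deaths
# 1   1840 500000
# 2   1850 500000
# 3   1860 850000
# 4   1870 750000
```

Note how the 1846–52 famine contributes to both the 1840s and the 1850s, which is the main difference from assigning everything to the end year.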
You can refine the aggregation method yourself so that the result looks exactly like the original, but maybe it is better this way :)