How to aggregate data from years to decades and plot them?

This is the chart I am trying to reproduce:

To do that I have to change the Year column, because the x-axis of the chart is in decades. How can I do this?

This is what I have done so far to extract the data from the site (https://ourworldindata.org/famines):

library(rvest)
library(dplyr)
library(tidyr)
library(ggplot2)
col_link <- "https://ourworldindata.org/famines#famines-by-world-region-since-1860"
col_page <- read_html(col_link)
col_table <- col_page %>% html_nodes("table#tablepress-73") %>% 
  html_table() %>% . [[1]]
data1 <- col_table %>%
  select(Year, `Excess Mortality midpoint`)
 Year      `Excess Mortality midpoint`
   <chr>     <chr>                      
 1 1846–52   1,000,000                  
 2 1860-1    2,000,000                  
 3 1863-67   30,000                     
 4 1866-7    961,043                    
 5 1868      100,000                    
 6 1868-70   1,500,000                  
 7 1870–1871 1,000,000                  
 8 1876–79   750,000                    
 9 1876–79   7,176,346                  
10 1877–79   11,000,000                 
# ... with 67 more rows

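First, to convert the periods into decades, you need to extract a single year from each period and base the calculation on it. From your comment above, it looks like you want the end year of each period. Given the data, the code below does this with regular expressions (along with the packages dplyr and stringr).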

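col_table <- col_table %>% 
  mutate(Year = case_when(
    # a single year, e.g. "1868"
    grepl("^\\d{4}$", Year) ~ Year,
    # full end year, e.g. "1870–1871": keep the last four characters
    grepl("\\d{4}[–-]\\d{4}", Year) ~ str_sub(Year, start = -4),
    # two-digit end year, e.g. "1846–52": century prefix + last two digits
    grepl("\\d{4}[–-]\\d{2}$", Year) ~ paste0(str_sub(Year, 1, 2), str_sub(Year, -2)),
    # one-digit end year, e.g. "1860-1": first three digits + last digit
    grepl("\\d{4}[–-]\\d{1}$", Year) ~ paste0(str_sub(Year, 1, 3), str_sub(Year, -1))))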

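This part of the code detects the different cases and extracts the correct year. Below is an example of every case present in the dataset, together with the result the code produces.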

  • 1868 -> 1868
  • 1878-1880 -> 1880
  • 1846–52 -> 1852
  • 1860-1 -> 1861
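As a quick check of the four cases listed above, here is a minimal base R sketch. The helper `end_year()` is a hypothetical name introduced only for illustration; it is not part of the answer's code and uses digit runs rather than `case_when()`, so it is an alternative way of reaching the same results under the assumption that every period starts with a four-digit year.

```r
# Hypothetical helper (not from the answer): return the end year of a period string.
end_year <- function(period) {
  # Pull out every run of digits, e.g. "1846-52" -> c("1846", "52")
  parts <- regmatches(period, gregexpr("[0-9]+", period))[[1]]
  start <- parts[1]
  end <- parts[length(parts)]
  # Borrow from the start year the leading digits that the end year leaves out
  prefix <- substr(start, 1, nchar(start) - nchar(end))
  as.integer(paste0(prefix, end))
}

# "\u2013" is the en dash that appears in some rows of the table
periods <- c("1868", "1878-1880", "1846\u201352", "1860-1")
end_years <- vapply(periods, end_year, integer(1), USE.NAMES = FALSE)
print(end_years)  # 1868 1880 1852 1861
```

Because it only looks at digit runs, this sketch does not care whether the separator is a hyphen or an en dash.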

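Now that we have the year, the next step is to derive the decade. To do that, we need to make sure the Year column is numeric and then round each year down to the start of its decade by subtracting the remainder of division by 10: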

  col_table <- col_table %>% 
      mutate(Decade = as.numeric(Year) - as.numeric(Year) %% 10)
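The arithmetic can be checked on a few standalone values. This is only a sketch; the vector `years` holds the end years from the examples above, and the variable names are chosen for illustration.

```r
# End years taken from the worked examples above
years <- c(1852, 1861, 1868, 1880)

# Subtracting the remainder of division by 10 rounds each year down to its decade
decades <- years - years %% 10
print(decades)  # 1850 1860 1860 1880

# Equivalent formulation: integer-divide by 10, then scale back up
decades_alt <- (years %/% 10) * 10
```

Note that a year ending in 0, such as 1880, stays in its own decade.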

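To reproduce the plot, we need to group by decade and make sure the `Excess Mortality midpoint` column is numeric, so that we can sum the victims for each decade.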

col_table <- col_table %>% 
  mutate(`Excess Mortality midpoint` = as.numeric(gsub(",", "", `Excess Mortality midpoint`))) %>% 
  group_by(Decade) %>% 
  summarize(val = sum(`Excess Mortality midpoint`)) %>% 
  ungroup()
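To see what this step does without scraping the page, here is a small sketch on the first five rows printed in the question. It uses base R's `aggregate()` instead of dplyr so that it runs without extra packages; the data frame `toy` and the column name `mortality` are names assumed for the example, and the totals cover only these five rows, not the full table.

```r
# First five rows from the question, with the decade derived from the end year
toy <- data.frame(
  Decade    = c(1850, 1860, 1860, 1860, 1860),
  mortality = c("1,000,000", "2,000,000", "30,000", "961,043", "100,000"),
  stringsAsFactors = FALSE
)

# Strip the thousands separators before converting, otherwise as.numeric() gives NA
toy$mortality <- as.numeric(gsub(",", "", toy$mortality))

# One row per decade, with the summed mortality
totals <- aggregate(mortality ~ Decade, data = toy, FUN = sum)
print(totals)
```

The dplyr version above does the same thing with `group_by()` and `summarize()`.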

For the plot itself, use ggplot2:

ylab <- c(5, 10, 15, 20, 25)
options(scipen=999)
p <- ggplot(data = col_table, aes(x=factor(Decade),y=val)) + 
   geom_bar(stat = "identity", fill = "navy") + 
   scale_x_discrete(labels = col_table %>% distinct(Decade) %>% mutate(Decade = paste0(Decade,"s")) %>% pull()) +
   geom_text(aes(label=format(val,big.mark=",")), size=2,vjust=-0.3) + 
   scale_y_continuous(labels = paste(ylab, "millions"),breaks = 10^6 * ylab) +
   ggtitle('Famine victims worldwide')+
   theme(panel.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(size = 0.05, linetype = 'solid',
                                        colour = "black"),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())
p

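Putting it all together, the code below derives the end year and the corresponding decade for each row, aggregates by decade, and then uses the result to create the plot you want: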

library(rvest)
library(dplyr)
library(stringr)
library(ggplot2)

col_link <- "https://ourworldindata.org/famines#famines-by-world-region-since-1860"
col_page <- read_html(col_link)
col_table <- col_page %>% html_nodes("table#tablepress-73") %>% html_table() %>% . [[1]]

col_table <- col_table %>% 
  mutate(Year = case_when(
    grepl("^\\d{4}$", Year) ~ Year,
    grepl("\\d{4}[–-]\\d{4}", Year) ~ str_sub(Year, start = -4),
    grepl("\\d{4}[–-]\\d{2}$", Year) ~ paste0(str_sub(Year, 1, 2), str_sub(Year, -2)),
    grepl("\\d{4}[–-]\\d{1}$", Year) ~ paste0(str_sub(Year, 1, 3), str_sub(Year, -1)))) %>%
  mutate(Decade = as.numeric(Year) - as.numeric(Year)%%10) %>% 
  mutate(`Excess Mortality midpoint` = as.numeric(gsub(",", "", `Excess Mortality midpoint`))) %>% 
  group_by(Decade) %>% 
  summarize(val = sum(`Excess Mortality midpoint`)) %>% 
  ungroup()

ylab <- c(5, 10, 15, 20, 25)
options(scipen=999)
p <- ggplot(data = col_table, aes(x=factor(Decade),y=val)) + 
   geom_bar(stat = "identity", fill = "navy") + 
   scale_x_discrete(labels = col_table %>% distinct(Decade) %>% mutate(Decade = paste0(Decade,"s")) %>% pull()) +
   geom_text(aes(label=format(val,big.mark=",")), size=2,vjust=-0.3) + 
   scale_y_continuous(labels = paste(ylab, "millions"),breaks = 10^6 * ylab) +
   ggtitle('Famine victims worldwide')+
   theme(panel.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(size = 0.05, linetype = 'solid',
                                        colour = "black"),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())
p

The reproduced plot looks like this:

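First, strsplit the Year column, build a proper matrix of years, divide each famine's deaths by its number of listed years, and reshape the result into long format (lines 1:6). Next, aggregate the sums by decade and barplot them.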

r <- strsplit(data1$Year, '-|–|, ') |>
  rapply(\(y) unlist(lapply(y, \(x) f(max(as.numeric(y)), x))), how='r') |>
  {\(.) t(sapply(., \(x) `length<-`(x, max(lengths(.)))))}() |>
  {\(.) cbind(`colnames<-`(., paste0('year.', seq_len(dim(.)[2]))),
         n=dim(.)[2] - rowSums(is.na(.)))}() |>
  {\(.) data.frame(., f=as.numeric(gsub('\\D', '', 
                                        data1$`Excess Mortality midpoint`))/
               .[, 'n'])}()|>
  reshape(1:3, direction='long') |>
  stats:::aggregate.formula(formula=f ~ as.integer(substr(year, 1, 3)), 
                            FUN=sum) |>
  t() 


## plot
op <- par(mar=c(5, 5, 4, 2)+.1)  ## set/store old pars

b <- barplot(r, axes=FALSE, ylim=c(0, max(r[2, ])*1.05),
        main='Famine victims')
asq <- seq(0, max(axTicks(2)), 2e6)  ## y ticks; needs the plot above and must exist before abline()
abline(h=asq, col='lightgrey', lty=3)
barplot(r, names.arg=paste0(r[1, ], '0s'), col='#20254c',
        cex.names=.8, axes=FALSE, add=TRUE)
axis(2, asq, labels=FALSE)
mtext(paste(asq/1e6, 'Million'), 2, 1, at=asq, las=2)
text(b, r[2, ] + 5e5, labels=formatC(r[2, ], format='d', big.mark=','), cex=.7)
box()

par(op)  ## restore old pars

In line 2, I use this helper function f() to fill in the pseudo-years:

f <- \(x1, x2, n1=nchar(x1)) {
  u <- lapply(list(x1, x2), as.character)
  s <- c(n1 - nchar(u[[2]]) + 1L, n1)
  as.integer(`substr<-`(u[[1]], s[1], s[2], u[[2]]))
}
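To make the helper's behaviour concrete, here is a short sketch that calls it on values matching the examples in the question. The definition is repeated with `function` instead of the `\(x)` shorthand so the snippet also runs on R versions before 4.1; the result variable names are assumptions made for the example.

```r
# Same logic as f() above: overwrite the last nchar(x2) characters of x1 with x2
f <- function(x1, x2, n1 = nchar(x1)) {
  u <- lapply(list(x1, x2), as.character)
  s <- c(n1 - nchar(u[[2]]) + 1L, n1)
  as.integer(`substr<-`(u[[1]], s[1], s[2], u[[2]]))
}

# "1846–52" splits into "1846" and "52"; the largest value, 1846, is the template
full_two_digit <- f(1846, "52")    # 1852
# "1860-1" splits into "1860" and "1"
full_one_digit <- f(1860, "1")     # 1861
# A part that is already four digits long replaces the whole template
unchanged <- f(1880, "1878")       # 1878
```

In the pipeline, `x1` is always the largest numeric piece of the split period, which is the full four-digit year, so abbreviated pieces inherit its leading digits.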

You can refine the aggregation method yourself so that the result looks exactly like the original, but maybe it is better this way :)