Cumulative count of unique values over time

I have a data frame mydf like this:

| Country    | Year |
| ---------- | ---- |
| Bahamas    | 1982 |
| Chile      | 1817 |
| Cuba       | 1960 |
| Finland    | 1918 |
| Kazakhstan | 1993 |

and so on, with many more rows.

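Is there a simple way to plot the cumulative number of distinct countries over time? That is, for each year, the number of different countries that have appeared up to and including that year.

I tried stat_ecdf(), but its y-axis shows a cumulative proportion rather than the absolute number of countries: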

ggplot(mydf, aes(x = Year)) + stat_ecdf()

Here is an example mydf:

> dput(mydf)

structure(list(Country = c("Moldova", "Aragon", "Abu Dhabi", 
"Uzbekistan", "Sweden", "Anhalt", "Saudi Arabia", "Montenegro", 
"Central African Republic", "Bulgaria", "Argentina", "Senegal", 
"Sri Lanka", "Cambodia", "Benin", "Colombia", "Algeria", "Iraq", 
"DPRK", "Italy"), Year = c(1992L, 1223L, 1966L, 1993L, 1748L, 
1835L, 1955L, 1841L, 1959L, 1993L, 1806L, 1960L, 1955L, 1995L, 
1892L, 1914L, 1981L, 1958L, 1948L, 1900L)), row.names = c(NA, 
-20L), class = c("data.table", "data.frame"))

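Give each country an ID number based on its first appearance (after sorting by year). The cumulative count of distinct countries is then the cumulative maximum of that ID: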

mydf = mydf[order(mydf$Year, mydf$Country), ]
mydf$country_id = as.integer(factor(mydf$Country, levels = unique(mydf$Country)))
mydf$cum_n_country = cummax(mydf$country_id)
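To see why the cumulative maximum works, here is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not from the question) in which one country appears twice. It also shows what should be an equivalent base R shortcut, `cumsum(!duplicated(...))`, which counts a row only the first time its country is seen:

```r
# Toy data (hypothetical): "Chile" appears twice and must be counted once.
toy <- data.frame(
  Country = c("Chile", "Cuba", "Chile", "Finland"),
  Year    = c(1817L, 1960L, 1970L, 1918L),
  stringsAsFactors = FALSE
)

# Sort by year (ties broken by country name), as in the answer.
toy <- toy[order(toy$Year, toy$Country), ]

# ID by order of first appearance, then the running maximum of that ID.
toy$country_id    <- as.integer(factor(toy$Country, levels = unique(toy$Country)))
toy$cum_n_country <- cummax(toy$country_id)

# Alternative: add 1 only on the first occurrence of each country.
toy$cum_n_alt <- cumsum(!duplicated(toy$Country))

print(toy)
```

The repeated "Chile" row in 1970 gets ID 1 again, so the running maximum stays at 3 instead of increasing, and both columns agree.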

If any years are repeated, you need to aggregate/summarize cum_n_country by year, taking the maximum for each year:

library(dplyr)
library(ggplot2)
mydf %>%
  group_by(Year) %>%
  summarize(cum_n_country = max(cum_n_country)) %>%
  ggplot(aes(x = Year, y = cum_n_country)) + 
  geom_line()
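As a possible variation, the same curve can be built entirely in dplyr by reducing each country to its first year, counting how many countries are new in each year, and accumulating those counts. This is only a sketch under assumptions: `toy`, `first_year`, `n_new`, and `cum_by_year` are illustrative names, the data are made up, and `geom_step()` is swapped in for `geom_line()` because a count of countries changes in jumps.

```r
library(dplyr)
library(ggplot2)

# Toy data (hypothetical): "Chile" repeats, and 1960 has two new countries.
toy <- data.frame(
  Country = c("Chile", "Cuba", "Chile", "Finland", "Italy"),
  Year    = c(1817L, 1960L, 1970L, 1918L, 1960L)
)

cum_by_year <- toy %>%
  group_by(Country) %>%
  summarize(first_year = min(Year)) %>%   # one row per country: its first year
  count(first_year, name = "n_new") %>%   # number of new countries per year
  arrange(first_year) %>%
  mutate(cum_n_country = cumsum(n_new))   # running total of distinct countries

# Build the plot object; print(p) draws it.
p <- ggplot(cum_by_year, aes(x = first_year, y = cum_n_country)) +
  geom_step()
```

Years in which no new country appears are simply absent from `cum_by_year`; the step line holds the previous value across them.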