在给定出生日期和任意日期的 R 中高效准确地计算年龄(以年、月或周为单位)

Efficient and accurate age calculation (in years, months, or weeks) in R given birth date and an arbitrary date

我面临着在给定出生日期和任意日期的情况下计算年龄(以年、月或周为单位)的常见任务。问题是我经常需要对很多记录(>3 亿)执行此操作,因此性能是这里的关键问题。

在 SO 和 Google 中快速搜索后,我找到了 3 个备选方案:

所以,这是我的玩具代码:

# Some toy birthdates
birthdate <- as.Date(c("1978-12-30", "1978-12-31", "1979-01-01", 
                       "1962-12-30", "1962-12-31", "1963-01-01", 
                       "2000-06-16", "2000-06-17", "2000-06-18", 
                       "2007-03-18", "2007-03-19", "2007-03-20", 
                       "1968-02-29", "1968-02-29", "1968-02-29"))

# Given dates to calculate the age
givendate <- as.Date(c("2015-12-31", "2015-12-31", "2015-12-31", 
                       "2015-12-31", "2015-12-31", "2015-12-31", 
                       "2050-06-17", "2050-06-17", "2050-06-17",
                       "2008-03-19", "2008-03-19", "2008-03-19", 
                       "2015-02-28", "2015-03-01", "2015-03-02"))

# Using a common arithmetic procedure ("Time differences in days"/365.25)
(givendate-birthdate)/365.25

# Use the package lubridate
require(lubridate)
new_interval(start = birthdate, end = givendate) / 
                     duration(num = 1, units = "years")

# Use the package eeptools
library(eeptools)
age_calc(dob = birthdate, enddate = givendate, units = "years")

我们稍后再谈准确性,首先关注性能。这是代码:

# Now let's compare the performance of the alternatives using microbenchmark
library(microbenchmark)
mbm <- microbenchmark(
    arithmetic = (givendate - birthdate) / 365.25,
    lubridate = new_interval(start = birthdate, end = givendate) /
                                     duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate, 
                        units = "years"),
    times = 1000
)

# And examine the results
mbm
autoplot(mbm)

这里是结果:

底线:lubridateeeptools函数的性能比算术法差很多(/365.25至少快10倍)。不幸的是,算术方法不够准确,我不能承受这种方法会犯的一些错误。

"because of the way the modern Gregorian calendar is constructed, there is no straightforward arithmetic method that produces a person’s age, stated according to common usage — common usage meaning that a person’s age should always be an integer that increases exactly on a birthday". (link)

正如我在一些帖子中所读到的,lubridateeeptools 没有犯过这样的错误(尽管我还没有查看 code/read 更多关于这些函数的信息以了解哪种方法他们使用),这就是我想使用它们的原因,但它们的性能不适用于我的实际应用。

关于计算年龄的有效且准确的方法有什么想法吗?

编辑

哎呀,好像lubridate也出错了。并且显然基于这个玩具示例,它比算术方法犯的错误更多(见第 3、6、9、12 行)。 (我做错了什么吗?)

toy_df <- data.frame(
    birthdate = birthdate,
    givendate = givendate,
    arithmetic = as.numeric((givendate - birthdate) / 365.25),
    lubridate = new_interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate,
                        units = "years")
)
toy_df[, 3:5] <- floor(toy_df[, 3:5])
toy_df

    birthdate  givendate arithmetic lubridate eeptools
1  1978-12-30 2015-12-31         37        37       37
2  1978-12-31 2015-12-31         36        37       37
3  1979-01-01 2015-12-31         36        37       36
4  1962-12-30 2015-12-31         53        53       53
5  1962-12-31 2015-12-31         52        53       53
6  1963-01-01 2015-12-31         52        53       52
7  2000-06-16 2050-06-17         50        50       50
8  2000-06-17 2050-06-17         49        50       50
9  2000-06-18 2050-06-17         49        50       49
10 2007-03-18 2008-03-19          1         1        1
11 2007-03-19 2008-03-19          1         1        1
12 2007-03-20 2008-03-19          0         1        0
13 1968-02-29 2015-02-28         46        47       46
14 1968-02-29 2015-03-01         47        47       47
15 1968-02-29 2015-03-02         47        47       47

好的,所以我在另一个 post:

中找到了这个函数
age <- function(from, to) {
    from_lt = as.POSIXlt(from)
    to_lt = as.POSIXlt(to)

    age = to_lt$year - from_lt$year

    ifelse(to_lt$mon < from_lt$mon |
               (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday),
           age - 1, age)
}

@Jim post编辑说 "The following function takes a vectors of Date objects and calculates the ages, correctly accounting for leap years. Seems to be a simpler solution than any of the other answers"。

它确实更简单,并且可以实现我一直在寻找的技巧。平均而言,它实际上比算术法更快(大约快 75%)。

mbm <- microbenchmark(
    arithmetic = (givendate - birthdate) / 365.25,
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate, 
                        units = "years"),
    age = age(from = birthdate, to = givendate),
    times = 1000
)
mbm
autoplot(mbm)

至少在我的示例中它不会出错(在任何示例中都不应该;它是一个使用 ifelses 的非常简单的函数)。

toy_df <- data.frame(
    birthdate = birthdate,
    givendate = givendate,
    arithmetic = as.numeric((givendate - birthdate) / 365.25),
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate,
                        units = "years"),
    age = age(from = birthdate, to = givendate)
)
toy_df[, 3:6] <- floor(toy_df[, 3:6])
toy_df

    birthdate  givendate arithmetic lubridate eeptools age
1  1978-12-30 2015-12-31         37        37       37  37
2  1978-12-31 2015-12-31         36        37       37  37
3  1979-01-01 2015-12-31         36        37       36  36
4  1962-12-30 2015-12-31         53        53       53  53
5  1962-12-31 2015-12-31         52        53       53  53
6  1963-01-01 2015-12-31         52        53       52  52
7  2000-06-16 2050-06-17         50        50       50  50
8  2000-06-17 2050-06-17         49        50       50  50
9  2000-06-18 2050-06-17         49        50       49  49
10 2007-03-18 2008-03-19          1         1        1   1
11 2007-03-19 2008-03-19          1         1        1   1
12 2007-03-20 2008-03-19          0         1        0   0
13 1968-02-29 2015-02-28         46        47       46  46
14 1968-02-29 2015-03-01         47        47       47  47
15 1968-02-29 2015-03-02         47        47       47  47

我不认为它是一个完整的解决方案,因为我还想要以月和周为单位的年龄,而这个功能是特定于年的。无论如何,我 post 在这里,因为它解决了年龄问题。我不会接受因为:

  1. 我会等待@Jim post 它作为答案。
  2. 我会等着看是否有人想出了一个完整的解决方案(高效、准确并根据需要以年、月或周为单位生产年龄)。

上述 lubridate 出现错误的原因是您计算的是持续时间(两个瞬间之间发生的确切时间量,其中 1 年 = 31536000 秒),而不是周期(发生的时钟时间变化)在两个瞬间之间)。

要获取时钟时间的变化(以年、月、日等为单位),您需要使用

as.period(interval(start = birthdate, end = givendate))

给出以下输出

 "37y 0m 1d 0H 0M 0S"   
 "37y 0m 0d 0H 0M 0S"   
 "36y 11m 30d 0H 0M 0S" 
 ...
 "46y 11m 30d 1H 0M 0S" 
 "47y 0m 0d 1H 0M 0S"   
 "47y 0m 1d 1H 0M 0S" 

要只提取年份,您可以使用以下方法

as.period(interval(start = birthdate, end = givendate))$year
 [1] 37 37 36 53 53 52 50 50 49  1  1  0 46 47 47

遗憾的是,注意出现的速度比上述方法还要慢!

> mbm
Unit: microseconds
       expr       min        lq       mean    median         uq        max neval cld
 arithmetic   116.595   138.149   181.7547   184.335   196.8565   5556.306  1000  a 
  lubridate 16807.683 17406.255 20388.1410 18053.274 21378.8875 157965.935  1000   b

我本来打算把它留在评论中,但我认为它值得单独回答。正如@Molx 指出的那样,您的 "arithmetic" 方法并不像看起来那么简单——看看 -.Date 的代码,最重要的是:

return(difftime(e1, e2, units = "days"))

因此,class Date 对象上的 "arithmetic" 方法实际上是 difftime 函数的包装器。 difftime 呢?如果您追求的是原始速度,这也会带来大量开销。

关键是 Date 对象存储为整数天数 since/until 1970 年 1 月 1 日(尽管它们实际上并未存储为 integer,因此IDate class 在 data.table 中的诞生),所以我们可以减去这些并完成它,但是为了避免调用 -.Date 方法,我们必须 unclass 我们的输入:

(unclass(birthdate) - unclass(givendate)) / 365.25

就物有所值而言,这种方法甚至比@Jim 的 age 方法还要快几个数量级。

这里有一些更大规模的测试数据:

set.seed(20349)
NN <- 1e6
birthdate <- as.Date(sprintf('%d-%02d-%02d',
                             sample(1901:2030, NN, TRUE),
                             sample(12, NN, TRUE),
                             sample(28, NN, TRUE)))

#average 30 years, most data between 20 and 40 years
givendate <- birthdate + as.integer(rnorm(NN, mean = 10950, sd = 1000))

(不包括 eeptools,因为它慢得几乎不可能——看一眼 age_calc 的代码就会发现代码甚至 创建了一系列日期每对日期O(n^2)-ish),更不用说 ifelses)

的胡椒粉了
microbenchmark(
  arithmetic = (givendate - birthdate) / 365.25,
  lubridate = interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  age = age(from = birthdate, to = givendate),
  fastar = (unclass(givendate) - unclass(birthdate)) / 365.25,
  overlaps = get_age(birthdate, givendate),
  times = 50)
# Unit: milliseconds
#        expr        min         lq      mean     median         uq      max neval  cld
#  arithmetic  28.153465  30.384639  62.96118  31.492764  34.052991 180.9556    50  b  
#   lubridate  94.327968  97.233009 157.30420 102.751351 240.717065 265.0283    50   c 
#         age 338.347756 479.598513 483.84529 483.580981 488.090832 770.1149    50    d
#      fastar   7.740098   7.831528  11.02521   7.913146   8.090902 153.3645    50 a   
#    overlaps 316.408920 458.734073 459.58974 463.806255 470.320072 769.0929    50    d

因此,我们也强调了对小规模数据进行基准测试的愚蠢行为。

@Jim 方法的最大成本是 as.POSIXlt 随着向量的增长而变得越来越昂贵。

不准确的问题仍然存在,但除非这种准确性是最重要的,否则 unclass 方法似乎是无与伦比的。

我一直在努力解决这个问题,终于有了一个东西是 a) 完美 准确*(与 all到目前为止提出的其他选项)和 b)相当快(参见我在另一个答案中的基准)。它依赖于我手工完成的一系列算术运算以及来自 data.table 包的精彩 foverlaps 函数。

该方法的本质是从 Dates 的整数表示开始工作,并识别所有出生日期都在 1461 (= 365 * 4 + 1) 天的四个日期之一周期,取决于明年是什么时候,你的生日需要 366 天。

函数如下:

library(data.table)
get_age <- function(birthdays, ref_dates){
  x <- data.table(bday <- unclass(birthdays),
                  #rem: how many days has it been since the lapse of the
                  #  most recent quadrennium since your birth?
                  rem = ((ref <- unclass(ref_dates)) - bday) %% 1461)
  #cycle_type: which of the four years following your birthday
  #  was the one that had 366 days? 
  x[ , cycle_type := 
       foverlaps(data.table(start = bdr <- bday %% 1461L, end = bdr),
                 #these intervals were calculated by hand;
                 #  e.g., 59 is Feb. 28, 1970. I made the judgment
                 #  call to say that those born on Feb. 29 don't
                 #  have their "birthday" until the following March 1st.
                 data.table(start = c(0L, 59L, 424L, 790L, 1155L), 
                            end = c(58L, 423L, 789L, 1154L, 1460L), 
                            val = c(3L, 2L, 1L, 4L, 3L),
                            key = "start,end"))$val]
  I4 <- diag(4L)[ , -4L] #for conciseness below
  #The `by` approach might seem a little abstruse for those
  #  not familiar with `data.table`; see the edit history
  #  for a more palatable version (which is also slightly slower)
  x[ , extra := 
       foverlaps(data.table(start = rem, end = rem),
                 data.table(start = st <- cumsum(c(0L, rep(365L, 3L) +
                                                     I4[.BY[[1L]],])),
                            end = c(st[-1L] - 1L, 1461L),
                            int_yrs = 0:3, key = "start,end")
       )[ , int_yrs + (i.start - start) / (end + 1L - start)], by = cycle_type]
  #grand finale -- 4 years for every quadrennium, plus the fraction:
  4L * ((ref - bday) %/% 1461L) + x$extra
}

比较你的主要例子:

toy_df <- data.frame(
  birthdate = birthdate,
  givendate = givendate,
  arithmetic = as.numeric((givendate - birthdate) / 365.25),
  lubridate = interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  eeptools = age_calc(dob = birthdate, enddate = givendate,
                      units = "years"),
  mine = get_age(birthdate, givendate)
)

toy_df
#     birthdate  givendate arithmetic lubridate   eeptools       mine
# 1  1978-12-30 2015-12-31 37.0020534 37.027397 37.0027397 37.0027322 #eeptools wrong: will be 366 days until 12/31/16, so fraction is 1/366
# 2  1978-12-31 2015-12-31 36.9993155 37.024658 37.0000000 37.0000000
# 3  1979-01-01 2015-12-31 36.9965777 37.021918 36.9972603 36.9972603
# 4  1962-12-30 2015-12-31 53.0020534 53.038356 53.0027397 53.0027322 #same problem
# 5  1962-12-31 2015-12-31 52.9993155 53.035616 53.0000000 53.0000000
# 6  1963-01-01 2015-12-31 52.9965777 53.032877 52.9972603 52.9972603
# 7  2000-06-16 2050-06-17 50.0013689 50.035616 50.0000000 50.0027397 #eeptools wrong: not exactly the birthday
# 8  2000-06-17 2050-06-17 49.9986311 50.032877 50.9972603 50.0000000 #eeptools wrong: _is_ exactly the birthday
# 9  2000-06-18 2050-06-17 49.9958932 50.030137 49.9945205 49.9972603 #eeptools wrong: fraction should be 364/365
# 10 2007-03-18 2008-03-19  1.0047912  1.005479  1.0027322  1.0027397 #eeptools wrong: 2/29 already passed, only 365 days until 3/19/2009
# 11 2007-03-19 2008-03-19  1.0020534  1.002740  1.0000000  1.0000000
# 12 2007-03-20 2008-03-19  0.9993155  1.000000  0.9966839  0.9972678 #eeptools wrong: we passed 2/29, so should be 365/366
# 13 1968-02-29 2015-02-28 46.9979466 47.030137 46.9977019 46.9972603 #my judgment: birthday occurs on 3/1 for 2/29 babies, so 364/365 the way there
# 14 1968-02-29 2015-03-01 47.0006845 47.032877 47.0000000 47.0000000
# 15 1968-02-29 2015-03-02 47.0034223 47.035616 47.0027397 47.0027322

这种风格的方法可以扩展到很容易地处理 months/weeks。几个月会有点冗长(必须指定 4 年的月份长度),所以我没有打扰;周很容易(周不受闰年因素的影响,所以我们可以除以 7)。

我在使用 base 功能方面也取得了很多进展,但是 a) 它 相当 丑陋(需要 0- 的非线性变换1460 以避免执行嵌套的 ifelse 语句等)和 b) 最后一个 for 循环(在整个日期列表中以 apply 的形式)是不可避免的,所以我决定这会减慢东西倒太多了。 (转换为x1 = (unclass(birthdays) - 59) %% 1461; x2 = x1 * (729 - x1) / 402232 + x1,为了后代)

我已将此功能添加到 my package

*(对于 non-leap centuries 不是问题的日期范围;我认为处理此类日期的扩展不应太麻烦)