
Efficient calculation of months (continuous) since last birthday at snapshot date

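I have a data.frame with two date columns: an individual's date of birth (DOB) and a reference point in time (Snapshot.Date), say the date we last saw that person. There are other columns (omitted), so I would like the result added as a column to my existing data.frame.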


I tried a plyr solution and a base sapply solution, and both are slower than I expected (I need to process a million rows in my real data).


data.test = structure(list(
  Snapshot.Date = structure(c(
    1433030400, 1396224000, 1375228800, 1396224000, 1383177600,
    1362009600, 1367280000, 1369958400, 1346371200, 1348963200,
    1435622400, 1435622400, 1435622400, 1435622400, 1435622400,
    1435622400, 1435622400, 1435622400, 1435622400, 1346371200
  ), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  DOB = structure(c(
    -268790400, -155692800, -955065600, -551232000, -149644800,
    -774230400, -485395200, -17625600, -131932800, -387244800,
    545961600, 18489600, -230515200, 441676800, -32745600,
    775180800, 713491200, 483235200, 114307200, -815443200
  ), class = c("POSIXct", "POSIXt"), tzone = "UTC")
), .Names = c("Snapshot.Date", "DOB"),
row.names = c(32806L, 21294L, 14880L, 21730L, 17525L,
              8516L, 11068L, 11751L, 2564L, 3832L,
              802276L, 1031697L, 129222L, 588224L, 1093247L,
              878037L, 370736L, 709108L, 861908L, 2199L),
class = "data.frame")


library(lubridate)  # month(), day(), year(), years(), interval(), as.period()
library(plyr)       # adply(), used in the benchmark below

months_since_last_birthday = function(CurrentDate, DateOfBirth) {
  last_birthday = DateOfBirth

  # Feb 29 only occurs once every four years, so reset it to the 28th
  if (month(last_birthday) == 2 && day(last_birthday) == 29)
    day(last_birthday) = 28

  # move the birthday into the snapshot year
  year(last_birthday) = year(CurrentDate)

  if (last_birthday > CurrentDate)
    last_birthday = last_birthday - years(1) # last year's birthday is the most recent occurrence

  # interval() is the current lubridate name for the older new_interval()
  return(as.period(interval(last_birthday, CurrentDate)) / months(1))
}

For the 20 base records, here is the desired output:

       Snapshot.Date        DOB Months.Since.Birthday
32806      2015-05-31 1961-06-26            11.1643836
21294      2014-03-31 1965-01-25             2.1972603
14880      2013-07-31 1939-09-27            10.1315068
21730      2014-03-31 1952-07-14             8.5589041
17525      2013-10-31 1965-04-05             6.8547945
8516       2013-02-28 1945-06-20             8.2630137
11068      2013-04-30 1954-08-15             8.4931507
11751      2013-05-31 1969-06-11            11.6575342
2564       2012-08-31 1965-10-27            10.1315068
3832       2012-09-30 1957-09-24             0.1972603
802276     2015-06-30 1987-04-21             2.2958904
1031697    2015-06-30 1970-08-03            10.8876712
129222     2015-06-30 1962-09-12             9.5917808
588224     2015-06-30 1983-12-31             5.9863014
1093247    2015-06-30 1968-12-18             6.3945205
878037     2015-06-30 1994-07-26            11.1315068
370736     2015-06-30 1992-08-11            10.6246575
709108     2015-06-30 1985-04-25             2.1643836
861908     2015-06-30 1973-08-16            10.4602740
2199       2012-08-31 1944-02-29             6.0986301


# Make 5000 records total for benchmarking, didn't replicate Feb 29th
# since it is a very rare case in the data
data.test = rbind(data.test, data.test[sample(1:19, size = 4980, replace = TRUE),])

start.time = Sys.time()
res = suppressMessages(adply(data.test , 1, transform, Months.Since.Birthday = months_since_last_birthday(Snapshot.Date, DOB)))
end.time = Sys.time()

# end.time - start.time
# Time difference of 1.793945 mins

start.time = Sys.time()
data.test$Months.Since.Birthday = suppressMessages(sapply(1:5000, function(x){return(months_since_last_birthday(data.test$Snapshot.Date[x], data.test$DOB[x]))}))
end.time = Sys.time()

# end.time - start.time
# Time difference of 1.743053 mins

Am I doing something horribly wrong? Does this seem really slow to you? Any feedback is welcome!

Unless I'm missing something obvious, R has plenty of built-in methods for working with time data, in particular base::difftime, which might save you some trouble.

Taking the data set data.test from above:

data.test$dif <- round(as.vector(as.double(difftime(strptime(data.test$Snapshot.Date, format = "%Y-%m-%d"), strptime(data.test$DOB, format = "%Y-%m-%d"), units = "days"))) %% 365, 1)


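# The same expression, spread over several lines for readability
data.test$dif <- round(
  as.vector(as.double(difftime(
    strptime(data.test$Snapshot.Date, format = "%Y-%m-%d"),
    strptime(data.test$DOB, format = "%Y-%m-%d"),
    units = "days"
  ))) %% 365,
  1
)
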
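The above uses the difftime function to find the difference in days between the two dates in the given format (format = "%Y-%m-%d"), then takes the remainder after division by 365 to get the number of days since the last birthday. Personally I think this is a better measure than months, because a 2-month gap spanning July and August covers a different number of days than a 2-month gap spanning January and February.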
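To make the arithmetic concrete, here is a small sketch on the first two rows of the question's data. The dates are typed in by hand as Date objects, and the variable names (snapshot, dob, total_days) are only illustrative. difftime works on whole vectors, so no per-row loop is needed.

```r
# First two rows of the question's data, entered by hand (illustrative names)
snapshot <- as.Date(c("2015-05-31", "2014-03-31"))
dob      <- as.Date(c("1961-06-26", "1965-01-25"))

# One vectorised call gives the full age in days for every row
total_days <- as.numeric(difftime(snapshot, dob, units = "days"))
total_days            # 19697 17962

# Remainder after removing whole 365-day "years"
days_mod <- total_days %% 365
days_mod              # 352 77
```

Counting on a calendar gives 339 and 65 days for these two rows. The remainder overshoots by 13 and 12 days, which is the number of Feb 29ths each person has lived through.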

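NOTE: The above solution does not account for leap years. You could easily keep a lookup list of leap years and add 1 day in the check, or subtract 1 day for each leap year a person has lived through, to get an accurate number.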
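Another way to sidestep the leap-year drift, sketched here in base R only, is to build each person's most recent birthday as a date and subtract it from the snapshot date. This is a different technique from the modulo approach above, not a correction of it. The function name days_since_last_birthday is made up for this example. It assumes both columns hold midnight-UTC timestamps (as in the question's data) and no missing values, and it maps Feb 29 birthdays to Feb 28 as the question does.

```r
# Hypothetical helper: exact days since the most recent birthday, vectorised
days_since_last_birthday <- function(snapshot, dob) {
  snap  <- as.Date(snapshot)          # as.Date() on POSIXct uses UTC by default
  birth <- as.POSIXlt(as.Date(dob))

  mon  <- birth$mon + 1L              # POSIXlt months are 0-based
  mday <- birth$mday
  mday[mon == 2L & mday == 29L] <- 28L  # treat Feb 29 as Feb 28

  # Birthday in the snapshot year
  yr <- as.POSIXlt(snap)$year + 1900L
  this_year <- as.Date(sprintf("%04d-%02d-%02d", yr, mon, mday))

  # If that birthday has not happened yet, fall back to the previous year
  yr <- yr - (this_year > snap)
  last_birthday <- as.Date(sprintf("%04d-%02d-%02d", yr, mon, mday))

  as.numeric(snap - last_birthday)
}

# Three rows from the question, including the Feb 29 birthday
snapshot <- as.POSIXct(c("2015-05-31", "2014-03-31", "2012-08-31"), tz = "UTC")
dob      <- as.POSIXct(c("1961-06-26", "1965-01-25", "1944-02-29"), tz = "UTC")
days_since_last_birthday(snapshot, dob)   # 339 65 185
```

On the question's data frame this would be called as data.test$Days.Since.Birthday <- days_since_last_birthday(data.test$Snapshot.Date, data.test$DOB). Everything inside the function operates on whole vectors, so it should scale to a million rows far better than a per-row adply or sapply call, though I have not benchmarked it.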