Splitting a row at year change

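I have a large dataset representing paired blocks of time, and I would like a clean break at every year boundary, so that each row starts and ends in the same year.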

For example, see the table below.

   type duration cumsum year year.split
1     1      236    236    1        365
2     0      129    365    1        365
3     1      154    519    2        730
4     0      216    735    3       1095

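There is no overlap between year 1 and year 2, because row 3 starts on the first day of year 2. Row 4, however, starts in year 2 and ends 5 days into year 3. I want to split row 4 so that the table looks like this.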
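To make the arithmetic behind that split concrete, here is a minimal sketch for row 4 alone. The variable names (`year_len`, `block_end`, `block_start`, `boundary`, `days_before`, `days_after`) are only illustrative and do not appear in the code further down; a fixed 365-day year is assumed, as in the tables.

```r
year_len  <- 365   # assumed fixed year length, as in the tables
block_end <- 735   # cumsum of row 4
duration  <- 216   # duration of row 4

# Day on which the block starts (end of the previous block)
block_start <- block_end - duration

# Last year boundary that lies before the end of the block
boundary <- (ceiling(block_end / year_len) - 1) * year_len

# Portion of the block on each side of the boundary
days_before <- boundary - block_start
days_after  <- block_end - boundary
```

With these numbers the block starts on day 519, the boundary is day 730, and the block is split into 211 days in year 2 and 5 days in year 3.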

   type duration cumsum year year.split
1     1      236    236    1        365
2     0      129    365    1        365
3     1        0    519    1        365
4     1      154    519    2        730
5     0      211    524    2        730
6     0        5    735    3       1095

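As you can see, nothing overlaps a year boundary any more, because every overlapping block has been split so that each row starts and ends in the same year. The way I do this so far is shown below, but it feels clunky and I am hoping for a more elegant solution.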

library(dplyr)

set.seed(808)
test <- data.frame(type = c(1,0), duration =  round(runif(20, min = 100, max = 250))) %>%
  mutate(cumsum = cumsum(duration), year = ceiling(cumsum/365), year.split = year*365 )

test <- rbind(test[1,],
      filter(test, lag(year) == year), 
      filter(test, lag(year) != year) %>% 
      mutate( duration = cumsum - (year-1)*365),
      filter(test, lag(year) != year) %>% 
        mutate( duration = ((year-1)*365 + duration- cumsum), 
                cumsum = cumsum-duration, 
                year = year -1, 
                year.split = year*365) ) %>% arrange(year, cumsum)


test <- group_by( test,type, year) %>%
  summarise( duration = sum(duration)) %>% ungroup %>% arrange(year)

The last two lines of code summarise the data, because I am interested in the total amount per type per year.

What would be a better way to do this?

Assuming the durations are all strictly positive, this seems to work:

cs<-test$cumsum
cs0<-sort(unique(c(cs,(1:floor(max(cs)/365))*365)))
data.frame(type=test$type[findInterval(cs0-0.5,cs)+1],
           duration=diff(c(0,cs0)),cumsum=cs0,year=ceiling(cs0/365))

  type duration cumsum year
1    1      236    236    1
2    0      129    365    1
3    1      154    519    2
4    0      211    730    2
5    0        5    735    3
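For readers who want to see how this fits together end to end, here is a hedged sketch in base R on the four-row example from the question. The helper name `split_at_years` and the data frame name `blocks` are hypothetical. It assumes whole-number durations (the `- 0.5` offset relies on that) and a fixed 365-day year. It uses `seq_len()` rather than `1:n` so that an input shorter than one year does not generate a spurious boundary.

```r
# Four-row example from the question (hypothetical name `blocks`)
blocks <- data.frame(type = c(1, 0, 1, 0),
                     duration = c(236, 129, 154, 216))
blocks$cumsum <- cumsum(blocks$duration)

# Hypothetical helper: cut blocks at every multiple of `year_len`
split_at_years <- function(type, cs, year_len = 365) {
  # Year boundaries that fall inside the observed range
  bounds <- seq_len(floor(max(cs) / year_len)) * year_len
  # Original block ends plus the year boundaries, without duplicates
  cs0 <- sort(unique(c(cs, bounds)))
  data.frame(
    # The block that contains the day just before each end point
    type     = type[findInterval(cs0 - 0.5, cs) + 1],
    duration = diff(c(0, cs0)),
    cumsum   = cs0,
    year     = ceiling(cs0 / year_len)
  )
}

pieces <- split_at_years(blocks$type, blocks$cumsum)

# Total duration per type and year, as asked for in the question
totals <- aggregate(duration ~ type + year, data = pieces, FUN = sum)
```

Because the year boundaries are simply merged into the vector of block end points, a block that spans more than one boundary is also cut into the right number of pieces.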

Not sure whether this is the R way you are looking for, but you can simplify the rbind call a bit:

rbind (filter(test, cumsum - duration >= (year - 1) * 365),
       filter(test, cumsum - duration < (year - 1) * 365) %>%
         mutate(duration = cumsum - (year - 1) * 365),
       filter(test, cumsum - duration < (year - 1) * 365) %>%
         mutate(year = year - 1, # I'm changing the year first so it will propagate
                duration = duration - (cumsum - (year * 365)),
                cumsum = (year) * 365,
                year.split = year * 365) 
               )

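As you can see, I combine three data.frames:

  1. The rows that are already correct, because their duration does not straddle two years.
  2. The overlapping rows, with the duration set to the number of days that fall in the later year.
  3. The same overlapping rows again, with the values changed to describe the part in the previous year.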

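There are two things I don't like here: I use the same filter twice (for cases 2 and 3), and tomorrow I will need 10 to 15 minutes to understand this code (or I could leave a comment such as # It works, don't worry).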

I think a more verbose version of this code would be easier to maintain:

# These don't overlap        
ok <- filter(test, cumsum - duration >= (year - 1) * 365)

# These do overlap! We need to split them in two
ko <- filter(test, cumsum - duration < (year - 1) * 365)

# For the most recent year, it's enough to change the duration
ko.recent <- mutate(ko, 
                    duration = cumsum - (year - 1) * 365
) 

# For the previous year, a bit more
ko.previous <- mutate(ko, 
                      year = year - 1, # I'm changing the year first
                                       # so it will propagate
                      duration = duration - (cumsum - (year * 365)),
                      cumsum = (year) * 365,
                      year.split = year * 365
) 

# Let me put them back together and sort them for you
test1 <- rbind (ok,
               ko.recent,
               ko.previous
              ) 
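The result is not sorted yet, and it is worth checking that no days were lost in the split. Below is a minimal base R sketch of the same ok / ko logic on the four-row example from the question, followed by a few sanity checks. It is an illustration under the same assumptions as above (365-day years, no block longer than one year), not a drop-in replacement; the names `start` and `overlaps` are mine.

```r
# Four-row example from the question
test <- data.frame(type = c(1, 0, 1, 0), duration = c(236, 129, 154, 216))
test$cumsum     <- cumsum(test$duration)
test$year       <- ceiling(test$cumsum / 365)
test$year.split <- test$year * 365

# A block overlaps when it starts before the first day of its end year
start    <- test$cumsum - test$duration
overlaps <- start < (test$year - 1) * 365
ok <- test[!overlaps, ]
ko <- test[overlaps, ]

# Part that falls in the later year: only the duration changes
ko.recent <- transform(ko, duration = cumsum - (year - 1) * 365)

# Part that falls in the previous year
ko.previous            <- ko
ko.previous$year       <- ko$year - 1
ko.previous$duration   <- ko$duration - (ko$cumsum - ko.previous$year * 365)
ko.previous$cumsum     <- ko.previous$year * 365
ko.previous$year.split <- ko.previous$year * 365

# Put the pieces back together in chronological order
test1 <- rbind(ok, ko.recent, ko.previous)
test1 <- test1[order(test1$cumsum), ]

# Sanity checks: no days lost, and every row stays inside its year
stopifnot(sum(test1$duration) == sum(test$duration))
stopifnot(all(test1$cumsum - test1$duration >= (test1$year - 1) * 365))
stopifnot(all(test1$cumsum <= test1$year * 365))
```

If any of the `stopifnot()` calls fails on real data, the most likely cause is a block that spans more than one year boundary, which this two-way split does not handle.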

Not sure whether this is the answer you were looking for; I am only just learning R.