为日期段提供组 ID

Question

我正在尝试按时间段自动归属组号。因为我正在编写函数来按用户定义的不同时间段聚合天气数据的时间序列。让我们称“n”为子周期数

d1 = seq(as.Date("1910/1/1"), as.Date("1910/1/20"), "days")
d2 = seq(as.Date("1911/2/4"), as.Date("1911/2/27"), "days")
id1 = rep("1", length(d1))
id2 = rep("2", length(d2))       

df = data.frame(date = c(d1,d2), id = c(id1,id2))
df

我想将日期分成“n”个周期，并将周期编号添加到数据框的每一行：如果我想要 4 天的时间，那就是这样的：

df$period = c(rep(c(1:4), each = length(d1)/4), rep(c(1:4), each = length(d2)/4))
df

我的真实数据集中每个ID的日期长度都不一样。所以这就是为什么我要构建相同大小的第一个组和其余的最后一个组。

假设我想要第四节课：我写了这个，但这只返回了“4”：

df2 =df %>% 
  group_by(date,id) %>%
  mutate(period = c(rep(seq(1,4-1, by = 1), each = as.integer(length(date)/4)),
                    rep(4, length(date)-((4-1)*as.integer(length(date)/4))))) 
df2

有人有想法吗？

@hammoire :

因此，例如，对于第一个 ID，我有 20 个日期，如果我想将其分成 3 个时间段： c(1,1,1,1,1,1 ,2,2,2,2,2,2, 3,3,3,3,3,3,3,3)

Answer 1

使用data.table：（不是很优雅但有效）

d[, N := .N, by=id]
d[, n := floor(N/4) ]
d[, j := mapply(function(N,n) seq(1, N, by=n) %>% list, N, n)]
d[, y := ifelse(t %in% unlist(j), 1, 0), by=id]
d[, y := cumsum(y), by=id]
d[, c("N","n","j") := NULL]
d

         date id  t y
 1: 1910-01-01  1  1 1
 2: 1910-01-02  1  2 1
 3: 1910-01-03  1  3 1
 4: 1910-01-04  1  4 1
 5: 1910-01-05  1  5 1
 6: 1910-01-06  1  6 2
 7: 1910-01-07  1  7 2
 8: 1910-01-08  1  8 2
 9: 1910-01-09  1  9 2
10: 1910-01-10  1 10 2
11: 1910-01-11  1 11 3
12: 1910-01-12  1 12 3
13: 1910-01-13  1 13 3
14: 1910-01-14  1 14 3
15: 1910-01-15  1 15 3
16: 1910-01-16  1 16 4
17: 1910-01-17  1 17 4
18: 1910-01-18  1 18 4
19: 1910-01-19  1 19 4
20: 1910-01-20  1 20 4
21: 1911-02-04  2  1 1
22: 1911-02-05  2  2 1
23: 1911-02-06  2  3 1
24: 1911-02-07  2  4 1
25: 1911-02-08  2  5 1
26: 1911-02-09  2  6 1
27: 1911-02-10  2  7 2
28: 1911-02-11  2  8 2
29: 1911-02-12  2  9 2
30: 1911-02-13  2 10 2
31: 1911-02-14  2 11 2
32: 1911-02-15  2 12 2
33: 1911-02-16  2 13 3
34: 1911-02-17  2 14 3
35: 1911-02-18  2 15 3
36: 1911-02-19  2 16 3
37: 1911-02-20  2 17 3
38: 1911-02-21  2 18 3
39: 1911-02-22  2 19 4
40: 1911-02-23  2 20 4
41: 1911-02-24  2 21 4
42: 1911-02-25  2 22 4
43: 1911-02-26  2 23 4
44: 1911-02-27  2 24 4
          date id  t y

Answer 2

我会试试这个：

n_period = 4

df %>% 
  group_by(id) %>% 
  mutate(period = sort(rep_len(1:n_period, length.out = n())))
#          date id period
# 1  1910-01-01  1      1
# 2  1910-01-02  1      1
# 3  1910-01-03  1      1
# 4  1910-01-04  1      1
# 5  1910-01-05  1      1
# 6  1910-01-06  1      2
# 7  1910-01-07  1      2
# 8  1910-01-08  1      2
# 9  1910-01-09  1      2
# 10 1910-01-10  1      2
# 11 1910-01-11  1      3
# 12 1910-01-12  1      3
# 13 1910-01-13  1      3
# 14 1910-01-14  1      3
# 15 1910-01-15  1      3
# 16 1910-01-16  1      4
# 17 1910-01-17  1      4
# 18 1910-01-18  1      4
# 19 1910-01-19  1      4
# 20 1910-01-20  1      4
# ...
# 33 1911-02-16  2      3
# 34 1911-02-17  2      3
# 35 1911-02-18  2      3
# 36 1911-02-19  2      3
# 37 1911-02-20  2      3
# 38 1911-02-21  2      3
# 39 1911-02-22  2      4
# 40 1911-02-23  2      4
# 41 1911-02-24  2      4
# 42 1911-02-25  2      4
# 43 1911-02-26  2      4
# 44 1911-02-27  2      4

所有额外内容将按顺序分配给组，因此如果您有 7 个日期和 4 个时间段，它将是 1, 1, 2, 2, 3, 3, 4

或者，如果您想要最后一组中的所有额外内容，例如，4 个周期 7 个条目的情况是 1, 2, 3, 4, 4, 4, 4，这应该可行：

df %>% 
   group_by(id) %>% 
   mutate(period = c(rep(1:n_period, each = n() %/% n_period), rep(n_period, n() %% n_period)))

Answer 3

不确定这是否是您想要的？该函数允许您指定组数，但我不确定是否要为每个 id 自动定义组数。让我知道是否是这种情况，我可以尝试修改。谢谢

#n specifies the number of desired groups

group_fun <- function(v, n) {
  len_v <- length(v)
  n_per_group <- floor(length(v)/n)
  output_temp <- sort(rep(1:n, times = n_per_group))
  output <- output_temp[1:len_v]
  output[is.na(output)] <- max(output_temp, na.rm = TRUE)
  output

}

group_fun(df$period[df$id==1], 3)

df %>% 
  group_by(id) %>%
  mutate(period =  group_fun(id, n = 3))

为日期段提供组 ID

Give group ID for date periods

automation

r

period