Why does a mutate following a group_by(year, month) seem to miss a row?
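I have a data frame of daily data that I am converting to monthly periodicity, including a simple transformation based on the summarised values: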
library(dplyr)
library(lubridate)
tibble(
date = ymd("2002-12-31") + c(0:60),
index = 406 * exp(cumsum(rnorm(61,0,0.01)))
) %>% mutate(
year = year(date),
month = month(date)
) %>% group_by(year, month) %>% summarise(
date = last(date),
month.close = last(index),
) %>% mutate(
month.change = log(month.close / lag(month.close))
)
The code seems straightforward, but when I run it I get something odd:
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 4 x 5
# Groups: year [2]
year month date month.close month.change
<dbl> <dbl> <date> <dbl> <dbl>
1 2002 12 2002-12-31 403. NA
2 2003 1 2003-01-31 419. NA
3 2003 2 2003-02-28 422. 0.00572
4 2003 3 2003-03-01 417. -0.0121
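Why does row 2 have no month.change value, even though rows 1 and 2 both have valid month.close values? Does the summarize() operation act separately on each of the two grouping dimensions?
I really need to understand why this happens, so please don't just tell me to use a different function to collapse the periodicity. I want to know which part of my understanding of the implementation is wrong, so that I don't introduce similar bugs elsewhere. I know it has to do with grouping by two variables, because when I collapse the two columns into one I get the expected behaviour.
This code: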
library(zoo)
tibble(
date = ymd("2002-12-31") + c(0:60),
index = 406 * exp(cumsum(rnorm(61,0,0.01)))
) %>% mutate(
year.month = as.yearmon(date)
) %>% group_by(year.month) %>% summarise(
date = last(date),
month.close = last(index),
) %>% mutate(
month.change = log(month.close / lag(month.close))
)
returns the expected result:
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 4 x 4
year.month date month.close month.change
<yearmon> <date> <dbl> <dbl>
1 Dec 2002 2002-12-31 405. NA
2 Jan 2003 2003-01-31 428. 0.0560
3 Feb 2003 2003-02-28 421. -0.0173
4 Mar 2003 2003-03-01 423. 0.00513
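What am I missing?
When you use group_by with summarise, only the last level of grouping is dropped by default. The mutate that follows therefore computes lag() within each year, so the first row of 2003 has no previous value and gets NA.
At this stage your data is still grouped by year: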
tibble(
date = ymd("2002-12-31") + c(0:60),
index = 406 * exp(cumsum(rnorm(61,0,0.01)))
) %>% mutate(
year = year(date),
month = month(date)
) %>% group_by(year, month) %>% summarise(
date = last(date),
month.close = last(index))
# A tibble: 4 x 4
# Groups: year [2] # <- Notice this
# year month date month.close
# <int> <int> <date> <dbl>
#1 2002 12 2002-12-31 411.
#2 2003 1 2003-01-31 393.
#3 2003 2 2003-02-28 406.
#4 2003 3 2003-03-01 398.
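To make the effect reproducible without random data, here is a minimal sketch with a small hand-made data frame. The values and the names daily, monthly, with_lag and prev.close are made up for illustration, and .groups = "drop_last" simply spells out the default described above (it assumes dplyr 1.0.0 or later).

```r
library(dplyr)

# Hypothetical daily data: two observations per month
daily <- tibble(
  year  = c(2002, 2002, 2003, 2003, 2003, 2003),
  month = c(12, 12, 1, 1, 2, 2),
  index = c(400, 402, 405, 410, 415, 420)
)

# "drop_last" is the default here: only `month` is dropped as a grouping key
monthly <- daily %>%
  group_by(year, month) %>%
  summarise(month.close = last(index), .groups = "drop_last")

group_vars(monthly)   # "year"

# lag() restarts in each year group, so the first 2003 row gets NA
with_lag <- monthly %>% mutate(prev.close = lag(month.close))
with_lag$prev.close   # NA NA 410
```

Calling group_vars() on an intermediate result is a quick way to check which grouping is still active before applying window functions such as lag().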
To override this behaviour, you can specify .groups = 'drop' in summarise, or call ungroup() after the step above.
tibble(
date = ymd("2002-12-31") + c(0:60),
index = 406 * exp(cumsum(rnorm(61,0,0.01)))
) %>% mutate(
year = year(date),
month = month(date)
) %>% group_by(year, month) %>% summarise(
date = last(date),
month.close = last(index), .groups = 'drop',
) %>% mutate(
month.change = log(month.close / lag(month.close))
)
# year month date month.close month.change
# <int> <int> <date> <dbl> <dbl>
#1 2002 12 2002-12-31 399. NA
#2 2003 1 2003-01-31 380. -0.0510
#3 2003 2 2003-02-28 381. 0.00257
#4 2003 3 2003-03-01 381. 0.000673
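The ungroup() route mentioned above works the same way. A minimal sketch, again with made-up values and hypothetical names (monthly, result), starting from a summary that is still grouped by year:

```r
library(dplyr)

# Hypothetical monthly summary that is still grouped by year
monthly <- tibble(
  year        = c(2002, 2003, 2003),
  month       = c(12, 1, 2),
  month.close = c(400, 410, 420)
) %>% group_by(year)

# ungroup() removes the remaining grouping, so lag() runs over all rows
result <- monthly %>%
  ungroup() %>%
  mutate(month.change = log(month.close / lag(month.close)))

result$month.change   # NA only in the first row
```

Both approaches should give the same result; .groups = 'drop' keeps the intent inside the summarise call, while ungroup() also works on data that was grouped by an earlier step.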
In the second case, your data is grouped by only one key, so that grouping is dropped after summarise and you get the expected output.