How to calculate a lag difference in R when the number of periods varies, or how to add empty NA rows to a data frame?
If I have the following data frame:
tibble(
period = c("2010END", "2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END",
"2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2011END","2012END"),
website = c(
"google",
"google",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"youtube",
"youtube"
),
values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30)
)
and want to find the lag of the values, so that I produce the following data frame:
tibble(
period = c("2010END", "2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END",
"2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2011END","2012END"),
website = c(
"google",
"google",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"youtube",
"youtube"
),
values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30),
output = c(NA, 1,NA,NA,NA,NA,NA, 5,5,5,5,5, NA, 10)
)
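The periods differ: one obvious lag is 5, across Q1, Q2, Q3, Q4, END, and the second kind compares year_priorEND with year_aheadEND, or something even further apart.
Or:
Alternatively, if a website does not have all 5 periods per year (that is, values for Q1, Q2, Q3, Q4, and END), generate the missing rows for that website and period with NA values, so that something like this is produced: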
tibble(
period = c("2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END"),
website = c(
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"google",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"facebook",
"youtube",
"youtube",
"youtube",
"youtube",
"youtube",
"youtube",
"youtube",
"youtube",
"youtube",
"youtube"
),
values = c(NA,NA,NA,NA,1, NA,NA,NA,NA,2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, NA,NA,NA,NA,20, NA,NA,NA,NA,30),
output = c(NA,NA,NA,NA,NA,NA,NA,NA,NA, 1,NA,NA,NA,NA,NA, 5,5,5,5,5, NA,NA,NA,NA,NA,NA,NA,NA,NA, 10)
)
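So, without explicitly coding which fields need to be imputed, I assume some kind of check could be done for each group?
In that case we could simply use output = lag(values, 5), because the periods would all be consistent.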
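The padding idea described above can be sketched with tidyr. This is only a sketch under stated assumptions: the names qtr_levels and padded are made up for illustration, each website is assumed to have at most one row per year and period, and years are filled per website between its own first and last year. Once every website has all five periods per year, a constant lag of 5 lines up with the same period of the previous year. Following the desired output shown above, output is taken as the difference values - lag(values, 5).

```r
library(dplyr)
library(tidyr)

df <- tibble(
  period  = c("2010END", "2011END",
              "2010Q1", "2010Q2", "2010Q3", "2010Q4", "2010END",
              "2011Q1", "2011Q2", "2011Q3", "2011Q4", "2011END",
              "2011END", "2012END"),
  website = c(rep("google", 2), rep("facebook", 10), rep("youtube", 2)),
  values  = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30)
)

# Hypothetical helper: the five periods of a year, in chronological order
qtr_levels <- c("Q1", "Q2", "Q3", "Q4", "END")

padded <- df %>%
  # split "2010Q1" into year = 2010 (integer) and qtr = "Q1"
  extract(period, into = c("year", "qtr"),
          regex = "^(\\d{4})(.*)$", convert = TRUE) %>%
  # a factor makes complete() expand to all five levels and sorts them correctly
  mutate(qtr = factor(qtr, levels = qtr_levels)) %>%
  group_by(website) %>%
  # add the missing year/period rows within each website; values become NA
  complete(year = min(year):max(year), qtr) %>%
  arrange(year, qtr, .by_group = TRUE) %>%
  # with five rows per year, lag 5 is the same period one year earlier
  mutate(output = values - lag(values, 5)) %>%
  ungroup()

padded
```

Note that each website is padded over its own years, so youtube ends up with 2011 and 2012 rather than 2010 and 2011.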
I think the output for youtube should be 20? If so, here is a possible solution, although it is clunky:
library(tidyverse)
df <- tibble( period = c("2010END", "2011END",
"2010Q1","2010Q2","2010Q3","2010Q4","2010END",
"2011Q1","2011Q2","2011Q3","2011Q4","2011END",
"2011END","2012END"),
website = c( "google", "google", "facebook",
"facebook", "facebook", "facebook",
"facebook", "facebook", "facebook",
"facebook", "facebook", "facebook",
"youtube", "youtube" ),
values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30) )
df %>%
extract(period, into = c("Year", "Period"), regex = "([0-9]*)(.*)") %>%
group_by(website) %>%
mutate(group_rank = match(Year, unique(Year)),
intstep = if_else(lag(Period) == "END" & group_rank == 2, lag(values), 0)) %>%
group_by(website, group_rank) %>%
mutate(outcome = if_else(group_rank == 2, first(intstep), as.double(NA))) %>%
ungroup() %>%
select(-c(group_rank,intstep))
#> # A tibble: 14 × 5
#> Year Period website values outcome
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 2010 END google 1 NA
#> 2 2011 END google 2 1
#> 3 2010 Q1 facebook 1 NA
#> 4 2010 Q2 facebook 2 NA
#> 5 2010 Q3 facebook 3 NA
#> 6 2010 Q4 facebook 4 NA
#> 7 2010 END facebook 5 NA
#> 8 2011 Q1 facebook 6 5
#> 9 2011 Q2 facebook 7 5
#> 10 2011 Q3 facebook 8 5
#> 11 2011 Q4 facebook 9 5
#> 12 2011 END facebook 10 5
#> 13 2011 END youtube 20 NA
#> 14 2012 END youtube 30 20
This pipeline could probably be tidier and simpler; I am open to suggestions.
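If I understand correctly, the OP wants to compute, for each website, the year-on-year difference between a period and the same period of the previous year. There is a related question in which the OP explicitly asked to compute 2011Q1 - 2010Q1 and so on, including 2011END - 2010END.
Using lag() only works if the time series is complete and the number of positions to lag by is always constant. That is not the case for the given dataset.
Therefore, I suggest using an update self-join: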
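library(data.table)
# inp is the tibble from the question; setDT() converts it to a data.table by reference
# split period right after the 4-digit year; type.convert = TRUE makes year an integer
setDT(inp)[, c("year", "qtr") := tstrsplit(period, "(?<=^\\d{4})", perl = TRUE,
                                           type.convert = TRUE)][
  , prior_year := year - 1L]
# update self-join: match prior_year against year within the same website and qtr
inp[inp, on = .(prior_year = year, qtr, website), output := x.values - i.values][]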
period website values year qtr prior_year output
1: 2010END google 1 2010 END 2009 NA
2: 2011END google 2 2011 END 2010 1
3: 2010Q1 facebook 1 2010 Q1 2009 NA
4: 2010Q2 facebook 2 2010 Q2 2009 NA
5: 2010Q3 facebook 3 2010 Q3 2009 NA
6: 2010Q4 facebook 4 2010 Q4 2009 NA
7: 2010END facebook 5 2010 END 2009 NA
8: 2011Q1 facebook 6 2011 Q1 2010 5
9: 2011Q2 facebook 7 2011 Q2 2010 5
10: 2011Q3 facebook 8 2011 Q3 2010 5
11: 2011Q4 facebook 9 2011 Q4 2010 5
12: 2011END facebook 10 2011 END 2010 5
13: 2011END youtube 20 2011 END 2010 NA
14: 2012END youtube 30 2012 END 2011 10
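Explanation
- period is split into two parts, year and qtr, and year is coerced to type integer. The regular expression used for splitting uses a positive lookbehind (a zero-length assertion) to split the string immediately after the first four digits.
- prior_year is computed and appended to the dataset.
- In the update self-join, prior_year is matched against year for the same website and qtr. The difference of the values is computed and appended as the new column output.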
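For readers who prefer dplyr, the matching logic of the steps above can be approximated with an ordinary left join. This is a hedged sketch rather than the method of the answer: the names split_df, prior, prior_values and result are made up for illustration, and each combination of website, year and qtr is assumed to occur at most once.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  period  = c("2010END", "2011END",
              "2010Q1", "2010Q2", "2010Q3", "2010Q4", "2010END",
              "2011Q1", "2011Q2", "2011Q3", "2011Q4", "2011END",
              "2011END", "2012END"),
  website = c(rep("google", 2), rep("facebook", 10), rep("youtube", 2)),
  values  = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30)
)

# split "2010Q1" into year = 2010 (integer) and qtr = "Q1"
split_df <- df %>%
  extract(period, into = c("year", "qtr"),
          regex = "^(\\d{4})(.*)$", convert = TRUE)

# lookup table: shift each row one year forward so that it lines up
# with the row of the following year for the same website and qtr
prior <- split_df %>%
  transmute(website, qtr, year = year + 1L, prior_values = values)

result <- split_df %>%
  left_join(prior, by = c("website", "qtr", "year")) %>%
  # rows without a matching prior-year row get NA
  mutate(output = values - prior_values) %>%
  select(-prior_values)

result
```

Because the match is made on the keys rather than on row positions, missing periods do not shift the comparison, which is the same reason the self-join is preferred over lag() here.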