dplyr: Filter within years prior to marked time period
I have a list of year- and country-specific dummies, and I would also like to flag the two years before each marked year.
The data looks like this:
library(tidyverse)
df <- tribble(
~year, ~country, ~occurrence,
#--|--|----
2003, "USA", 1,
2004, "USA", 0,
2005, "USA", 0,
2006, "USA", 0,
2007, "USA", 0,
2008, "USA", 0,
2009, "USA", 0,
2010, "USA", 0,
2011, "USA", 1,
2012, "USA", 0,
2013, "USA", 0,
2005, "FRA", 0,
2006, "FRA", 0,
2007, "FRA", 1,
2008, "FRA", 1,
2009, "FRA", 0,
2010, "FRA", 0,
2011, "FRA", 0,
2012, "FRA", 0,
2013, "FRA", 0,
2014, "FRA", 0,
2015, "FRA", 1
)
So for "USA" I would also like to put a 1 in the occurrence column for 2009 and 2010, and for "FRA" in 2005, 2006, 2013 and 2014.
I thought of doing something like this:
df %>%
  group_by(country) %>%
  mutate(occurrence = ifelse("not sure what to put here", 1, 0))
But I am not sure how to tell R to pick out only the years I want.
After grouping by 'country', we can take up to two leads of 'occurrence' and then use pmax to take the row-wise maximum, which gives the expected output in 'occurrence':
df %>%
  group_by(country) %>%
  # a row becomes 1 if it, the next row, or the row after that is 1
  mutate(occurrence = pmax(occurrence,
                           lead(occurrence, default = 0),
                           lead(occurrence, default = 0, n = 2)))
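If the window ever needs to be wider than two years, the same lead-and-pmax idea can be wrapped in a small helper. This is only a sketch: `flag_prior`, its `k` argument and the `df_small` data are hypothetical names introduced for illustration, and it assumes one row per year with no gaps inside each country.

```r
library(dplyr)

# Hypothetical helper (not part of the original answer): a row becomes 1 when
# the row itself or any of the next `k` rows in the same group is 1.
flag_prior <- function(x, k = 2) {
  # list(x, lead(x, 1), ..., lead(x, k)); positions past the group end get 0
  shifted <- c(list(x),
               lapply(seq_len(k), function(i) dplyr::lead(x, n = i, default = 0)))
  # element-wise maximum across all shifted copies
  do.call(pmax, shifted)
}

# Small illustrative data set (assumed, not the data from the question)
df_small <- tibble::tribble(
  ~year, ~country, ~occurrence,
  2009, "USA", 0,
  2010, "USA", 0,
  2011, "USA", 1,
  2012, "USA", 0
)

res <- df_small %>%
  arrange(country, year) %>%   # lead() follows row order, so sort first
  group_by(country) %>%
  mutate(occurrence = flag_prior(occurrence, k = 2)) %>%
  ungroup()

res$occurrence
# 1 1 1 0
```

Sorting by country and year before grouping matters here, because lead() looks at the next rows, not at the next calendar years.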
Alternatively, this can be done with data.table using a similar approach:
library(data.table)
setDT(df)[, occurrence := do.call(pmax, shift(occurrence, n = 0:2,
type = "lead", fill = 0)), country]
df
# year country occurrence
# 1: 2003 USA 1
# 2: 2004 USA 0
# 3: 2005 USA 0
# 4: 2006 USA 0
# 5: 2007 USA 0
# 6: 2008 USA 0
# 7: 2009 USA 1
# 8: 2010 USA 1
# 9: 2011 USA 1
#10: 2012 USA 0
#11: 2013 USA 0
#12: 2005 FRA 1
#13: 2006 FRA 1
#14: 2007 FRA 1
#15: 2008 FRA 1
#16: 2009 FRA 0
#17: 2010 FRA 0
#18: 2011 FRA 0
#19: 2012 FRA 0
#20: 2013 FRA 1
#21: 2014 FRA 1
#22: 2015 FRA 1
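Both approaches above shift by row position, so they implicitly assume that every country has exactly one row per year with no missing years. If that might not hold, comparing the year values directly is one possible alternative. The sketch below uses a hypothetical helper `flag_by_year` and a made-up `df_gap` in which 2009 is missing; neither comes from the original thread.

```r
library(dplyr)

# Hypothetical helper: flag a year when an event (occurrence == 1) falls in
# that year or in one of the following `k` calendar years.
flag_by_year <- function(year, occurrence, k = 2) {
  event_years <- year[occurrence %in% 1]
  vapply(
    year,
    function(y) as.numeric(any(event_years >= y & event_years <= y + k)),
    numeric(1)
  )
}

# Made-up data with a gap: there is no row for 2009
df_gap <- tibble::tribble(
  ~year, ~country, ~occurrence,
  2008, "USA", 0,
  2010, "USA", 0,
  2011, "USA", 1,
  2012, "USA", 0
)

res_gap <- df_gap %>%
  group_by(country) %>%
  mutate(occurrence = flag_by_year(year, occurrence, k = 2)) %>%
  ungroup()

res_gap$occurrence
# 0 1 1 0
```

A purely row-based lead would also flag 2008 here, because the 2011 row sits two positions below it even though it is three calendar years later.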
Here is another dplyr solution:
df %>%
  group_by(country) %>%
  mutate(
    occurrence = ifelse(lead(occurrence, 1) %in% 1 |
                          lead(occurrence, 2) %in% 1,
                        1, occurrence)
  )
# A tibble: 22 x 3
# Groups: country [2]
year country occurrence
<dbl> <chr> <dbl>
1 2003 USA 1
2 2004 USA 0
3 2005 USA 0
4 2006 USA 0
5 2007 USA 0
6 2008 USA 0
7 2009 USA 1
8 2010 USA 1
9 2011 USA 1
10 2012 USA 0
11 2013 USA 0
12 2005 FRA 1
13 2006 FRA 1
14 2007 FRA 1
15 2008 FRA 1
16 2009 FRA 0
17 2010 FRA 0
18 2011 FRA 0
19 2012 FRA 0
20 2013 FRA 1
21 2014 FRA 1
22 2015 FRA 1
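Use lead(occurrence, 1) %in% 1 instead of lead(occurrence, 1) == 1, because the latter cannot handle NA: lead() returns NA for the last rows of each group, NA == 1 evaluates to NA, and ifelse() would then return NA for those rows, whereas NA %in% 1 is FALSE.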
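The difference is easy to check in base R on a small vector (the vector `x` below is just an illustration):

```r
x <- c(0, 1, NA)

eq_result <- x == 1     # FALSE  TRUE    NA  (NA propagates)
in_result <- x %in% 1   # FALSE  TRUE FALSE  (NA is simply "not in")

ifelse(eq_result, 1, 0)
# 0 1 NA
ifelse(in_result, 1, 0)
# 0 1 0
```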