Selecting years with both summer and winter
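I have a dataset that spans several years of summers and winters. I just realized that when I subset it by winter and summer, I end up with more winters than summers. I think the problem is that my data either ends at the start of a summer or at the end of a winter.
Is there a condition I can set so that I only select years that have both a summer and a winter?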
library(lubridate)
library(tidyverse)
date <- rep_len(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"), 1000)
ID <- rep(seq(1, 5), 100)
df <- data.frame(date = date,
                 x = runif(length(date), min = 60000, max = 80000),
                 y = runif(length(date), min = 800000, max = 900000),
                 ID)
df$month <- month(df$date)
df$year <- year(df$date)
df1 <- df %>%
  mutate(season_categ = case_when(month %in% 6:8 ~ 'summer',
                                  month %in% 1:3 ~ 'winter')) %>%
  group_by(ID, year, season_categ)
summer_list <- df1 %>%
  group_by(ID, year) %>%
  filter(season_categ == "summer") %>%
  group_split()
winter_list <- df1 %>%
  group_by(ID, year) %>%
  filter(season_categ == "winter") %>%
  group_split()
This should do it:
df1 %>%
  group_by(year) %>%
  filter(any(season_categ == "winter") &
           any(season_categ == "summer"))
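To see what the grouped filter keeps, here is a minimal sketch on a toy table. The table `toy` and its values are hypothetical, made up for illustration and not taken from the question's data.

```r
library(dplyr)

# Hypothetical toy data: 2010 has only winter rows, 2011 has both seasons
# plus one unlabelled (NA) row, as months outside the two seasons would produce
toy <- tibble(
  year         = c(2010, 2010, 2011, 2011, 2011),
  season_categ = c("winter", "winter", "winter", "summer", NA)
)

kept <- toy %>%
  group_by(year) %>%
  # any() is evaluated once per year, so a year is kept or dropped as a whole
  filter(any(season_categ == "winter") &
           any(season_categ == "summer")) %>%
  ungroup()

unique(kept$year)  # 2011
nrow(kept)         # 3
```

Because the condition is a single value per group, every row of a qualifying year is returned, including rows that belong to neither season.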
To test it, we can first remove the winter months from one year (2010, for example) to get an incomplete year:
df1 %>%
  filter(!(year == 2010 & season_categ == "winter")) %>%
  group_by(year) %>%
  filter(any(season_categ == "winter") &
           any(season_categ == "summer"))
#> # A tibble: 635 x 7
#> # Groups: year [2]
#> date x y ID month year season_categ
#> <date> <dbl> <dbl> <int> <dbl> <dbl> <chr>
#> 1 2011-01-01 69169. 880856. 1 1 2011 winter
#> 2 2011-01-02 62891. 869748. 2 1 2011 winter
#> 3 2011-01-03 64951. 851220. 3 1 2011 winter
#> 4 2011-01-04 77424. 844041. 4 1 2011 winter
#> 5 2011-01-05 75827. 861533. 5 1 2011 winter
#> 6 2011-01-06 72937. 830014. 1 1 2011 winter
#> 7 2011-01-07 60130. 830369. 2 1 2011 winter
#> 8 2011-01-08 79719. 812852. 3 1 2011 winter
#> 9 2011-01-09 60300. 845120. 4 1 2011 winter
#> 10 2011-01-10 62817. 879759. 5 1 2011 winter
#> # … with 625 more rows
This is the same as df1 %>% filter(year != 2010) (for my dates), which means it works.
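The question splits the data by ID and year, while the filter above groups by year only. If each ID must have both seasons within a year, one possible variation is to group by both columns. The sketch below uses a hypothetical toy table to show the difference; whether this per-ID requirement applies is an assumption about the asker's intent.

```r
library(dplyr)

# Hypothetical toy data: in 2011, ID 1 has both seasons but ID 2 has only winter
toy <- tibble(
  ID           = c(1, 1, 2, 2),
  year         = c(2011, 2011, 2011, 2011),
  season_categ = c("winter", "summer", "winter", "winter")
)

by_year <- toy %>%
  group_by(year) %>%
  filter(any(season_categ == "winter") & any(season_categ == "summer")) %>%
  ungroup()

by_id_year <- toy %>%
  group_by(ID, year) %>%
  filter(any(season_categ == "winter") & any(season_categ == "summer")) %>%
  ungroup()

nrow(by_year)     # 4: the year as a whole has both seasons
nrow(by_id_year)  # 2: only the rows of ID 1 survive
```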
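Here is what I tried.
- For a year that has both seasons, the mean month number (computed per year) is greater than 3 but less than 6 (given the way you created that column).
# keeps the summer rows of each ID and year whose mean season month lies between 3 and 6
df1 %>%
  group_by(ID, year) %>%
  filter(season_categ == "summer" &
           mean(month[season_categ %in% c('summer', 'winter')]) > 3 &
           mean(month[season_categ %in% c('summer', 'winter')]) < 6)
This should give you the desired result. But the problem may lie elsewhere: for example, a year may cover a different number of months per season. Months 1, 2, 6, 7, 8 contain both seasons, yet the two subsets have different numbers of rows.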
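One way to check for that kind of imbalance is to count rows per year and season. This is a minimal sketch on a hypothetical toy table with one row per month (the values are made up); it uses the same season definition as the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 2010 is missing March, so winter covers fewer months
toy <- tibble(
  year  = 2010,
  month = c(1, 2, 6, 7, 8)
) %>%
  mutate(season_categ = case_when(month %in% 6:8 ~ "summer",
                                  month %in% 1:3 ~ "winter"))

# The mean-month check passes for this year ...
mean(toy$month)  # 4.8

# ... yet the two seasons do not have the same number of rows
counts <- toy %>%
  count(year, season_categ) %>%
  pivot_wider(names_from = season_categ, values_from = n)

counts
```

Here `counts` has one row for 2010 with `summer = 3` and `winter = 2`, so both seasons are present but unevenly covered.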
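Part of this depends on whether you use meteorological or astronomical seasons (https://www.almanac.com/content/first-day-seasons), or on whether you want a season to start on a specific day rather than a specific month. Here is a suggestion that lets you do that.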
seasons <- as.Date(paste0("2021-", c("01-01", "03-01", "06-01", "09-01", "12-01")))
# Note: as.POSIXlt()$yday is 0-based (January 1 is day 0), whereas lubridate::yday() is 1-based
seasons <- as.POSIXlt(seasons)$yday
seasons <- setNames(seasons, c("winter", "spring", "summer", "fall", "winter"))
seasons
# winter spring summer fall winter
# 0 59 151 243 334
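The boundaries above come from `as.POSIXlt()$yday`, which counts days since January 1 and therefore starts at 0, while lubridate's `yday()` counts January 1 as day 1. The following base R sketch shows the one-day difference for March 1, 2021; it is only a check, and adding 1 to the boundaries is one possible way to align the two conventions if exact season edges matter.

```r
d <- as.Date("2021-03-01")

zero_based <- as.POSIXlt(d)$yday           # days since January 1
one_based  <- as.integer(format(d, "%j"))  # day of year, January 1 counted as 1

zero_based  # 59
one_based   # 60
```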
library(dplyr)
library(lubridate)
as_tibble(df) %>%
  mutate(
    yday = yday(date),
    season = names(seasons)[findInterval(yday, c(seasons, Inf))]
  ) %>%
  sample_n(10)
# # A tibble: 10 x 6
# date x y ID yday season
# <date> <dbl> <dbl> <int> <int> <chr>
# 1 2010-02-13 79471. 862415. 4 43 winter
# 2 2010-10-23 61796. 813429. 1 295 fall
# 3 2011-12-09 65958. 808064. 3 342 winter
# 4 2010-05-29 65872. 841309. 4 148 spring
# 5 2010-03-07 63789. 869548. 1 65 spring
# 6 2012-02-12 67605. 859081. 3 42 winter
# 7 2011-04-19 79034. 883803. 4 108 spring
# 8 2011-03-16 69658. 832297. 5 74 spring
# 9 2011-12-29 68793. 881267. 3 362 winter
# 10 2012-04-24 70784. 805323. 5 114 spring
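The lookup relies on `findInterval()`, which returns, for each value, the index of the interval it falls into. A small base R sketch with hypothetical day values on either side of each boundary shows how the indices map back to season names:

```r
# Same boundaries as above: 0-based day of year on which each season starts
seasons <- c(winter = 0, spring = 59, summer = 151, fall = 243, winter = 334)

# Hypothetical 0-based day-of-year values on either side of each boundary
days <- c(0, 58, 59, 150, 151, 333, 334, 364)

# findInterval() returns i such that breaks[i] <= day < breaks[i + 1]
idx <- findInterval(days, c(seasons, Inf))
labels <- names(seasons)[idx]
labels
# "winter" "winter" "spring" "spring" "summer" "fall" "winter" "winter"
```

Listing "winter" twice in the names is what lets both the start and the end of the calendar year map to the same season.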
From here, let's filter for years that contain at least some data from both seasons. Unfortunately, our data is fairly evenly distributed,
as_tibble(df) %>%
  mutate(
    year = year(date),
    yday = yday(date),
    season = names(seasons)[findInterval(yday, c(seasons, Inf))]
  ) %>%
  group_by(year) %>%
  count(season) %>%
  # id_cols is named explicitly; passing it by position is deprecated in recent tidyr
  tidyr::pivot_wider(id_cols = year, names_from = season, values_from = n)
# # A tibble: 3 x 5
# # Groups: year [3]
# year fall spring summer winter
# <dbl> <int> <int> <int> <int>
# 1 2010 91 92 92 90
# 2 2011 91 92 92 90
# 3 2012 28 92 92 58
so we cannot tell whether the filter is really doing what we want. I will artificially remove some data to test the logic:
as_tibble(df) %>%
  mutate(
    year = year(date),
    yday = yday(date),
    season = names(seasons)[findInterval(yday, c(seasons, Inf))]
  ) %>%
  filter(year > 2010 | season %in% c("fall", "winter")) %>% # artificial, for testing
  group_by(year) %>%
  count(season) %>%
  # id_cols is named explicitly; passing it by position is deprecated in recent tidyr
  tidyr::pivot_wider(id_cols = year, names_from = season, values_from = n)
# # A tibble: 3 x 5
# # Groups: year [3]
# year fall winter spring summer
# <dbl> <int> <int> <int> <int>
# 1 2010 91 90 NA NA
# 2 2011 91 90 92 92
# 3 2012 28 58 92 92
From here, we add a grouped filter:
as_tibble(df) %>%
  mutate(
    year = year(date),
    yday = yday(date),
    season = names(seasons)[findInterval(yday, c(seasons, Inf))]
  ) %>%
  filter(year > 2010 | season %in% c("fall", "winter")) %>% # artificial, for testing
  group_by(year) %>%
  filter(all(c("winter", "summer") %in% season)) %>% # this is the new line
  sample_n(10)
# # A tibble: 20 x 7
# # Groups: year [2]
# date x y ID year yday season
# <date> <dbl> <dbl> <int> <dbl> <dbl> <chr>
# 1 2011-02-13 61686. 815664. 4 2011 44 winter
# 2 2011-10-23 75448. 849477. 1 2011 296 fall
# 3 2011-07-15 75901. 840969. 1 2011 196 summer
# 4 2011-05-29 66108. 811565. 4 2011 149 spring
# 5 2011-03-07 70298. 831304. 1 2011 66 spring
# 6 2011-09-18 73951. 875712. 1 2011 261 fall
# 7 2011-08-04 64917. 860239. 1 2011 216 summer
# 8 2011-11-29 78909. 802692. 3 2011 333 fall
# 9 2011-01-07 66441. 868062. 2 2011 7 winter
# 10 2011-06-16 64583. 889124. 2 2011 167 summer
# 11 2012-05-09 78725. 862934. 5 2012 130 spring
# 12 2012-08-12 67767. 871229. 5 2012 225 summer
# 13 2012-06-28 62354. 898829. 5 2012 180 summer
# 14 2012-05-26 62373. 819059. 2 2012 147 spring
# 15 2012-06-21 68019. 896370. 3 2012 173 summer
# 16 2012-01-22 61753. 872778. 2 2012 22 winter
# 17 2012-03-18 64490. 810292. 3 2012 78 spring
# 18 2012-08-15 76048. 875765. 3 2012 228 summer
# 19 2012-09-15 65386. 885431. 4 2012 259 fall
# 20 2012-04-19 60072. 895292. 5 2012 110 spring
(No 2010, as expected.)
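The `%in%` form of the condition also behaves predictably when some rows carry no season label. The base R sketch below compares it with an `any(... == ...)` test on a hypothetical vector of labels for a single year:

```r
# Hypothetical season labels for one year: winter rows plus one unlabelled (NA) row
season <- c("winter", "winter", NA)

via_any <- any(season == "winter") & any(season == "summer")
via_in  <- all(c("winter", "summer") %in% season)

via_any  # NA: any() cannot rule out that the NA row is "summer"
via_in   # FALSE: %in% never returns NA
```

Inside `dplyr::filter()` both lead to the year being dropped, because rows for which the condition evaluates to `NA` are not kept, but the `%in%` version gives an explicit `FALSE` that is easier to reason about outside of `filter()`.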
Data
I set a seed so that the random data is reproducible:
set.seed(42)
date <- rep_len(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"), 1000)
ID <- rep(seq(1, 5), 100)
df <- data.frame(date = date,
                 x = runif(length(date), min = 60000, max = 80000),
                 y = runif(length(date), min = 800000, max = 900000),
                 ID)
head(df)
# date x y ID
# 1 2010-01-01 78296.1209 884829.322 1
# 2 2010-01-02 78741.5083 806274.633 2
# 3 2010-01-03 65722.7907 881984.509 3
# 4 2010-01-04 76608.9525 853936.029 4
# 5 2010-01-05 72834.9104 849902.010 5
# 6 2010-01-06 70381.9190 802222.732 1