Filtering uneven data sets

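I am trying to filter a data set down to two months. I want to keep each ID-year that has data for both months and drop any ID-year that lacks the matching pair.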

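For example, if an ID-year has both January and July in the data set, I want to include that ID-year in my filtered data. If an ID has only January and no July, I want to drop it rather than include it in the filtered data set. Is there a good way to do this? Note that I am not sure how to simulate an uneven data set in the example.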

After filtering to the output I want, I test it by creating a list for each seasonal month, in which each ID-year has at least 15 rows associated with it.

library(lubridate)
library(dplyr)
set.seed(12345)

df <- tibble(
  date = sample(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"), 
  1000, replace = TRUE), 
  x = runif(length(date), min = 60000, max = 80000),
  y = runif(length(date), min = 800000, max = 900000),
  ID = rep(1:5, 200),
  month = month(date),
  year  =year(date)) %>% 
  arrange(ID, date)

df %>%
  filter(month %in% c(1,7)) %>% 
  group_by(ID, year) %>% 
  mutate(complete = length(unique(month)) == 2) %>%
  group_by(ID) %>% 
  filter(all(complete)) %>%
  group_by(ID, year) 

# Creates a list for each year and by ID
summer_list <- df %>% 
  filter(month %in% 7) %>% 
  filter(n() >= 15) %>% 
  group_split(year, ID)

# Renames the names in the list to AnimalID and year
names(summer_list) <- sapply(summer_list, 
                             function(x) paste(x$ID[1], 
                                               x$year[1], sep = '_'))

# Creates a list for each year and by ID
winter_list <- df1 %>% 
  filter(month %in% 1) %>% 
  filter(n() >= 15) %>% 
  group_split(year, ID)

# Renames the names in the list to ID and year
names(winter_list) <- sapply(winter_list, 
                             function(x) paste(x$ID[1], 
                                               x$year[1], sep = '_'))

You're really close. I think your filter can be simplified to the following. Be sure to save the result back to df.

df <- df %>%
  filter(month %in% c(1,7)) %>% 
  group_by(ID, year) %>% 
  mutate(complete = length(unique(month)) == 2) %>%
  filter(complete)
  # could add "%>% select(-c(complete))" to get rid of complete
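
Since the question mentions not knowing how to simulate an uneven data set, here is a rough sketch on a tiny hand-built tibble. The name toy and its values are made up purely for illustration. It uses n_distinct(month) == 2 directly inside filter(), which I believe is equivalent to the mutate-then-filter version above.

```r
library(dplyr)

# Hypothetical toy data (not from the question):
# ID 1 has January and July in 2010, ID 2 has only January in 2010,
# ID 3 has only July in 2011.
toy <- tibble(
  ID    = c(1, 1, 1, 2, 2, 3),
  year  = c(2010, 2010, 2010, 2010, 2010, 2011),
  month = c(1, 7, 7, 1, 1, 7)
)

paired <- toy %>%
  filter(month %in% c(1, 7)) %>%
  group_by(ID, year) %>%
  filter(n_distinct(month) == 2) %>%  # keep ID-years that contain both months
  ungroup()

paired  # 3 rows, all for ID 1 in 2010
```

Only the ID-year with both months survives; the two one-sided ID-years are dropped.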

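For summer_list and winter_list, add a group_by between the filters. With the data set you provided, no group has 15 records, but I tested that it works by increasing the size of df until I got some.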

summer_list <- df %>% 
  filter(month == 7) %>%   # used == since there's only one test value
  group_by(ID, year) %>%   # added this
  filter(n() >= 15) %>%
  group_split() 
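
To illustrate why the grouping matters, here is a minimal sketch on made-up data (toy and the threshold of 3 are assumptions for the example, not values from the question). On an ungrouped tibble, n() is the total row count, so the size filter either keeps everything or drops everything. After group_by(ID, year), n() is evaluated per ID-year.

```r
library(dplyr)

# Hypothetical toy data: ID-year (1, 2010) has 3 rows, (2, 2010) has 1 row.
toy <- tibble(
  ID   = c(1, 1, 1, 2),
  year = c(2010, 2010, 2010, 2010)
)

# Ungrouped: n() is the total number of rows (4), so every row passes.
ungrouped <- toy %>%
  filter(n() >= 3)

# Grouped: n() is the row count of each ID-year, so only the 3-row group passes.
grouped <- toy %>%
  group_by(ID, year) %>%
  filter(n() >= 3) %>%
  ungroup()

nrow(ungrouped)  # 4
nrow(grouped)    # 3
```

Note that a tibble saved from a grouped pipeline keeps its grouping, so the explicit group_by may be redundant in that case, but it is harmless and makes the intent clear.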

There is also a typo where you first create winter_list: the input data is df1, but I think you want df. Hope this works!

Here is the complete code, including the larger df:

library(lubridate)
library(dplyr)
set.seed(12345)

df <- tibble(
  date = sample(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"), 
    4000, replace = TRUE), 
  x = runif(length(date), min = 60000, max = 80000),
  y = runif(length(date), min = 800000, max = 900000),
  ID = rep(1:5, 800),
  month = month(date),
  year  =year(date)) %>% 
  arrange(ID, date)

df <- df %>%
  filter(month %in% c(1,7)) %>% 
  group_by(ID, year) %>% 
  mutate(complete = length(unique(month)) == 2) %>%
  filter(complete)
  # could add "%>% select(-c(complete))" to get rid of complete

# Creates a list for each year and by ID
summer_list <- df %>% 
  filter(month == 7) %>% 
  group_by(ID, year) %>% 
  filter(n() >= 15) %>%
  group_split()

# Renames the names in the list to ID and year
names(summer_list) <- sapply(summer_list, 
  function(x) paste(x$ID[1], 
    x$year[1], sep = '_'))

# Creates a list for each year and by ID
winter_list <- df %>% 
  filter(month == 1) %>% 
  group_by(ID, year) %>% 
  filter(n() >= 15) %>% 
  group_split()

# Renames the names in the list to ID and year
names(winter_list) <- sapply(winter_list, 
  function(x) paste(x$ID[1], 
    x$year[1], sep = '_'))
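
Because the summer and winter steps differ only in the month, they could be folded into one function. This is just a sketch: split_month, target_month and min_rows are names I made up, and the toy data exists only to show the call. It uses vapply instead of sapply so the names are always a character vector, even when no group passes the size filter.

```r
library(dplyr)

# Hypothetical helper: split one month's rows into a list of ID-year tibbles,
# keeping only ID-years with at least `min_rows` rows, named "ID_year".
split_month <- function(data, target_month, min_rows = 15) {
  pieces <- data %>%
    ungroup() %>%                      # drop any grouping carried over
    filter(month == target_month) %>%
    group_by(ID, year) %>%
    filter(n() >= min_rows) %>%
    group_split()
  names(pieces) <- vapply(
    pieces,
    function(x) paste(x$ID[1], x$year[1], sep = "_"),
    character(1)
  )
  pieces
}

# Made-up data: ID 1 has two July rows in 2010, ID 2 has one.
toy <- tibble(
  ID    = c(1, 1, 1, 2, 2),
  year  = c(2010, 2010, 2010, 2010, 2010),
  month = c(1, 7, 7, 1, 7)
)

summer_toy <- split_month(toy, 7, min_rows = 2)
names(summer_toy)  # "1_2010"
```

With the real data, the calls would look like split_month(df, 7) and split_month(df, 1), using the default threshold of 15 rows.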