Filtering uneven data sets

Question

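I am trying to filter a data set down to two months. I want to keep each ID and year combination that has data for both months, and drop any ID and year combination that is missing one of the pair.

For example, if an ID and year has both January and July in the data set, I want to include that ID and year in my filtered data. If an ID and year has January but not July, I want to drop those rows so they do not appear in the filtered data set. Is there a good way to do this? Note that I was not sure how to simulate an uneven data set in the example.

After filtering to the output I want, I test it by creating a list for each seasonal month, where each ID and year has at least 15 rows associated with it.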

library(lubridate)
library(dplyr)
set.seed(12345)

df <- tibble(
  date = sample(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"), 
  1000, replace = TRUE), 
  x = runif(length(date), min = 60000, max = 80000),
  y = runif(length(date), min = 800000, max = 900000),
  ID = rep(1:5, 200),
  month = month(date),
  year  =year(date)) %>% 
  arrange(ID, date)

df %>%
  filter(month %in% c(1,7)) %>% 
  group_by(ID, year) %>% 
  mutate(complete = length(unique(month)) == 2) %>%
  group_by(ID) %>% 
  filter(all(complete)) %>%
  group_by(ID, year) 

# Creates a list for each year and by ID
summer_list <- df %>% 
  filter(month %in% 7) %>% 
  filter(n() >= 15) %>% 
  group_split(year, ID)

# Renames the names in the list to AnimalID and year
names(summer_list) <- sapply(summer_list, 
                             function(x) paste(x$ID[1], 
                                               x$year[1], sep = '_'))

# Creates a list for each year and by ID
winter_list <- df1 %>% 
  filter(month %in% 1) %>% 
  filter(n() >= 15) %>% 
  group_split(year, ID)

# Renames the names in the list to ID and year
names(winter_list) <- sapply(winter_list, 
                             function(x) paste(x$ID[1], 
                                               x$year[1], sep = '_'))

Answer 1

You were really close. I think your filter can be simplified to the following. Be sure to save the result back to df.

df <- df %>%
  filter(month %in% c(1,7)) %>% 
  group_by(ID, year) %>% 
  mutate(complete = length(unique(month)) == 2) %>%
  filter(complete)
  # could add "%>% select(-c(complete))" to get rid of complete
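
Since the question mentions not knowing how to simulate uneven data, here is a minimal sketch on a small hand-built tibble. The `toy` and `paired` names and the values in them are made up for illustration. It also swaps `length(unique(month))` for dplyr's `n_distinct(month)` and filters directly instead of creating the `complete` column, which should behave the same way.

```r
library(dplyr)

# Hypothetical toy data (not from the question):
#   ID 1 has both months in 2010
#   ID 2 has only January in 2010
#   ID 3 has only July in 2010 but both months in 2011
toy <- tibble(
  ID    = c(1, 1, 1, 2, 2, 3, 3, 3),
  year  = c(2010, 2010, 2010, 2010, 2010, 2010, 2011, 2011),
  month = c(1, 7, 7, 1, 1, 7, 1, 7)
)

paired <- toy %>%
  filter(month %in% c(1, 7)) %>%
  group_by(ID, year) %>%
  filter(n_distinct(month) == 2) %>%  # keep ID/year groups that have both months
  ungroup()

paired
```

Only the rows for ID 1 in 2010 and ID 3 in 2011 should remain, because those are the only ID and year groups that contain both January and July.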

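For summer_list and winter_list, add a group_by between the two filters. With the data set you provided, no group has 15 records, so I tested that the code works by increasing the size of df until some groups did.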

summer_list <- df %>% 
  filter(month == 7) %>%   # used == since there's only one test value
  group_by(ID, year) %>%   # added this
  filter(n() >= 15) %>%
  group_split()
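
To see why the group_by matters, here is a small sketch, assuming the input is ungrouped. The `toy`, `ungrouped` and `grouped` names are hypothetical. On an ungrouped tibble, `n()` is the total number of rows, so the `n() >= 15` test passes or fails for the whole table at once rather than per ID and year.

```r
library(dplyr)

# Hypothetical toy data: ID 1 has 16 July rows, ID 2 has only 4
toy <- tibble(
  ID    = rep(1:2, times = c(16, 4)),
  year  = 2010,
  month = 7
)

# Ungrouped: n() is 20 for every row, so nothing is dropped
ungrouped <- toy %>%
  filter(n() >= 15)

# Grouped: n() is evaluated per ID/year, so only ID 1 survives
grouped <- toy %>%
  group_by(ID, year) %>%
  filter(n() >= 15) %>%
  ungroup()

nrow(ungrouped)
nrow(grouped)
```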

You also had a typo where you first create winter_list: the input data is df1, but I think you want df. Hope this works!

Here is the complete code, including the larger df:

library(lubridate)
library(dplyr)
set.seed(12345)

df <- tibble(
  date = sample(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"), 
    4000, replace = TRUE), 
  x = runif(length(date), min = 60000, max = 80000),
  y = runif(length(date), min = 800000, max = 900000),
  ID = rep(1:5, 800),
  month = month(date),
  year  =year(date)) %>% 
  arrange(ID, date)

df <- df %>%
  filter(month %in% c(1,7)) %>% 
  group_by(ID, year) %>% 
  mutate(complete = length(unique(month)) == 2) %>%
  filter(complete)
  # could add "%>% select(-c(complete))" to get rid of complete

# Creates a list for each year and by ID
summer_list <- df %>% 
  filter(month == 7) %>% 
  group_by(ID, year) %>% 
  filter(n() >= 15) %>%
  group_split()

# Renames the names in the list to AnimalID and year
names(summer_list) <- sapply(summer_list, 
  function(x) paste(x$ID[1], 
    x$year[1], sep = '_'))

# Creates a list for each year and by ID
winter_list <- df %>% 
  filter(month == 1) %>% 
  group_by(ID, year) %>% 
  filter(n() >= 15) %>% 
  group_split()

# Renames the names in the list to ID and year
names(winter_list) <- sapply(winter_list, 
  function(x) paste(x$ID[1], 
    x$year[1], sep = '_'))
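
The summer and winter steps differ only in the month, so they could be folded into one function. This is just a sketch: `split_month`, its `m` and `min_rows` arguments, and the `demo` data are hypothetical names, not part of the code above. It uses `vapply` instead of `sapply` so the names are always a character vector.

```r
library(dplyr)

# Hypothetical helper: keep one month, drop ID/year groups with fewer than
# min_rows rows, and return a list of tibbles named "ID_year"
split_month <- function(data, m, min_rows = 15) {
  pieces <- data %>%
    ungroup() %>%                      # ignore any grouping left on the input
    filter(month == m) %>%
    group_by(ID, year) %>%
    filter(n() >= min_rows) %>%
    group_split()
  names(pieces) <- vapply(
    pieces,
    function(x) paste(x$ID[1], x$year[1], sep = "_"),
    character(1)
  )
  pieces
}

# Small made-up data set: two IDs, one year, three rows per month
demo <- tibble(
  ID    = rep(c(1, 2), each = 6),
  year  = 2010,
  month = rep(c(1, 1, 1, 7, 7, 7), times = 2)
)

winter_demo <- split_month(demo, 1, min_rows = 3)
names(winter_demo)
```

With the real data, the calls would be something like `summer_list <- split_month(df, 7)` and `winter_list <- split_month(df, 1)`.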
