Retain observations that did not occur in the previous year in a long time-series dataset in R

Question

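I have a df that looks like this:

The df covers 2009-2019 and is filtered to individuals who lived in one specific town and were aged 18-64 in the given year.

For each year, I need to keep only the people who moved into the town that year. For example, for 2010 I need to keep the difference between the 2010 population and the 2009 population. I need to do this for every year, because some people move out of the town for a few years and then return (ID 5 is an example). In the end, I want one df for each year from 2010 to 2019, so ten dfs that each contain only the people who moved into town that year.

I have tried group_by() and left_join() but could not get it to work. There must be a simple solution, but I have not found it yet.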

Answer 1

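You can use the setdiff function to perform the set(A) - set(B) operation. Split the data into one data frame per year, then loop over consecutive pairs of years and keep the IDs that appear in the later year but not in the earlier one; those are the new joiners.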
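As a quick illustration of the set difference on its own, here is a minimal sketch using two hypothetical ID vectors (the names ids_2009 and ids_2010 are made up for this illustration). Note that the argument order matters: the first argument is the year you want to keep people from.

```r
# Hypothetical IDs present in two consecutive years
ids_2009 <- c(1, 2, 3, 4, 5)
ids_2010 <- c(1, 2, 3, 5, 6, 7)

# In 2010 but not in 2009: people who moved in
moved_in <- setdiff(ids_2010, ids_2009)

# In 2009 but not in 2010: people who moved out
moved_out <- setdiff(ids_2009, ids_2010)

print(moved_in)   # 6 7
print(moved_out)  # 4
```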

Example code:

library(dplyr)
set.seed(123)  # make the sampled ages reproducible
df <- tibble(
    id = c(1, 2, 3, 4, 5,     # first year
           1, 2, 3, 5, 6, 7,  # 4 moves out; 6 and 7 move in
           2, 3, 4, 6, 7, 8), # 1 and 5 move out; 4 and 8 move in
    year = c(rep(2009, 5),
             rep(2010, 6),
             rep(2011, 6)),
    age = sample(18:64, size = 17) # extra column
)

# split into a list of data frames, one per year (ordered by year)
df_by_year <- split(df, df$year)

# one result per year after the first (here: 2010 and 2011)
df_list <- vector("list", length(df_by_year) - 1)
names(df_list) <- names(df_by_year)[-1]

for (i in seq_along(df_list)) {

    # IDs present in this year but not in the previous one
    new_joinees <- setdiff(df_by_year[[i + 1]]$id, df_by_year[[i]]$id)

    # keep only the rows for those IDs
    df_list[[i]] <- dplyr::filter(df_by_year[[i + 1]], id %in% new_joinees)

}
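Since the question mentions trying joins, the same idea can probably also be written without a loop using dplyr::anti_join. This is only a sketch, not part of the original answer: it assumes the years are consecutive integers with no gaps and that each id appears at most once per year. The helper names prev, movers and movers_by_year are made up for this illustration.

```r
library(dplyr)

# Same toy data as above, without the extra age column
df <- tibble(
    id = c(1, 2, 3, 4, 5,
           1, 2, 3, 5, 6, 7,
           2, 3, 4, 6, 7, 8),
    year = c(rep(2009, 5), rep(2010, 6), rep(2011, 6))
)

# Shift every row forward one year: a row (id, year) in prev means
# "this id was already in town the year before"
prev <- df %>% transmute(id, year = year + 1)

# Drop the first year (nothing to compare against), then keep only
# rows with no match in prev, i.e. people absent the year before
movers <- df %>%
    filter(year > min(year)) %>%
    anti_join(prev, by = c("id", "year"))

# One data frame per year, named by year
movers_by_year <- split(movers, movers$year)
```

With this approach, a year in which nobody moved in has no rows in movers and therefore gets no entry in movers_by_year, so check names(movers_by_year) if you need all ten years to be present.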

r

data-manipulation

dataframe