R: conditional aggregate based on factor level and year

I have a dataset in R that I am trying to aggregate by column level and year. It looks like this:

    City  State   Year   Status      Year_repealed   PolicyNo
    Pitt   PA     2001   InForce                        6
    Phil.  PA     2001   Repealed        2004           9
    Pitt   PA     2002   InForce                        7
    Pitt   PA     2005   InForce                        2

What I want is, for each year, the total PolicyNo per state, taking each policy's repeal year into account. The result I am looking for is:

    Year    State PolicyNo
    2001     PA     15  
    2002     PA     22
    2003     PA     22
    2004     PA     13
    2005     PA     15

I am not sure how to split and aggregate the data conditional on the repeal data, and I was wondering whether there is an easy way to do this in R.

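It may help to break this into two separate problems.

  1. Get a table showing the change in PolicyNo for each city-state-year.
  2. Summarize that table to show the PolicyNo for each state-year.

To accomplish (1), we add the missing years with an NA PolicyNo, and add each repeal as a negative PolicyNo observation in the year it takes effect.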

library(dplyr)

df = structure(list(City = c("Pitt", "Phil.", "Pitt", "Pitt"),
                    State = c("PA", "PA", "PA", "PA"),
                    Year = c(2001L, 2001L, 2002L, 2005L),
                    Status = c("InForce", "Repealed", "InForce", "InForce"),
                    Year_repealed = c(NA, 2004L, NA, NA),
                    PolicyNo = c(6L, 9L, 7L, 2L)),
               .Names = c("City", "State", "Year", "Status", "Year_repealed", "PolicyNo"),
               class = "data.frame", row.names = c(NA, -4L))

repeals = df %>%
  filter(!is.na(Year_repealed)) %>%
  mutate(Year = Year_repealed, PolicyNo = -1 * PolicyNo)
repeals
#    City State Year   Status Year_repealed PolicyNo
# 1 Phil.    PA 2004 Repealed          2004       -9

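# One row per city-state-year; stringsAsFactors = FALSE keeps City and State
# as character so they combine cleanly with df in bind_rows()
all_years = expand.grid(City = unique(df$City), State = unique(df$State),
                        Year = 2001:2005, stringsAsFactors = FALSE)

# Stack the original rows, the negative repeal rows, and the empty year grid
df = bind_rows(df, repeals, all_years)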
#     City State Year   Status Year_repealed PolicyNo
# 1   Pitt    PA 2001  InForce            NA        6
# 2  Phil.    PA 2001 Repealed          2004        9
# 3   Pitt    PA 2002  InForce            NA        7
# 4   Pitt    PA 2005  InForce            NA        2
# 5  Phil.    PA 2004 Repealed          2004       -9
# 6   Pitt    PA 2001     <NA>            NA       NA
# 7  Phil.    PA 2001     <NA>            NA       NA
# 8   Pitt    PA 2002     <NA>            NA       NA
# 9  Phil.    PA 2002     <NA>            NA       NA
# 10  Pitt    PA 2003     <NA>            NA       NA
# 11 Phil.    PA 2003     <NA>            NA       NA
# 12  Pitt    PA 2004     <NA>            NA       NA
# 13 Phil.    PA 2004     <NA>            NA       NA
# 14  Pitt    PA 2005     <NA>            NA       NA
# 15 Phil.    PA 2005     <NA>            NA       NA

Now the table has a row for every city-state-year and includes the repeals. This is a table we can summarize.

df = df %>%
  group_by(Year, State) %>%
  summarize(annual_change = sum(PolicyNo, na.rm = TRUE))
df
# Source: local data frame [5 x 3]
# Groups: Year [?]
# 
#    Year State annual_change
#   <int> <chr>         <dbl>
# 1  2001    PA            15
# 2  2002    PA             7
# 3  2003    PA             0
# 4  2004    PA            -9
# 5  2005    PA             2

This gets us the change in PolicyNo for each state-year. The cumulative sum of the changes gets us the levels.

df = df %>%
  ungroup() %>%
  mutate(PolicyNo = cumsum(annual_change))
df
# # A tibble: 5 × 4
#    Year State annual_change PolicyNo
#   <int> <chr>         <dbl>    <dbl>
# 1  2001    PA            15       15
# 2  2002    PA             7       22
# 3  2003    PA             0       22
# 4  2004    PA            -9       13
# 5  2005    PA             2       15
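The cumsum() above runs over the whole table, which works here because the data contains a single state. With several states the running total would need to restart for each one. The following is a minimal sketch of that variant, assuming a table of annual changes shaped like the one above; the "NY" rows and the names changes and levels_by_state are invented for illustration and are not part of the original data.

```r
library(dplyr)

# Hypothetical annual changes for two states. The "PA" values match the
# annual_change column above; the "NY" values are made up for illustration.
changes <- data.frame(
  Year = rep(2001:2005, times = 2),
  State = rep(c("PA", "NY"), each = 5),
  annual_change = c(15, 7, 0, -9, 2, 4, 0, 1, 0, -3)
)

levels_by_state <- changes %>%
  arrange(State, Year) %>%                      # cumsum depends on row order
  group_by(State) %>%                           # restart the total per state
  mutate(PolicyNo = cumsum(annual_change)) %>%
  ungroup()

levels_by_state
```

Sorting by Year inside each State before taking the cumulative sum matters, since cumsum() simply follows the current row order.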

With the data.table package, you could do it as follows:

library(data.table)

# dat is the original data frame from the question (unmodified)
melt(setDT(dat), 
     measure.vars = c(3,5),
     value.name = 'Year',
     value.factor = FALSE)[!is.na(Year)
                           ][variable == 'Year_repealed', PolicyNo := -1*PolicyNo
                             ][CJ(Year = min(Year):max(Year), State = State, unique = TRUE), on = .(Year, State)
                               ][is.na(PolicyNo), PolicyNo := 0
                                 ][, .(PolicyNo = sum(PolicyNo)), by = .(Year, State)
                                   ][, .(Year, State, PolicyNo = cumsum(PolicyNo))]

The result of the above code:

   Year State PolicyNo
1: 2001    PA       15
2: 2002    PA       22
3: 2003    PA       22
4: 2004    PA       13
5: 2005    PA       15

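As you can see, it takes several steps to reach the desired end result:

  • First convert to a data.table (setDT(dat)), reshape it into long format, and remove the rows without a Year.
  • Then make the PolicyNo negative for the rows that come from 'Year_repealed'.
  • With a cross join (CJ) you make sure that every year is present for each state, and you convert the NA values in the PolicyNo column to zero.
  • Finally, you sum by year and state, and then take the cumulative sum of the result.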
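The same steps can also be written one at a time with intermediate objects, which makes each stage easier to inspect. This is only a sketch of the chain above, not a replacement for it: the names long, grid, full and annual, and the column names Event and EventYear, are introduced here for illustration, and the running total is taken per state as an assumption about how multi-state data should behave.

```r
library(data.table)

dat <- data.table(
  City = c("Pitt", "Phil.", "Pitt", "Pitt"),
  State = "PA",
  Year = c(2001L, 2001L, 2002L, 2005L),
  Status = c("InForce", "Repealed", "InForce", "InForce"),
  Year_repealed = c(NA, 2004L, NA, NA),
  PolicyNo = c(6L, 9L, 7L, 2L)
)

# 1. Long format: one row per event (enactment or repeal), dropping
#    the rows that have no year, i.e. policies that were never repealed
long <- melt(dat, id.vars = c("State", "PolicyNo"),
             measure.vars = c("Year", "Year_repealed"),
             variable.name = "Event", value.name = "EventYear")
long <- long[!is.na(EventYear)]

# 2. A repeal removes policies, so its count becomes negative
long[Event == "Year_repealed", PolicyNo := -PolicyNo]

# 3. Cross join so that every state has a row for every year,
#    then treat years without any event as a change of zero
grid <- CJ(EventYear = min(long$EventYear):max(long$EventYear),
           State = unique(long$State))
full <- long[grid, on = .(EventYear, State)]
full[is.na(PolicyNo), PolicyNo := 0L]

# 4. Net change per state-year, then a running total within each state
annual <- full[, .(annual_change = sum(PolicyNo)),
               by = .(Year = EventYear, State)]
annual[, PolicyNo := cumsum(annual_change), by = State]

annual
```

Keeping annual_change alongside PolicyNo makes it easy to see which years contributed an increase, a repeal, or nothing at all.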