R:基于因子水平和年份的条件聚合
R: conditional aggregate based on factor level and year
我在 R 中有一个数据集,我试图按列级别和年份进行聚合,如下所示:
City State Year Status Year_repealed PolicyNo
Pitt PA 2001 InForce 6
Phil. PA 2001 Repealed 2004 9
Pitt PA 2002 InForce 7
Pitt PA 2005 InForce 2
我想创建的是,对于每一年,我在考虑政策废除日期的情况下汇总各州的 PolicyNo。我得到的结果是:
Year State PolicyNo
2001 PA 15
2002 PA 22
2003 PA 22
2004 PA 12
2005 PA 14
我不确定如何以废除数据为条件拆分和聚合数据,并且想知道是否有一种方法可以轻松地实现这一点是 R。
这可能会帮助您将其分解为两个不同的问题。
- 得到一个 table 显示每个城邦年份中 PolicyNo 的变化。
- 总结 table 以显示每个州年的保单编号。
为了完成 (1),我们使用 NA
PolicyNo 添加缺失的年份,并将废除添加为负面 PolicyNo
观察。
library(dplyr)
df = structure(list(City = c("Pitt", "Phil.", "Pitt", "Pitt"), State = c("PA", "PA", "PA", "PA"), Year = c(2001L, 2001L, 2002L, 2005L), Status = c("InForce", "Repealed", "InForce", "InForce"), Year_repealed = c(NA, 2004L, NA, NA), PolicyNo = c(6L, 9L, 7L, 2L)), .Names = c("City", "State", "Year", "Status", "Year_repealed", "PolicyNo"), class = "data.frame", row.names = c(NA, -4L))
repeals = df %>%
filter(!is.na(Year_repealed)) %>%
mutate(Year = Year_repealed, PolicyNo = -1 * PolicyNo)
repeals
# City State Year Status Year_repealed PolicyNo
# 1 Phil. PA 2004 Repealed 2004 -9
all_years = expand.grid(City = unique(df$City), State = unique(df$State),
Year = 2001:2005)
df = bind_rows(df, repeals, all_years)
# City State Year Status Year_repealed PolicyNo
# 1 Pitt PA 2001 InForce NA 6
# 2 Phil. PA 2001 Repealed 2004 9
# 3 Pitt PA 2002 InForce NA 7
# 4 Pitt PA 2005 InForce NA 2
# 5 Phil. PA 2004 Repealed 2004 -9
# 6 Pitt PA 2001 <NA> NA NA
# 7 Phil. PA 2001 <NA> NA NA
# 8 Pitt PA 2002 <NA> NA NA
# 9 Phil. PA 2002 <NA> NA NA
# 10 Pitt PA 2003 <NA> NA NA
# 11 Phil. PA 2003 <NA> NA NA
# 12 Pitt PA 2004 <NA> NA NA
# 13 Phil. PA 2004 <NA> NA NA
# 14 Pitt PA 2005 <NA> NA NA
# 15 Phil. PA 2005 <NA> NA NA
现在 table 显示每个城邦年份并包含废除。这是我们可以总结的table
df = df %>%
group_by(Year, State) %>%
summarize(annual_change = sum(PolicyNo, na.rm = TRUE))
df
# Source: local data frame [5 x 3]
# Groups: Year [?]
#
# Year State annual_change
# <int> <chr> <dbl>
# 1 2001 PA 15
# 2 2002 PA 7
# 3 2003 PA 0
# 4 2004 PA -9
# 5 2005 PA 2
这让我们的政策在每个州年都没有变化。变化的累积总和使我们水平。
df = df %>%
ungroup() %>%
mutate(PolicyNo = cumsum(annual_change))
df
# # A tibble: 5 × 4
# Year State annual_change PolicyNo
# <int> <chr> <dbl> <dbl>
# 1 2001 PA 15 15
# 2 2002 PA 7 22
# 3 2003 PA 0 22
# 4 2004 PA -9 13
# 5 2005 PA 2 15
使用 data.table
包,您可以按如下方式进行:
melt(setDT(dat),
measure.vars = c(3,5),
value.name = 'Year',
value.factor = FALSE)[!is.na(Year)
][variable == 'Year_repealed', PolicyNo := -1*PolicyNo
][CJ(Year = min(Year):max(Year), State = State, unique = TRUE), on = .(Year, State)
][is.na(PolicyNo), PolicyNo := 0
][, .(PolicyNo = sum(PolicyNo)), by = .(Year, State)
][, .(Year, State, PolicyNo = cumsum(PolicyNo))]
以上代码的结果:
Year State PolicyNo
1: 2001 PA 15
2: 2002 PA 22
3: 2003 PA 22
4: 2004 PA 13
5: 2005 PA 15
如您所见,需要几个步骤才能达到预期的最终结果:
- 首先转换为 data.table (
setDT(dat)
) 并将其重塑为长格式并删除没有 Year
的行
- 然后将具有
'Year_repealed'
的行的值设置为负值。
- 通过交叉连接 (
CJ
),您可以确保每个州的所有年份都存在,并将 PolicyNo
列中的 NA
值转换为零。
- 最后,你按年份汇总,然后对结果进行累加。
我在 R 中有一个数据集,我试图按列级别和年份进行聚合,如下所示:
City State Year Status Year_repealed PolicyNo
Pitt PA 2001 InForce 6
Phil. PA 2001 Repealed 2004 9
Pitt PA 2002 InForce 7
Pitt PA 2005 InForce 2
我想创建的是,对于每一年,我在考虑政策废除日期的情况下汇总各州的 PolicyNo。我得到的结果是:
Year State PolicyNo
2001 PA 15
2002 PA 22
2003 PA 22
2004 PA 12
2005 PA 14
我不确定如何以废除数据为条件拆分和聚合数据,并且想知道是否有一种方法可以轻松地实现这一点是 R。
这可能会帮助您将其分解为两个不同的问题。
- 得到一个 table 显示每个城邦年份中 PolicyNo 的变化。
- 总结 table 以显示每个州年的保单编号。
为了完成 (1),我们使用 NA
PolicyNo 添加缺失的年份,并将废除添加为负面 PolicyNo
观察。
library(dplyr)
df = structure(list(City = c("Pitt", "Phil.", "Pitt", "Pitt"), State = c("PA", "PA", "PA", "PA"), Year = c(2001L, 2001L, 2002L, 2005L), Status = c("InForce", "Repealed", "InForce", "InForce"), Year_repealed = c(NA, 2004L, NA, NA), PolicyNo = c(6L, 9L, 7L, 2L)), .Names = c("City", "State", "Year", "Status", "Year_repealed", "PolicyNo"), class = "data.frame", row.names = c(NA, -4L))
repeals = df %>%
filter(!is.na(Year_repealed)) %>%
mutate(Year = Year_repealed, PolicyNo = -1 * PolicyNo)
repeals
# City State Year Status Year_repealed PolicyNo
# 1 Phil. PA 2004 Repealed 2004 -9
all_years = expand.grid(City = unique(df$City), State = unique(df$State),
Year = 2001:2005)
df = bind_rows(df, repeals, all_years)
# City State Year Status Year_repealed PolicyNo
# 1 Pitt PA 2001 InForce NA 6
# 2 Phil. PA 2001 Repealed 2004 9
# 3 Pitt PA 2002 InForce NA 7
# 4 Pitt PA 2005 InForce NA 2
# 5 Phil. PA 2004 Repealed 2004 -9
# 6 Pitt PA 2001 <NA> NA NA
# 7 Phil. PA 2001 <NA> NA NA
# 8 Pitt PA 2002 <NA> NA NA
# 9 Phil. PA 2002 <NA> NA NA
# 10 Pitt PA 2003 <NA> NA NA
# 11 Phil. PA 2003 <NA> NA NA
# 12 Pitt PA 2004 <NA> NA NA
# 13 Phil. PA 2004 <NA> NA NA
# 14 Pitt PA 2005 <NA> NA NA
# 15 Phil. PA 2005 <NA> NA NA
现在 table 显示每个城邦年份并包含废除。这是我们可以总结的table
df = df %>%
group_by(Year, State) %>%
summarize(annual_change = sum(PolicyNo, na.rm = TRUE))
df
# Source: local data frame [5 x 3]
# Groups: Year [?]
#
# Year State annual_change
# <int> <chr> <dbl>
# 1 2001 PA 15
# 2 2002 PA 7
# 3 2003 PA 0
# 4 2004 PA -9
# 5 2005 PA 2
这让我们的政策在每个州年都没有变化。变化的累积总和使我们水平。
df = df %>%
ungroup() %>%
mutate(PolicyNo = cumsum(annual_change))
df
# # A tibble: 5 × 4
# Year State annual_change PolicyNo
# <int> <chr> <dbl> <dbl>
# 1 2001 PA 15 15
# 2 2002 PA 7 22
# 3 2003 PA 0 22
# 4 2004 PA -9 13
# 5 2005 PA 2 15
使用 data.table
包,您可以按如下方式进行:
melt(setDT(dat),
measure.vars = c(3,5),
value.name = 'Year',
value.factor = FALSE)[!is.na(Year)
][variable == 'Year_repealed', PolicyNo := -1*PolicyNo
][CJ(Year = min(Year):max(Year), State = State, unique = TRUE), on = .(Year, State)
][is.na(PolicyNo), PolicyNo := 0
][, .(PolicyNo = sum(PolicyNo)), by = .(Year, State)
][, .(Year, State, PolicyNo = cumsum(PolicyNo))]
以上代码的结果:
Year State PolicyNo
1: 2001 PA 15
2: 2002 PA 22
3: 2003 PA 22
4: 2004 PA 13
5: 2005 PA 15
如您所见,需要几个步骤才能达到预期的最终结果:
- 首先转换为 data.table (
setDT(dat)
) 并将其重塑为长格式并删除没有Year
的行
- 然后将具有
'Year_repealed'
的行的值设置为负值。 - 通过交叉连接 (
CJ
),您可以确保每个州的所有年份都存在,并将PolicyNo
列中的NA
值转换为零。 - 最后,你按年份汇总,然后对结果进行累加。