R条件查找和求和
R conditional lookup and sum
我有关于大学课程完成情况的数据,估计每个队列的学生人数在 1、2、3、... 7 年后完成。我想用这些估计来计算任何一年从每个学院和课程输出的学生总数。
给定年份学生的输出将是前 7 个队列在 1、2、3、... 7 年后输出的总和。
例如COLLEGE 1, COURSE A 2014年输出的学生人数等于:
Output of 2013 cohort (College 1, Course A) after 1 year +
Output of 2012 cohort (College 1, Course A) after 2 years +
Output of 2011 cohort (College 1, Course A) after 3 years +
Output of 2010 cohort (College 1, Course A) after 4 years +
Output of 2009 cohort (College 1, Course A) after 5 years +
Output of 2008 cohort (College 1, Course A) after 6 years +
Output of 2007 cohort (College 1, Course A) after 7 years +
所以有两个数据框:一个包含所有输出估计值的查找 table,以及我正在尝试修改的较小摘要 table。我想更新 dummy.summary$output,对于每一行,基于上述计算的总输出。
以下代码可以很好地复制我的数据
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
# Summary table to be modified
dummy.summary <- aggregate(x = dummy.lookup["intake"], by = list(dummy.lookup$cohort, dummy.lookup$college, dummy.lookup$course), FUN = mean)
names(dummy.summary)[1:3] <- c("year", "college", "course")
dummy.summary <- dummy.summary[order(dummy.summary$year, dummy.summary$college, dummy.summary$course), ]
dummy.summary$output <- 0
以下代码不起作用,但展示了我一直在尝试的方法。
dummy.summary$output <- sapply(dummy.summary$output, function(x){
# empty vector to fill with output values
vec <- c()
# Find relevant output for college + course, from each cohort and exit year
for(j in 1:7){
append(x = vec,
values = dummy.lookup[dummy.lookup$college==dummy.summary[x, "college"] &
dummy.lookup$course==dummy.summary[x, "course"] &
dummy.lookup$cohort==dummy.summary[x, "year"]-j &
dummy.lookup$output.year==j, "output"])
}
# Sum and return total output
sum_vec <- sum(vec)
return(sum_vec)
}
)
我猜它不起作用,因为我希望在匿名函数中使用 'x' 来索引 dummy.summary 数据帧的特定值。但这显然没有发生,并且每行只返回零,大概是因为 'x' 的起始值每次都是零。我不知道是否可以访问 sapply 循环的每个值的 索引位置 ,并使用它来索引我的摘要数据框。
这种方法是否可以修复,还是我需要一种完全不同的方法?
即使它是可以修复的,是否有更多 elegant/faster 方法来实现我想要做的事情?
感谢期待。
我刚刚将您的 output.year
更新为 output.year2
,它不是从 1 到 7 的值,而是根据您拥有的 cohort
获得年份值。
我发现你要的output
信息对应的是output.year
,但是你要的intake
信息对应的是cohort
。所以,我分别计算它们,然后加入tables/information。这会自动创建空的(我转换为 0 的 NA)output
1998 年的信息。
# fix your random sampling
set.seed(24)
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
dummy.lookup$output[dummy.lookup$yr %in% 1:2] <- 0
library(dplyr)
# create result table for output info
dt_output =
dummy.lookup %>%
mutate(output.year2 = output.year+cohort) %>% # update output.year to get a year value
group_by(output.year2, college, course) %>% # for each output year, college, course
summarise(SumOutput = sum(output)) %>% # calculate sum of intake
ungroup() %>%
arrange(college,course,output.year2) %>% # for visualisation purposes
rename(cohort = output.year2) # rename column
# create result for intake info
dt_intake =
dummy.lookup %>%
select(cohort, college, course, intake) %>% # select useful columns
distinct() # keep distinct rows/values
# join info
dt_intake %>%
full_join(dt_output, by=c("cohort","college","course")) %>%
mutate(SumOutput = ifelse(is.na(SumOutput),0,SumOutput)) %>%
arrange(college,course,cohort) %>% # for visualisation purposes
tbl_df() # for printing purposes
# Source: local data frame [720 x 5]
#
# cohort college course intake SumOutput
# (int) (fctr) (fctr) (int) (dbl)
# 1 1998 College 1 Course A 194 0
# 2 1999 College 1 Course A 198 11
# 3 2000 College 1 Course A 223 29
# 4 2001 College 1 Course A 198 45
# 5 2002 College 1 Course A 289 62
# 6 2003 College 1 Course A 163 78
# 7 2004 College 1 Course A 211 74
# 8 2005 College 1 Course A 181 108
# 9 2006 College 1 Course A 277 101
# 10 2007 College 1 Course A 157 109
# .. ... ... ... ... ...
我有关于大学课程完成情况的数据,估计每个队列的学生人数在 1、2、3、... 7 年后完成。我想用这些估计来计算任何一年从每个学院和课程输出的学生总数。
给定年份学生的输出将是前 7 个队列在 1、2、3、... 7 年后输出的总和。
例如COLLEGE 1, COURSE A 2014年输出的学生人数等于:
Output of 2013 cohort (College 1, Course A) after 1 year +
Output of 2012 cohort (College 1, Course A) after 2 years +
Output of 2011 cohort (College 1, Course A) after 3 years +
Output of 2010 cohort (College 1, Course A) after 4 years +
Output of 2009 cohort (College 1, Course A) after 5 years +
Output of 2008 cohort (College 1, Course A) after 6 years +
Output of 2007 cohort (College 1, Course A) after 7 years +
所以有两个数据框:一个包含所有输出估计值的查找 table,以及我正在尝试修改的较小摘要 table。我想更新 dummy.summary$output,对于每一行,基于上述计算的总输出。
以下代码可以很好地复制我的数据
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
# Summary table to be modified
dummy.summary <- aggregate(x = dummy.lookup["intake"], by = list(dummy.lookup$cohort, dummy.lookup$college, dummy.lookup$course), FUN = mean)
names(dummy.summary)[1:3] <- c("year", "college", "course")
dummy.summary <- dummy.summary[order(dummy.summary$year, dummy.summary$college, dummy.summary$course), ]
dummy.summary$output <- 0
以下代码不起作用,但展示了我一直在尝试的方法。
dummy.summary$output <- sapply(dummy.summary$output, function(x){
# empty vector to fill with output values
vec <- c()
# Find relevant output for college + course, from each cohort and exit year
for(j in 1:7){
append(x = vec,
values = dummy.lookup[dummy.lookup$college==dummy.summary[x, "college"] &
dummy.lookup$course==dummy.summary[x, "course"] &
dummy.lookup$cohort==dummy.summary[x, "year"]-j &
dummy.lookup$output.year==j, "output"])
}
# Sum and return total output
sum_vec <- sum(vec)
return(sum_vec)
}
)
我猜它不起作用,因为我希望在匿名函数中使用 'x' 来索引 dummy.summary 数据帧的特定值。但这显然没有发生,并且每行只返回零,大概是因为 'x' 的起始值每次都是零。我不知道是否可以访问 sapply 循环的每个值的 索引位置 ,并使用它来索引我的摘要数据框。
这种方法是否可以修复,还是我需要一种完全不同的方法?
即使它是可以修复的,是否有更多 elegant/faster 方法来实现我想要做的事情?
感谢期待。
我刚刚将您的 output.year
更新为 output.year2
,它不是从 1 到 7 的值,而是根据您拥有的 cohort
获得年份值。
我发现你要的output
信息对应的是output.year
,但是你要的intake
信息对应的是cohort
。所以,我分别计算它们,然后加入tables/information。这会自动创建空的(我转换为 0 的 NA)output
1998 年的信息。
# fix your random sampling
set.seed(24)
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
dummy.lookup$output[dummy.lookup$yr %in% 1:2] <- 0
library(dplyr)
# create result table for output info
dt_output =
dummy.lookup %>%
mutate(output.year2 = output.year+cohort) %>% # update output.year to get a year value
group_by(output.year2, college, course) %>% # for each output year, college, course
summarise(SumOutput = sum(output)) %>% # calculate sum of intake
ungroup() %>%
arrange(college,course,output.year2) %>% # for visualisation purposes
rename(cohort = output.year2) # rename column
# create result for intake info
dt_intake =
dummy.lookup %>%
select(cohort, college, course, intake) %>% # select useful columns
distinct() # keep distinct rows/values
# join info
dt_intake %>%
full_join(dt_output, by=c("cohort","college","course")) %>%
mutate(SumOutput = ifelse(is.na(SumOutput),0,SumOutput)) %>%
arrange(college,course,cohort) %>% # for visualisation purposes
tbl_df() # for printing purposes
# Source: local data frame [720 x 5]
#
# cohort college course intake SumOutput
# (int) (fctr) (fctr) (int) (dbl)
# 1 1998 College 1 Course A 194 0
# 2 1999 College 1 Course A 198 11
# 3 2000 College 1 Course A 223 29
# 4 2001 College 1 Course A 198 45
# 5 2002 College 1 Course A 289 62
# 6 2003 College 1 Course A 163 78
# 7 2004 College 1 Course A 211 74
# 8 2005 College 1 Course A 181 108
# 9 2006 College 1 Course A 277 101
# 10 2007 College 1 Course A 157 109
# .. ... ... ... ... ...