R conditional lookup and sum

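I have data on university course completions, with estimates of how many students from each cohort complete after 1, 2, 3, ... 7 years. I want to use these estimates to calculate the total number of students output from each college and course in any given year.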

The output of students in a given year is the sum of the outputs from the previous 7 cohorts after 1, 2, 3, ... 7 years respectively.

For example, the number of students output by College 1, Course A in 2014 equals:

Output of 2013 cohort (College 1, Course A) after 1 year +
Output of 2012 cohort (College 1, Course A) after 2 years +
Output of 2011 cohort (College 1, Course A) after 3 years +
Output of 2010 cohort (College 1, Course A) after 4 years +
Output of 2009 cohort (College 1, Course A) after 5 years +
Output of 2008 cohort (College 1, Course A) after 6 years +
Output of 2007 cohort (College 1, Course A) after 7 years

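So there are two data frames: a lookup table containing all of the output estimates, and a smaller summary table that I am trying to modify. I want to update dummy.summary$output, for each row, with the total output calculated as above.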

The following code reproduces the structure of my data:

# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
           college = rep(rep(paste("College", 1:6), each = 35), 17),
           course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
           intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
           output.year = rep(1:7, 510),
           output = sample(x = 10:20, size = 3570, replace=TRUE))


# Summary table to be modified
dummy.summary <- aggregate(x = dummy.lookup["intake"], by = list(dummy.lookup$cohort, dummy.lookup$college, dummy.lookup$course), FUN = mean)
names(dummy.summary)[1:3] <- c("year", "college", "course")
dummy.summary <- dummy.summary[order(dummy.summary$year, dummy.summary$college, dummy.summary$course), ]
dummy.summary$output <- 0

The following code doesn't work, but it shows the approach I have been trying.

dummy.summary$output <- sapply(dummy.summary$output, function(x){

    # empty vector to fill with output values
    vec <- c()

    # Find relevant output for college + course, from each cohort and exit year
    for(j in 1:7){

      append(x = vec,
             values = dummy.lookup[dummy.lookup$college==dummy.summary[x, "college"] &
                                     dummy.lookup$course==dummy.summary[x, "course"] &
                                     dummy.lookup$cohort==dummy.summary[x, "year"]-j &
                                     dummy.lookup$output.year==j, "output"])

    }

    # Sum and return total output
    sum_vec <- sum(vec)

    return(sum_vec)

  }
    )

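I guess it doesn't work because I was hoping to use 'x' inside the anonymous function to index specific values of the dummy.summary data frame. That clearly isn't happening, and every row just returns zero, presumably because the starting value of 'x' is zero each time. I don't know whether it is possible to access the index position of each value that sapply loops over and use it to index my summary data frame.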

Can this approach be fixed, or do I need a completely different one?

Even if it can be fixed, is there a more elegant/faster way to achieve what I'm trying to do?

Thanks in anticipation.

I've just converted your output.year into output.year2, which, instead of values from 1 to 7, holds a calendar year derived from the cohort (cohort + output.year).

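I noticed that the output information you want is tied to output.year, whereas the intake information is tied to cohort. So I calculate them separately and then join the two tables. The join automatically creates empty output values (NA, which I convert to 0) for 1998, since no earlier cohort exists to produce output in that year.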
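To make the idea concrete before applying it to the full data, here is a minimal base R sketch on a made-up toy table (one college and course, three cohorts, output after 1 and 2 years only). The names `toy`, `exit.year` and `toy.summary` are illustrative and do not appear in the question.

```r
# Toy lookup table: all values are invented for illustration
toy <- data.frame(cohort      = rep(2010:2012, each = 2),
                  output.year = rep(1:2, times = 3),
                  output      = c(10, 20, 11, 21, 12, 22))

# Calendar year in which each group of students leaves
toy$exit.year <- toy$cohort + toy$output.year

# Total output per calendar year: each year collects the
# contributions of every cohort that exits in that year
toy.summary <- aggregate(output ~ exit.year, data = toy, FUN = sum)
toy.summary
```

For instance, 2012 collects the 2010 cohort after 2 years (20) and the 2011 cohort after 1 year (11).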

# fix your random sampling
set.seed(24)  

# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
                           college = rep(rep(paste("College", 1:6), each = 35), 17),
                           course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
                           intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
                           output.year = rep(1:7, 510),
                           output = sample(x = 10:20, size = 3570, replace=TRUE))


library(dplyr)


# create result table for output info
dt_output = 
  dummy.lookup %>%
  mutate(output.year2 = output.year+cohort) %>%     # update output.year to get a year value
  group_by(output.year2, college, course) %>%       # for each output year, college, course
  summarise(SumOutput = sum(output)) %>%            # calculate sum of output
  ungroup() %>%
  arrange(college,course,output.year2) %>%          # for visualisation purposes
  rename(cohort = output.year2)                     # rename column


# create result for intake info
dt_intake =
  dummy.lookup %>%
  select(cohort, college, course, intake) %>%     # select useful columns
  distinct()                                      # keep distinct rows/values


# join info
dt_intake %>% 
  full_join(dt_output, by=c("cohort","college","course")) %>%
  mutate(SumOutput = ifelse(is.na(SumOutput),0,SumOutput)) %>%
  arrange(college,course,cohort) %>%     # for visualisation purposes
  as_tibble()    # for printing purposes


# Source: local data frame [720 x 5]
# 
# cohort   college   course intake SumOutput
# (int)    (fctr)   (fctr)  (int)     (dbl)
# 1    1998 College 1 Course A    194         0
# 2    1999 College 1 Course A    198        11
# 3    2000 College 1 Course A    223        29
# 4    2001 College 1 Course A    198        45
# 5    2002 College 1 Course A    289        62
# 6    2003 College 1 Course A    163        78
# 7    2004 College 1 Course A    211        74
# 8    2005 College 1 Course A    181       108
# 9    2006 College 1 Course A    277       101
# 10   2007 College 1 Course A    157       109
# ..    ...       ...      ...    ...       ...
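
On whether the original sapply approach can be repaired: it probably can. Two things appear to go wrong in it. sapply iterates over the values of dummy.summary$output (all zeros), so 'x' is never a row index, and the result of append() is never assigned back to vec. A minimal sketch of a fix, iterating over row indices instead, is shown below. The helper name `total_output` and the small demo tables are hypothetical and only for illustration.

```r
# Hypothetical helper (not from the question): for each row of `summary`,
# sum the output of every cohort whose exit year matches that row's year.
total_output <- function(summary, lookup, n.years = 7) {
  sapply(seq_len(nrow(summary)), function(i) {
    # as.character() avoids problems when the columns are factors
    hit <- as.character(lookup$college) == as.character(summary$college[i]) &
           as.character(lookup$course)  == as.character(summary$course[i]) &
           lookup$output.year <= n.years &
           lookup$cohort + lookup$output.year == summary$year[i]
    sum(lookup$output[hit])
  })
}

# Small made-up example: one college/course, output after 1 and 2 years
lookup <- data.frame(cohort      = rep(2010:2012, each = 2),
                     college     = "College 1",
                     course      = "Course A",
                     output.year = rep(1:2, times = 3),
                     output      = c(10, 20, 11, 21, 12, 22))

summary.tbl <- data.frame(year    = 2011:2013,
                          college = "College 1",
                          course  = "Course A")

summary.tbl$output <- total_output(summary.tbl, lookup)
summary.tbl
```

With the question's data the call would be `dummy.summary$output <- total_output(dummy.summary, dummy.lookup)`. This scans the whole lookup table once per summary row, so the grouped sum above is likely to be faster on large data.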