Using conditions in group_by()/summarize() loop

Question

I have a data frame that looks like this (I have many years and variables):

Name    State2014     State2015  State2016  Tuition2014   Tuition2015  Tuition2016  StateGrants2014
Jared   CA            CA         MA         22430         23060        40650        5000
Beth    CA            CA         CA         36400         37050        37180        4200
Steven  MA            MA         MA         18010         18250        18720        NA
Lary    MA            CA         MA         24080         30800        24600        6600
Tom     MA            OR         OR         40450         15800        16040        NA
Alfred  OR            OR         OR         23570         23680        23750        3500
Cathy   OR            OR         OR         32070         32070        33040        4700

My objective (in this example) is to get the mean tuition for each state, and the sum of state grants for each state. My idea is to subset the data by year:

State2014     Tuition2014   StateGrants2014
CA            22430         5000
CA            36400         4200
MA            18010         NA
MA            24080         6600
MA            40450         NA
OR            23570         3500
OR            32070         4700

State2015  Tuition2015  
CA         23060        
CA         37050        
MA         18250        
CA         30800        
OR         15800        
OR         23680        
OR         32070       

State2016  Tuition2016  
MA         40650        
CA         37180        
MA         18720        
MA         24600        
OR         16040        
OR         23750        
OR         33040

Then I would group_by state and summarize (saving each result as a separate df) to get the following:

State2014     Tuition2014   StateGrants2014
CA            29415         9200
MA            27513         6600
OR            27820         8200

State2015  Tuition2015  
CA         30303        
MA         18250        
OR         23850    

State2016  Tuition2016  
CA         37180        
MA         27990        
OR         24277

Then I would merge by state. Here is my code:

years = c(2014,2015,2016)
for (i in seq_along(years)) {
  #grab the variables from a certain year and save as a new df.
  df_year <- df[, grep(paste(years[[i]],"$",sep=""), colnames(df))]

  #Take off the year from each variable name (to make it easier to summarize)
  names(df_year) <- gsub(years[[i]], "", names(df_year), fixed = TRUE)

  df_year <- df_year %>%
    group_by(state) %>%
    summarize(Tuition = mean(Tuition, na.rm = TRUE),
            #this part of the code does not work. In this example, I only want to have this part if the year is 2016.
              if (years[[i]]=='2016')
                {Stategrant = mean(Stategrant, na.rm = TRUE)})

  #rename df_year to df####
  assign(paste("df",years[[i]],sep=''),df_year)
}

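I have about 50 years of data and a large number of variables, so I would like to use a loop. So my question is: how do I add a conditional statement inside the group_by()/summarize() call (summarizing certain variables only for certain years)? Thanks!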

*Edit: I realize I could take the if{} out of the function and do something like this:

  if (years[[i]]==2016){
      df_year <- df_year %>%
        group_by(state) %>%
        summarize(Tuition = mean(Tuition, na.rm = TRUE),
            Stategrant = mean(Stategrant, na.rm = TRUE))

      #rename df_year to df####
      assign(paste("df",years[[i]],sep=''),df_year)
  }

  else{
        df_year <- df_year %>%
            group_by(state) %>%
            summarize(Tuition = mean(Tuition, na.rm = TRUE))

          #rename df_year to df####
          assign(paste("df",years[[i]],sep=''),df_year)
  }
}

But there are so many combinations of variables that a for loop written this way would be neither efficient nor very useful.

Answer 1

Working with tidy data is much easier, so let me show you how to tidy your data. See http://r4ds.had.co.nz/tidy-data.html.

library(tidyr)
library(dplyr)

df <- gather(df, key, value, -Name) %>% 
  # separate years from the variables
  separate(key, c("var", "year"), sep = -5) %>% 
  # the above line splits up e.g. State2014 into State and 2014 by
  # cutting off the four-digit year at the end of the name.
  # Note: this position applies to tidyr < 1.0.0; from tidyr 1.0.0 on,
  # negative positions count from the far right, so use sep = -4 there.
  # Please check that this works for your other variables
  # in case your naming conventions are inconsistent.
  spread(var, value) %>% 
  # turn numbers back to numeric
  mutate_at(c("Tuition", "StateGrants"), as.numeric) %>% 
  gather(var, val, -Name, -year, -State) %>% 
  # group by the variables of interest. Note that `var` here 
  # refers to Tuition and StateGrants. If you have more variables,
  # they will be included here as well. If you want to exclude more
  # variables from being included here in `var`, add more "-colName" 
  # entries in the `gather` statement above
  group_by(year, State, var) %>% 
  # summarize:
  summarise(mean_values = mean(val))

This gives you:

Source: local data frame [18 x 4]
Groups: year, State [?]
    year State         var mean_values
   <chr> <chr>       <chr>       <dbl>
1   2014    CA StateGrants     4600.00
2   2014    CA     Tuition    29415.00
3   2014    MA StateGrants          NA
4   2014    MA     Tuition    27513.33
5   2014    OR StateGrants     4100.00
6   2014    OR     Tuition    27820.00
7   2015    CA StateGrants          NA
8   2015    CA     Tuition    30303.33
9   2015    MA StateGrants          NA
10  2015    MA     Tuition    18250.00
11  2015    OR StateGrants          NA
12  2015    OR     Tuition    23850.00
13  2016    CA StateGrants          NA
14  2016    CA     Tuition    37180.00
15  2016    MA StateGrants          NA
16  2016    MA     Tuition    27990.00
17  2016    OR StateGrants          NA
18  2016    OR     Tuition    24276.67

If you don't like this shape, you can add %>% spread(var, mean_values) after the summarise statement to get the means of Tuition and StateGrants in separate columns.
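For readers on newer package versions, the same reshape-then-summarise idea can be sketched with pivot_longer(). This is only an illustration under some assumptions: it needs tidyr >= 1.0.0 and dplyr >= 1.0.0, it rebuilds the sample data from the question, and the names `tidy` and `means_wide` are made up for the example.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question.
df <- data.frame(
  Name = c("Jared", "Beth", "Steven", "Lary", "Tom", "Alfred", "Cathy"),
  State2014 = c("CA", "CA", "MA", "MA", "MA", "OR", "OR"),
  State2015 = c("CA", "CA", "MA", "CA", "OR", "OR", "OR"),
  State2016 = c("MA", "CA", "MA", "MA", "OR", "OR", "OR"),
  Tuition2014 = c(22430, 36400, 18010, 24080, 40450, 23570, 32070),
  Tuition2015 = c(23060, 37050, 18250, 30800, 15800, 23680, 32070),
  Tuition2016 = c(40650, 37180, 18720, 24600, 16040, 23750, 33040),
  StateGrants2014 = c(5000, 4200, NA, 6600, NA, 3500, 4700),
  stringsAsFactors = FALSE
)

# Split each column name into a variable part and a four-digit year.
# ".value" keeps State, Tuition and StateGrants as separate columns;
# years without a StateGrants column are filled with NA.
tidy <- df %>%
  pivot_longer(
    cols = -Name,
    names_to = c(".value", "year"),
    names_pattern = "([A-Za-z]+)([0-9]{4})"
  )

# One row per year and state, with the means in separate columns.
means_wide <- tidy %>%
  group_by(year, State) %>%
  summarise(
    Tuition = mean(Tuition),
    StateGrants = mean(StateGrants),
    .groups = "drop"
  )

print(means_wide)
```

Because the column types survive the reshape, no as.numeric conversion is needed in this variant.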

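If you want to compute different functions for tuition and grants (for example, the mean of tuition and the sum of grants), you can do the following: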

df <- gather(df, key, value, -Name) %>% 
   separate(key, c("var", "year"), sep = -5) %>% 
   spread(var, value) %>% 
   mutate_at(c("Tuition", "StateGrants"), as.numeric) %>% 
   group_by(year, State) %>% 
   summarise(Grant_Sum = sum(StateGrants, na.rm = TRUE), Tuition_Mean = mean(Tuition))

This gives you:

Source: local data frame [9 x 4]
Groups: year [?]

   year State Grant_Sum Tuition_Mean
  <chr> <chr>     <dbl>        <dbl>
1  2014    CA      9200     29415.00
2  2014    MA      6600     27513.33
3  2014    OR      8200     27820.00
4  2015    CA         0     30303.33
5  2015    MA         0     18250.00
6  2015    OR         0     23850.00
7  2016    CA         0     37180.00
8  2016    MA         0     27990.00
9  2016    OR         0     24276.67

Note that I used sum with na.rm = T here, which returns 0 if all elements are NA. Make sure this makes sense for your use case.
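If a 0 for years without any grant data would be misleading, one possible workaround is a small wrapper that returns NA when every value is missing. The helper name `sum_or_na` is hypothetical (it is not part of dplyr or base R); treat this as a sketch:

```r
# Hypothetical helper: like sum(x, na.rm = TRUE), but stays NA
# when there is not a single observed value to add up.
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

all_missing <- c(NA_real_, NA_real_)
some_missing <- c(5000, 4200, NA)

sum(all_missing, na.rm = TRUE)  # 0
sum_or_na(all_missing)          # NA
sum_or_na(some_missing)         # 9200
```

It could then be used inside summarise in place of sum, e.g. Grant_Sum = sum_or_na(StateGrants).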

Also, as an aside, to get the individual data.frames you asked for, you can use filter(year == 2014) etc., as in df_2014 <- filter(df, year == 2014).
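With around 50 years, writing one filter call per year gets tedious. As a rough alternative, base R's split() returns a named list with one data frame per year. The toy table `by_year_state` below is a stand-in for the summarised df, with a few values taken from the output above:

```r
# Toy stand-in for the summarised data (a subset of the rows shown above).
by_year_state <- data.frame(
  year = c("2014", "2014", "2015", "2016"),
  State = c("CA", "MA", "CA", "CA"),
  Tuition_Mean = c(29415, 27513.33, 30303.33, 37180),
  stringsAsFactors = FALSE
)

# One data frame per year, stored in a list named by year.
per_year <- split(by_year_state, by_year_state$year)

names(per_year)
per_year[["2014"]]
```

Keeping the pieces in a list instead of creating df2014, df2015, ... with assign() makes it easy to loop over them later with lapply().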
