Using conditions in a group_by()/summarize() loop
I have a data frame that looks like this (I have many more years and variables):
Name State2014 State2015 State2016 Tuition2014 Tuition2015 Tuition2016 StateGrants2014
Jared CA CA MA 22430 23060 40650 5000
Beth CA CA CA 36400 37050 37180 4200
Steven MA MA MA 18010 18250 18720 NA
Lary MA CA MA 24080 30800 24600 6600
Tom MA OR OR 40450 15800 16040 NA
Alfred OR OR OR 23570 23680 23750 3500
Cathy OR OR OR 32070 32070 33040 4700
My objective (in this example) is to get the mean tuition for each state and the sum of state grants for each state. My idea was to subset the data by year:
State2014 Tuition2014 StateGrants2014
CA 22430 5000
CA 36400 4200
MA 18010 NA
MA 24080 6600
MA 40450 NA
OR 23570 3500
OR 32070 4700
State2015 Tuition2015
CA 23060
CA 37050
MA 18250
CA 30800
OR 15800
OR 23680
OR 32070
State2016 Tuition2016
MA 40650
CA 37180
MA 18720
MA 24600
OR 16040
OR 23750
OR 33040
Then I would group_by state and summarize (saving each result as a separate df) to get the following:
State2014 Tuition2014 StateGrants2014
CA 29415 9200
MA 27513 6600
OR 27820 8200
State2015 Tuition2015
CA 30303
MA 18250
OR 23850
State2016 Tuition2016
CA 37180
MA 27990
OR 24277
Then I would merge by state. Here is my code:
years <- c(2014, 2015, 2016)
for (i in seq_along(years)) {
  # grab the variables from a certain year and save as a new df
  df_year <- df[, grep(paste(years[[i]], "$", sep = ""), colnames(df))]
  # take the year off each variable name (to make it easier to summarize)
  names(df_year) <- gsub(years[[i]], "", names(df_year), fixed = TRUE)
  df_year <- df_year %>%
    group_by(State) %>%
    summarize(Tuition = mean(Tuition, na.rm = TRUE),
              # this part of the code does not work. In this example, I only want to have this part if the year is 2016.
              if (years[[i]] == '2016')
              {StateGrants = mean(StateGrants, na.rm = TRUE)})
  # rename df_year to df####
  assign(paste("df", years[[i]], sep = ''), df_year)
}
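I have about 50 years of data and a large number of variables, so I would like to use a loop. So my question is: how do I add a conditional statement inside the group_by()/summarize() call (summarizing certain variables only for certain years)? Thanks!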
*Edit: I realize I could take the if {} out of the function and do something like this:
# inside the for loop over years
if (years[[i]] == 2016) {
  df_year <- df_year %>%
    group_by(State) %>%
    summarize(Tuition = mean(Tuition, na.rm = TRUE),
              StateGrants = mean(StateGrants, na.rm = TRUE))
  # rename df_year to df####
  assign(paste("df", years[[i]], sep = ''), df_year)
} else {
  df_year <- df_year %>%
    group_by(State) %>%
    summarize(Tuition = mean(Tuition, na.rm = TRUE))
  # rename df_year to df####
  assign(paste("df", years[[i]], sep = ''), df_year)
}
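But there are so many combinations of variables that writing out each case inside the for loop is neither efficient nor very useful.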
Working with tidy data is much easier, so let me show you how to tidy your data. See http://r4ds.had.co.nz/tidy-data.html.
library(tidyr)
library(dplyr)
df <- gather(df, key, value, -Name) %>%
  # separate years from the variables
  separate(key, c("var", "year"), sep = -4) %>%
  # the above line splits up e.g. State2014 into State and 2014.
  # It does so by splitting four characters from the end of the
  # entry (tidyr versions before 0.8.2 counted negative positions
  # differently and needed sep = -5 here). Please check that this
  # works for your other variables in case your naming conventions
  # are inconsistent.
  spread(var, value) %>%
  # turn numbers back to numeric
  mutate_at(.vars = c("Tuition", "StateGrants"), as.numeric) %>%
  gather(var, val, -Name, -year, -State) %>%
  # group by the variables of interest. Note that `var` here
  # refers to Tuition and StateGrants. If you have more variables,
  # they will be included here as well. If you want to exclude more
  # variables from being included here in `var`, add more "-colName"
  # entries in the `gather` statement above
  group_by(year, State, var) %>%
  # summarize:
  summarise(mean_values = mean(val))
This gives you:
Source: local data frame [18 x 4]
Groups: year, State [?]
year State var mean_values
<chr> <chr> <chr> <dbl>
1 2014 CA StateGrants 4600.00
2 2014 CA Tuition 29415.00
3 2014 MA StateGrants NA
4 2014 MA Tuition 27513.33
5 2014 OR StateGrants 4100.00
6 2014 OR Tuition 27820.00
7 2015 CA StateGrants NA
8 2015 CA Tuition 30303.33
9 2015 MA StateGrants NA
10 2015 MA Tuition 18250.00
11 2015 OR StateGrants NA
12 2015 OR Tuition 23850.00
13 2016 CA StateGrants NA
14 2016 CA Tuition 37180.00
15 2016 MA StateGrants NA
16 2016 MA Tuition 27990.00
17 2016 OR StateGrants NA
18 2016 OR Tuition 24276.67
If you don't like this shape, you can add %>% spread(var, mean_values) after the summarise statement to get the means of Tuition and StateGrants in separate columns.
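As a small illustration of that reshaping step, here is a sketch that applies spread() to a hand-typed excerpt of the summary above (the 2014 rows for CA and OR). The names `long` and `wide` are placeholders I made up for this example; they are not objects from the code above.

```r
library(tidyr)

# Hand-typed excerpt of the long summary shown above (2014, CA and OR only)
long <- data.frame(
  year        = c("2014", "2014", "2014", "2014"),
  State       = c("CA", "CA", "OR", "OR"),
  var         = c("StateGrants", "Tuition", "StateGrants", "Tuition"),
  mean_values = c(4600, 29415, 4100, 27820),
  stringsAsFactors = FALSE
)

# One column per value of `var`, filled with the matching mean
wide <- spread(long, var, mean_values)
print(wide)
```

Each year/State pair ends up on a single row, with StateGrants and Tuition side by side.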
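If you want to compute different functions for tuition and grants (e.g., the mean for tuition and the sum for grants), you can do the following, starting again from the original wide df:
df <- gather(df, key, value, -Name) %>%
  # sep = -4 assumes tidyr 0.8.2 or later; older versions needed sep = -5
  separate(key, c("var", "year"), sep = -4) %>%
  spread(var, value) %>%
  mutate_at(.vars = c("Tuition", "StateGrants"), as.numeric) %>%
  group_by(year, State) %>%
  summarise(Grant_Sum = sum(StateGrants, na.rm = TRUE), Tuition_Mean = mean(Tuition))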
This gives you:
Source: local data frame [9 x 4]
Groups: year [?]
year State Grant_Sum Tuition_Mean
<chr> <chr> <dbl> <dbl>
1 2014 CA 9200 29415.00
2 2014 MA 6600 27513.33
3 2014 OR 8200 27820.00
4 2015 CA 0 30303.33
5 2015 MA 0 18250.00
6 2015 OR 0 23850.00
7 2016 CA 0 37180.00
8 2016 MA 0 27990.00
9 2016 OR 0 24276.67
Note that I used sum with na.rm = TRUE here, which returns 0 if all elements are NA. Make sure this makes sense for your use case.
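To make that caveat concrete, here is a minimal base R sketch. The helper `sum_or_na()` is a hypothetical name of mine, shown only as one possible alternative if you would rather see NA than 0 for groups with no grant data at all.

```r
# sum() over an all-NA vector with na.rm = TRUE gives 0, not NA
all_missing <- c(NA_real_, NA_real_)
zero_total <- sum(all_missing, na.rm = TRUE)

# Hypothetical helper: return NA when every element is missing,
# otherwise the sum of the non-missing elements
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

na_total      <- sum_or_na(all_missing)
partial_total <- sum_or_na(c(5000, NA, 4200))
```

Swapping such a helper into the summarise call in place of sum(StateGrants, na.rm = TRUE) would, under these assumptions, leave the 2015 and 2016 grant totals as NA instead of 0.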
Also, as an aside, to get the individual data.frames you asked for, you can use filter(year == 2014) etc., as in df_2014 <- filter(df, year == 2014).
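If you would rather get all the per-year data frames in one go, base R's split() is one option. The sketch below runs on a hand-typed excerpt of the summary above; `summary_df` and `by_year` are placeholder names of mine.

```r
# Hand-typed excerpt of the per-year, per-state summary shown above
summary_df <- data.frame(
  year         = c("2014", "2014", "2015"),
  State        = c("CA", "MA", "CA"),
  Tuition_Mean = c(29415, 27513.33, 30303.33),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per year
by_year <- split(summary_df, summary_df$year)

# Pull out a single year by name
df_2014 <- by_year[["2014"]]
```

Keeping the pieces in a list rather than creating df2014, df2015, ... with assign() tends to be easier to loop over when there are around 50 years.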