Aggregate function in R using two columns simultaneously
Data:
df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))
Name Year Balance
1 John 2016 100
2 John 2015 150
3 Stacy 2014 65
4 Stacy 2016 75
5 Kat 2006 150
6 Kat 2006 10
Code:
aggregate(cbind(Year,Balance)~Name,data=df,FUN=max )
Output:
Name Year Balance
1 John 2016 150
2 Kat 2006 150
3 Stacy 2016 75
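I want to aggregate/summarize the data frame above using the two columns Year and Balance. I am using the base function aggregate to do this. What I need is, for each Name, the maximum balance within the most recent year. In the first row of the output, John has the latest year (2016) but the balance from 2015 (150), which is not what I need; it should output 100 instead of 150. Where am I going wrong?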
Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd do this instead:
library(data.table)
setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
# Name Year Balance
#1: John 2016 100
#2: Stacy 2016 75
#3: Kat 2006 150
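For completeness, here is one way to make aggregate itself work, as a rough sketch rather than a recommendation. With cbind(Year, Balance) ~ Name, FUN is applied to each column separately, so the maximum Year and the maximum Balance can come from different rows; that is why John shows 2016 next to 150. The sketch finds the latest year per Name first and only then takes the maximum balance within that year. The names latest, recent and res are just illustrative.

```r
# Same data as in the question
df <- data.frame(Name = c("John", "John", "Stacy", "Stacy", "Kat", "Kat"),
                 Year = c(2016, 2015, 2014, 2016, 2006, 2006),
                 Balance = c(100, 150, 65, 75, 150, 10))

# Step 1: latest Year for each Name
latest <- aggregate(Year ~ Name, data = df, FUN = max)

# Step 2: keep only the rows that fall in that latest year
# (merge() does an inner join on Name and Year)
recent <- merge(df, latest, by = c("Name", "Year"))

# Step 3: largest Balance within the latest year
res <- aggregate(Balance ~ Name + Year, data = recent, FUN = max)
res
```

This takes three calls where the sorted approach takes one, which is roughly the point being made about aggregate here.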
I would suggest using the dplyr library:
library(dplyr)
data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),
Year=c(2016,2015,2014,2016,2006,2006),
Balance=c(100,150,65,75,150,10)) %>% # create the data frame
as_tibble() %>% # convert it to a tibble (tbl_df() is deprecated)
group_by(Name, Year) %>% # group it by Name and Year
summarise(maxBalance=max(Balance)) %>% # maximum balance for each Name/Year group
group_by(Name) %>% # group the resulting data frame by Name
top_n(1, Year) # keep the row with the most recent Year in each group
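A slightly different dplyr route, offered only as a sketch, is to filter each Name down to its latest year first and then summarise. It assumes a dplyr version that provides first(); res is an illustrative name.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(Name = c("John", "John", "Stacy", "Stacy", "Kat", "Kat"),
                 Year = c(2016, 2015, 2014, 2016, 2006, 2006),
                 Balance = c(100, 150, 65, 75, 150, 10))

res <- df %>%
  group_by(Name) %>%
  filter(Year == max(Year)) %>%     # keep only rows from each Name's latest year
  summarise(Year = first(Year),     # every remaining row in a group has the same Year
            Balance = max(Balance)) # largest balance within that year
res
```

Filtering before summarising means the balance can only come from the latest year, so the two columns stay tied to the same set of rows.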
Here is another solution that does not need the data.table package.
First, sort the data frame by Year and Balance, both descending:
df <- df[order(-df$Year, -df$Balance),]
Then select the first row for each Name:
df[!duplicated(df$Name),]