R: dplyr summarize, sum only values of uniques
I'm having trouble with a pesky command: I'd like to use the dplyr package to compute summaries. It's easiest to explain with some sample data:
example_data <- structure(list(Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
Name = structure(c(3L, 3L, 4L, 3L, 2L, 3L, 2L, 4L, 1L), .Label = c("George",
"Jack", "John", "Mary"), class = "factor"), Birth.Year = c(1995L,
1995L, 1997L, 1995L, 1999L, 1995L, 1999L, 1997L, 1997L),
Special_Balance = c(10L, 40L, 30L, 5L, 10L, 15L, 2L, 1L,
100L), Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L,
20L, 100L, 1600L)), .Names = c("Date", "Name", "Birth.Year",
"Special_Balance", "Total_Balance"), class = "data.frame", row.names = c(NA,
-9L))
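I'm aiming for two simple summaries. First, I'd like to summarise by Date, with the code shown below. The part that is wrong is the total_balance_sum calculation, where I want to sum each person's balance but count each person only once. So, for example, my result for Date=1 is total_balance_sum=100, but it should be 150 (adding John's total_balance of 100 once to Mary's total_balance of 50). This wrong calculation obviously throws off the final pct calculation.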
example_data %>%
  group_by(Date) %>%
  summarise(
    total_people=n_distinct(Name),
    total_loan_exposures=n(),
    special_sum=sum(Special_Balance,na.rm=TRUE),
    total_balance_sum=sum(Total_Balance[n_distinct(Name)]),
    total_pct=special_sum/total_balance_sum
  ) -> example_summary
In the second summary (below), I group by both date and birth year, but total_balance_sum is again calculated incorrectly.
example_data %>%
  group_by(Date,Birth.Year) %>%
  summarise(
    total_people=n_distinct(Name),
    total_loan_exposures=n(),
    special_sum=sum(Special_Balance,na.rm=TRUE),
    total_balance_sum=sum(Total_Balance[n_distinct(Name)]),
    total_pct=special_sum/total_balance_sum
  ) -> example_summary_birthyear
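What's the right way to accomplish my goal? Clearly the n_distinct I'm using just picks out one of the values rather than summing properly across names.
Thanks for your help.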
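To make the symptom concrete, here is a minimal base-R sketch of what the expression Total_Balance[n_distinct(Name)] evaluates to for the Date=1 rows. The variable names (name, total_balance, k, picked) are illustrative only, and length(unique(...)) stands in for n_distinct() so the snippet runs without dplyr.

```r
# Rows for Date == 1 in the sample data (copied by hand for illustration)
name <- c("John", "John", "Mary")
total_balance <- c(100L, 100L, 50L)

# length(unique(name)) gives the same count n_distinct(name) would: 2
k <- length(unique(name))

# Indexing with a single number selects a single element, here the 2nd row
picked <- total_balance[k]

# Summing one element just returns it: 100, not the intended 150
sum(picked)
```

So the count of distinct names is being used as a row position, which is why only one balance ends up in the sum.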
I'm not entirely sure what you're asking for, but does this do what you want? (First example only.)
example_data %>%
  group_by(Date, Name) %>%
  summarise(
    total_loan_exposures=n(),
    total_SpecialPerson=sum(Special_Balance,na.rm=TRUE),
    total_balance_sumPerson=Total_Balance[1]) %>%
  ungroup() %>%
  group_by(Date) %>%
  summarise(
    total_people=n(),
    total_loan_exposures=sum(total_loan_exposures),
    special_sum=sum(total_SpecialPerson,na.rm=TRUE),
    total_balance_sum=sum(total_balance_sumPerson)) %>%
  mutate(total_pct=(special_sum/total_balance_sum)) -> example_summary
> example_summary
Source: local data frame [3 x 6]

  Date total_people total_loan_exposures special_sum total_balance_sum  total_pct
1    1            2                    3          80               150 0.53333333
2    2            2                    4          32               220 0.14545455
3    3            2                    2         101              1700 0.05941176
For the second example (for the first, just remove Birth.Year):
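library(dplyr)
example_data %>%
  group_by(Date, Birth.Year) %>%
  # per-group totals, computed before duplicate rows are dropped
  mutate(special_sum = sum(Special_Balance),
         total_loan_exposure = n()) %>%
  # one row per person and balance; .keep_all = TRUE (dplyr >= 0.5) retains
  # the columns added by mutate(), which distinct() would otherwise drop
  distinct(Name, Total_Balance, .keep_all = TRUE) %>%
  summarise(Total_balance_sum = sum(Total_Balance),
            special_sum = special_sum[1],
            total_people = n(),
            total_loan_exposure = total_loan_exposure[1],
            total_pct = special_sum / Total_balance_sum)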
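Both answers above deduplicate per person before summing. As a possible single-pass variant, the sketch below keeps the shape of the original summarise() call and only swaps the index for !duplicated(Name), which is TRUE on the first row of each person within a group. It assumes a person's Total_Balance is the same on all of their rows within a group, as in the sample data. The helper name summarise_unique is hypothetical, and the data frame is rebuilt with character names instead of a factor for readability.

```r
library(dplyr)

# Sample data from the question, with Name as plain character strings
example_data <- data.frame(
  Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  Name = c("John", "John", "Mary", "John", "Jack", "John", "Jack", "Mary", "George"),
  Birth.Year = c(1995L, 1995L, 1997L, 1995L, 1999L, 1995L, 1999L, 1997L, 1997L),
  Special_Balance = c(10L, 40L, 30L, 5L, 10L, 15L, 2L, 1L, 100L),
  Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L, 20L, 100L, 1600L)
)

# Hypothetical helper: the same summary applied to any grouped data frame
summarise_unique <- function(grouped) {
  grouped %>%
    summarise(
      total_people = n_distinct(Name),
      total_loan_exposures = n(),
      special_sum = sum(Special_Balance, na.rm = TRUE),
      # keep only the first row seen for each person in the group
      # (assumes one balance per person per group)
      total_balance_sum = sum(Total_Balance[!duplicated(Name)]),
      total_pct = special_sum / total_balance_sum
    )
}

by_date <- example_data %>% group_by(Date) %>% summarise_unique()
by_date_year <- example_data %>% group_by(Date, Birth.Year) %>% summarise_unique()

by_date$total_balance_sum       # 150 220 1700
by_date_year$total_balance_sum  # 100 50 200 20 1700
```

If a person could carry different balances within one group, this would silently keep only the first, so the two-stage group_by(Date, Name) approach may be the safer choice in that case.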