R: dplyr summarize, sum only values of uniques

Question

I'm having trouble with a tricky summary that I want to compute with the dplyr package. It's easiest to explain with some example data:

example_data <- structure(list(Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L), 
    Name = structure(c(3L, 3L, 4L, 3L, 2L, 3L, 2L, 4L, 1L), .Label = c("George", 
    "Jack", "John", "Mary"), class = "factor"), Birth.Year = c(1995L, 
    1995L, 1997L, 1995L, 1999L, 1995L, 1999L, 1997L, 1997L), 
    Special_Balance = c(10L, 40L, 30L, 5L, 10L, 15L, 2L, 1L, 
    100L), Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L, 
    20L, 100L, 1600L)), .Names = c("Date", "Name", "Birth.Year", 
"Special_Balance", "Total_Balance"), class = "data.frame", row.names = c(NA, 
-9L))

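My goal is two simple summaries. First, I want to summarise by Date, using the code below. The part that is wrong is the total_balance_sum calculation, where I want to sum the balance of each person but count each person only once. For example, my result for Date=1 is total_balance_sum=100, but it should be 150 (John's total_balance of 100, counted once, plus Mary's total_balance of 50). This wrong calculation obviously throws off the final total_pct calculation.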

example_data %>% 
  group_by(Date) %>% 
  summarise(
    total_people=n_distinct(Name),
    total_loan_exposures=n(),

    special_sum=sum(Special_Balance,na.rm=TRUE),
    total_balance_sum=sum(Total_Balance[n_distinct(Name)]), 
    total_pct=special_sum/total_balance_sum

  ) -> example_summary

In the second summary (below), I group by both Date and Birth.Year, but total_balance_sum is again computed incorrectly.

example_data %>% 
  group_by(Date,Birth.Year) %>% 
  summarise(
    total_people=n_distinct(Name),
    total_loan_exposures=n(),

    special_sum=sum(Special_Balance,na.rm=TRUE),
    total_balance_sum=sum(Total_Balance[n_distinct(Name)]), 
    total_pct=special_sum/total_balance_sum

  ) -> example_summary_birthyear

What is the right way to achieve this? Clearly the n_distinct I'm using just picks out one of the values instead of summing properly across names.
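The symptom can be reproduced on the Date = 1 rows alone. The following is a minimal base-R sketch, not part of the original code: the vectors `name` and `total_balance` are hypothetical stand-ins for the `Name` and `Total_Balance` columns of that group, and `length(unique(name))` is used as the base-R equivalent of `n_distinct(Name)`.

```r
# Rows for Date == 1 from the example data, as plain vectors (illustrative names)
name          <- c("John", "John", "Mary")
total_balance <- c(100L, 100L, 50L)

# n_distinct(Name) is a count; here it equals length(unique(name)), i.e. 2
k <- length(unique(name))

# Indexing with a count picks the k-th element, not one row per person
wrong <- sum(total_balance[k])                   # total_balance[2]

# Keeping the first row for each name gives one balance per person
right <- sum(total_balance[!duplicated(name)])   # John once + Mary once

print(c(wrong = wrong, right = right))
```

So `Total_Balance[n_distinct(Name)]` is a single element selected by position, which is why the sum comes out as 100 rather than 150.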

Thanks for your help.

Answer 1

I'm not entirely sure what you're asking, but does this do what you want? (For the first example only.)

example_data %>% 
  # Step 1: collapse to one row per person per date
  group_by(Date, Name) %>% 
  summarise(
    total_loan_exposures = n(),
    total_SpecialPerson = sum(Special_Balance, na.rm = TRUE),
    total_balance_sumPerson = Total_Balance[1]) %>% 
  ungroup() %>% 
  # Step 2: sum the per-person rows within each date
  group_by(Date) %>% 
  summarise(
    total_people = n(),
    total_loan_exposures = sum(total_loan_exposures),
    special_sum = sum(total_SpecialPerson, na.rm = TRUE),
    total_balance_sum = sum(total_balance_sumPerson)) %>% 
  mutate(total_pct = special_sum / total_balance_sum) -> example_summary

> example_summary
Source: local data frame [3 x 6]

  Date total_people total_loan_exposures special_sum total_balance_sum  total_pct
1    1            2                    3          80               150 0.53333333
2    2            2                    4          32               220 0.14545455
3    3            2                    2         101              1700 0.05941176
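A possible single-step alternative to the two-step pipeline above is to index `Total_Balance` with `!duplicated(Name)` inside `summarise()`, which keeps the first row per person within each group. This is a sketch of a different technique, not the answerer's method. It assumes each person has the same `Total_Balance` on every row within a group, and it rebuilds `example_data` with `Name` as a character column to stay self-contained. The name `one_step` is illustrative.

```r
library(dplyr)

# Same values as the example data in the question
example_data <- data.frame(
  Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  Name = c("John", "John", "Mary", "John", "Jack",
           "John", "Jack", "Mary", "George"),
  Birth.Year = c(1995L, 1995L, 1997L, 1995L, 1999L,
                 1995L, 1999L, 1997L, 1997L),
  Special_Balance = c(10L, 40L, 30L, 5L, 10L, 15L, 2L, 1L, 100L),
  Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L, 20L, 100L, 1600L)
)

one_step <- example_data %>%
  group_by(Date) %>%
  summarise(
    total_people = n_distinct(Name),
    total_loan_exposures = n(),
    special_sum = sum(Special_Balance, na.rm = TRUE),
    # first row per Name within the group, so each person counts once
    total_balance_sum = sum(Total_Balance[!duplicated(Name)], na.rm = TRUE),
    total_pct = special_sum / total_balance_sum
  )

print(one_step)
```

For the second summary, the same `summarise()` call should work after changing the grouping to `group_by(Date, Birth.Year)`.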

Answer 2

For the second example (for the first, just remove Birth.Year):

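library(dplyr)
example_data %>% group_by(Date, Birth.Year) %>%
                 # group totals are computed on all rows, before de-duplicating
                 mutate(special_sum = sum(Special_Balance),
                        total_loan_exposure = n()) %>%
                 # .keep_all = TRUE retains special_sum and total_loan_exposure in current dplyr
                 distinct(Name, Total_Balance, .keep_all = TRUE) %>%
                 summarise(Total_balance_sum = sum(Total_Balance),
                           special_sum = special_sum[1],
                           total_people = n(),
                           total_loan_exposure = total_loan_exposure[1],
                           total_pct = special_sum / Total_balance_sum)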
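Both answers rely on each person having a single Total_Balance within a date. If that might not hold in real data, a quick check along the following lines could flag the problem cases before summarising. This is a sketch under that assumption: `balances` is a cut-down copy of the example data with only the columns the check needs, and `conflicts` is an illustrative name.

```r
library(dplyr)

# Date, Name and Total_Balance columns of the example data
balances <- data.frame(
  Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  Name = c("John", "John", "Mary", "John", "Jack",
           "John", "Jack", "Mary", "George"),
  Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L, 20L, 100L, 1600L)
)

conflicts <- balances %>%
  distinct(Date, Name, Total_Balance) %>%  # one row per distinct balance
  count(Date, Name) %>%                    # distinct balances per person and date
  filter(n > 1)                            # keep only inconsistent cases

print(nrow(conflicts))
```

An empty `conflicts` table means that taking the first balance per person, as both answers do, loses no information.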
