Create a new variable that summarizes observations given a certain condition

Hi, I'm new to R and I don't understand why my approach below doesn't work. I have this df1, which looks like this:

  view  duration_hours  date 
1  a        5          2021-03-29            
2  a        7          2021-03-29           
3  a        3          2021-03-30            
4  b        2          2021-03-30
5  b        5          2021-03-30
6  c        9          2021-03-30      
7  c        2          2021-03-31            
8  c        3          2021-04-01

I would like a new data frame (df2) that sums the duration over all views and also splits it out by individual view for each date:

        date  duration_sum   a   b   c
1  2021-03-29            12  12   0   0
2  2021-03-30            19   3   7   9
3  2021-03-31             2   0   0   2
4  2021-04-01             3   0   0   3

First, I tried the following for the "overall" duration only, which created the "duration_sum" variable with the total duration per date, as expected:

df2 <- df1 %>%
  group_by(date) %>%
  summarise(duration_sum = sum(duration_hours, na.rm = TRUE))

Then I tried to add the other variables by extending the code as follows:

df2<- df1 %>%
  group_by(date) %>%
  summarise(duration_sum = sum(duration_hours, na.rm = TRUE),
            a =sum(duration_hours[view=="a"], na.r = TRUE),
            b =sum(duration_hours[view=="b"], na.r = TRUE),
            c =sum(duration_hours[view=="c"], na.r = TRUE))

But this does not produce the correct sums. What am I doing wrong?

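The argument is na.rm, not na.r. Because na.r does not match any named argument of sum(), it is treated as one more value to be summed: TRUE is coerced to 1 (FALSE would be coerced to 0), so each total comes out 1 too high.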

For example:

sum(c(1, 2), na.r = TRUE)
#[1] 4
sum(c(1, 2), na.rm = TRUE)
#[1] 3
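
To see why the misspelled name is not caught, it may help to look at what ends up in `...`. The sketch below uses a hypothetical helper, show_dots (not part of base R), that mimics the signature sum(..., na.rm = FALSE). Arguments declared after `...` must be matched by their full name, so na.r is not partially matched to na.rm.

```r
# Hypothetical helper (an assumption for illustration, not a base R function).
# It has the same argument layout as sum(): `...` first, then na.rm.
show_dots <- function(..., na.rm = FALSE) {
  list(dots = list(...), na.rm = na.rm)
}

res <- show_dots(c(1, 2), na.r = TRUE)

# The misspelled argument was captured by `...` as an extra value
names(res$dots)
# [1] ""     "na.r"

# ... and na.rm itself kept its default
res$na.rm
# [1] FALSE
```

In sum(), everything captured by `...` is added up, which is where the extra 1 comes from.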

The OP's corrected code would be

library(dplyr)
df1 %>%
  group_by(date) %>%
   summarise(duration_sum = sum(duration_hours, na.rm = TRUE),
        a =sum(duration_hours[view=="a"], na.rm = TRUE),
        b =sum(duration_hours[view=="b"], na.rm = TRUE),
        c =sum(duration_hours[view=="c"], na.rm = TRUE))
# A tibble: 4 x 5
#  date       duration_sum     a     b     c
#* <chr>             <int> <int> <int> <int>
#1 2021-03-29           12    12     0     0
#2 2021-03-30           19     3     7     9
#3 2021-03-31            2     0     0     2
#4 2021-04-01            3     0     0     3

Another option is pivot_wider, which produces the per-view columns a, b and c (but not duration_sum)

library(tidyr)
pivot_wider(df1, names_from = view, values_from = duration_hours,    
         values_fn = sum, values_fill = 0)
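
If the duration_sum column is also wanted with the pivot_wider approach, one possible sketch is to add it afterwards with mutate. This assumes the views are exactly a, b and c, as in the example data, and a dplyr version that supports the .after argument of mutate (1.0.0 or later).

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
df1 <- data.frame(
  view = c("a", "a", "a", "b", "b", "c", "c", "c"),
  duration_hours = c(5L, 7L, 3L, 2L, 5L, 9L, 2L, 3L),
  date = c("2021-03-29", "2021-03-29", "2021-03-30", "2021-03-30",
           "2021-03-30", "2021-03-30", "2021-03-31", "2021-04-01")
)

df2 <- df1 %>%
  # one column per view, summing durations within each date
  pivot_wider(names_from = view, values_from = duration_hours,
              values_fn = sum, values_fill = 0) %>%
  # total per date, placed right after the date column
  mutate(duration_sum = a + b + c, .after = date)

df2
```

With more views than three, the hard-coded a + b + c would need to be replaced by something more general, for example rowSums over the pivoted columns.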

Data

df1 <- structure(list(view = c("a", "a", "a", "b", "b", "c", "c", "c"
), duration_hours = c(5L, 7L, 3L, 2L, 5L, 9L, 2L, 3L), date = c("2021-03-29", 
"2021-03-29", "2021-03-30", "2021-03-30", "2021-03-30", "2021-03-30", 
"2021-03-31", "2021-04-01")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8"))