Column mean, variance and boolean mutation with multiple conditions in R

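I have a title-day panel dataset in long format. In this reproducible example there are three persons who give a score (1, 2 and 3). For every person there is the score itself, whether a score has been given for that title and day (boolean), and the day on which the score was coded. That coding day is the only variable that stays constant within a title for a given person. When a person gave no score at all for a title, this is indicated with NAs. See df1 here: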

title <- c("x","x","x","x","y","y","y","y","z","z","z","z")
day <- c(0,1,2,3,0,1,2,3,0,1,2,3)
avg_score <- c(0,0,0,0,0,0,0,0,0,0,0,0)
variance <- c(0,0,0,0,0,0,0,0,0,0,0,0)
score_or_not <- c(0,0,0,0,0,0,0,0,0,0,0,0)
score_1 <- c(0,0,0,30,NA,NA,NA,NA,0,0,0,50)
score_or_not1 <- c(0,0,0,1,NA,NA,NA,NA,0,0,0,1)
score_day1 <- c(3,3,3,3,NA,NA,NA,NA,3,3,3,3)
score_2 <- c(NA,NA,NA,NA,0,80,80,80,0,0,80,80)
score_or_not2 <- c(NA,NA,NA,NA,0,1,1,1,0,0,1,1)
score_day2 <- c(NA,NA,NA,NA,1,1,1,1,2,2,2,2)
score_3 <- c(0,0,0,0,NA,NA,NA,NA,90,90,90,90)
score_or_not3 <- c(0,0,0,0,NA,NA,NA,NA,1,1,1,1)
score_day3 <- c(-2,-2,-2,-2,NA,NA,NA,NA,0,0,0,0)

df1 <- data.frame(title,day,avg_score,variance,score_or_not,score_1,score_or_not1,score_day1,score_2,score_or_not2,score_day2,score_3,score_or_not3,score_day3)

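I run into the following problem. I need three new columns based on these given scores (avg_score, variance and score_or_not). However, there are some conditions: when score_day is negative or zero, the score should not be taken into account for the new columns and should be ignored, just like the NA columns. It is important that NA values stay NA in the original columns and that negative or 0 values stay unchanged as well.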

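Here is a description of the three new variables:
1. avg_score should be the average of all given scores, counting only those that meet the conditions. When there is only one score, that score should be the value of avg_score.
2. variance should be 0 when no score or only one score is available. When there are two or more, the variance should be calculated in this column.
3. score_or_not should be a boolean that shows whether there is a score on that day, again taking the conditions into account.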
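To make the three rules concrete, here is a minimal base R sketch for a single title-day. The helper name `summarise_scores` and its arguments are hypothetical (they do not appear in df1), and the rule that a score only counts when its score_or_not flag is 1 is my reading of the expected output, not something stated explicitly above.

```r
# Hypothetical helper: applies the three rules to the scores of one title-day.
summarise_scores <- function(score, given, score_day) {
  # Assumption: a score counts only if it is not NA, was actually given
  # (flag == 1) and its score_day is strictly positive.
  valid <- !is.na(score) & !is.na(given) & !is.na(score_day) &
    given == 1 & score_day > 0
  s <- score[valid]
  c(avg_score    = if (length(s) > 0) mean(s) else 0,
    variance     = if (length(s) > 1) var(s) else 0,
    score_or_not = as.numeric(length(s) > 0))
}

# Title "z", day 3: person 3 is ignored because score_day3 is 0,
# so only 50 and 80 count (mean 65, sample variance 450).
z3 <- summarise_scores(score = c(50, 80, 90), given = c(1, 1, 1), score_day = c(3, 2, 0))

# Title "z", day 2: person 1 has not scored yet and person 3 is ignored,
# so only 80 counts (mean 80, variance 0).
z2 <- summarise_scores(score = c(0, 80, 90), given = c(0, 1, 1), score_day = c(3, 2, 0))
```

Both calls agree with the avg_score, variance and score_or_not values I expect for those rows.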

The result should look like this:

title <- c("x","x","x","x","y","y","y","y","z","z","z","z")
day <- c(0,1,2,3,0,1,2,3,0,1,2,3)
avg_score <- c(0,0,0,30,0,80,80,80,0,0,80,65)
variance <- c(0,0,0,0,0,0,0,0,0,0,0,450)
score_or_not <- c(0,0,0,1,0,1,1,1,0,0,1,1)
score_1 <- c(0,0,0,30,NA,NA,NA,NA,0,0,0,50)
score_or_not1 <- c(0,0,0,1,NA,NA,NA,NA,0,0,0,1)
score_day1 <- c(3,3,3,3,NA,NA,NA,NA,3,3,3,3)
score_2 <- c(NA,NA,NA,NA,0,80,80,80,0,0,80,80)
score_or_not2 <- c(NA,NA,NA,NA,0,1,1,1,0,0,1,1)
score_day2 <- c(NA,NA,NA,NA,1,1,1,1,2,2,2,2)
score_3 <- c(0,0,0,0,NA,NA,NA,NA,90,90,90,90)
score_or_not3 <- c(0,0,0,0,NA,NA,NA,NA,1,1,1,1)
score_day3 <- c(-2,-2,-2,-2,NA,NA,NA,NA,0,0,0,0)

Output <- data.frame(title,day,avg_score,variance,score_or_not,score_1,score_or_not1,score_day1,score_2,score_or_not2,score_day2,score_3,score_or_not3,score_day3)

Hope you guys can solve this specific problem.

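It is probably easiest to reshape the data to long format, do the calculations across all 3 persons subject to your conditions, and then join the result back to the original data frame.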

library(dplyr)
library(tidyr)

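# Drop the all-zero placeholder columns so the join adds the computed ones
base <- select(df1, -avg_score, -variance, -score_or_not)

# One row per title, day and person; keep only scores that meet the conditions
valid_summary <- base %>%
  pivot_longer(cols = -c(title, day),
               names_to = c(".value", "person"),
               names_pattern = "(.*)(\\d)") %>%
  filter(score_day > 0 & score_or_not == 1) %>%
  group_by(title, day) %>%
  summarise(avg_score = mean(score_, na.rm = TRUE),
            variance = var(score_, na.rm = TRUE),   # NA when only one score
            score_or_not = +(avg_score > 0),
            .groups = "drop")

# Title-days without any valid score (and single-score variances) become 0
left_join(base, valid_summary, by = c("title", "day")) %>%
  mutate(avg_score = replace_na(avg_score, 0),
         variance = replace_na(variance, 0),
         score_or_not = replace_na(score_or_not, 0))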

Result:

...

   avg_score variance score_or_not
1          0        0            0
2          0        0            0
3          0        0            0
4         30        0            1
5          0        0            0
6         80        0            1
7         80        0            1
8         80        0            1
9          0        0            0
10         0        0            0
11        80        0            1
12        65      450            1
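
The reshape step relies on the ".value" sentinel in names_to, which may not be obvious at first. As a small sketch, here is what it does to a single row (title "z", day 3 from df1); the names `toy` and `long` are only for illustration.

```r
library(tidyr)

# Illustrative one-row data frame with the same column naming as df1
toy <- data.frame(title = "z", day = 3,
                  score_1 = 50, score_or_not1 = 1, score_day1 = 3,
                  score_2 = 80, score_or_not2 = 1, score_day2 = 2,
                  score_3 = 90, score_or_not3 = 1, score_day3 = 0)

# "(.*)(\\d)" splits each name into a stem and a trailing digit:
# the stem becomes a column (".value"), the digit goes into "person".
long <- pivot_longer(toy, cols = -c(title, day),
                     names_to = c(".value", "person"),
                     names_pattern = "(.*)(\\d)")
long
```

This gives one row per person, with the stems score_, score_or_not and score_day as columns. That is why the summarise() call refers to a column named score_ with a trailing underscore, and why the filter on score_day and score_or_not can be applied to all three persons at once.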