Column mean, variance and boolean mutation with multiple conditions in R
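I have a title-day panel dataset in long format. In this reproducible example, three people (1, 2 and 3) give scores. For each person there are three columns: the score itself, a boolean indicating whether a score was given for that title and day, and the day on which the score was recorded. The last of these is the only variable that stays constant within a title for a given person. When a person gave no score at all for a title, their columns are NA. See df1: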
title <- c("x","x","x","x","y","y","y","y","z","z","z","z")
day <- c(0,1,2,3,0,1,2,3,0,1,2,3)
avg_score <- c(0,0,0,0,0,0,0,0,0,0,0,0)
variance <- c(0,0,0,0,0,0,0,0,0,0,0,0)
score_or_not <- c(0,0,0,0,0,0,0,0,0,0,0,0)
score_1 <- c(0,0,0,30,NA,NA,NA,NA,0,0,0,50)
score_or_not1 <- c(0,0,0,1,NA,NA,NA,NA,0,0,0,1)
score_day1 <- c(3,3,3,3,NA,NA,NA,NA,3,3,3,3)
score_2 <- c(NA,NA,NA,NA,0,80,80,80,0,0,80,80)
score_or_not2 <- c(NA,NA,NA,NA,0,1,1,1,0,0,1,1)
score_day2 <- c(NA,NA,NA,NA,1,1,1,1,2,2,2,2)
score_3 <- c(0,0,0,0,NA,NA,NA,NA,90,90,90,90)
score_or_not3 <- c(0,0,0,0,NA,NA,NA,NA,1,1,1,1)
score_day3 <- c(-2,-2,-2,-2,NA,NA,NA,NA,0,0,0,0)
df1 <- data.frame(title,day,avg_score,variance,score_or_not,score_1,score_or_not1,score_day1,score_2,score_or_not2,score_day2,score_3,score_or_not3,score_day3)
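I have run into the following problem. I need three new columns based on these scores (avg_score, variance and score_or_not). There are conditions, however: when score_day is negative or zero, the score should not be used for the new columns and should be ignored, just like the NA columns. It is important that NA values stay NA and that negative or 0 values also stay unchanged.
Here is a description of the three new variables:
1. avg_score should be the mean of all scores given, counting only those that satisfy the condition. When there is only one score, that score should be the value of avg_score.
2. variance should be 0 when no score or only one score is available. When there are 2 or more, the variance should be calculated in this column.
3. score_or_not should be a boolean showing whether there is a score on that day, again taking the condition into account.
The result should look like this: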
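To make the rules concrete, here is a small base R check of the last row of df1 (title z, day 3). It assumes the reading that a score counts only when its score_or_not flag is 1 and its score_day is positive. The variable names kept and single are only for illustration.

```r
# Row title "z", day 3: person 1 gave 50 (score_day 3), person 2 gave 80
# (score_day 2); person 3's 90 has score_day 0, so it is ignored.
kept <- c(50, 80)

avg_z3 <- mean(kept)  # mean of the two qualifying scores
var_z3 <- var(kept)   # sample variance (divides by n - 1)

# With a single qualifying score (e.g. title "x", day 3), var() returns NA,
# which is why the variance has to be set to 0 explicitly in that case.
single <- var(30)

print(c(avg = avg_z3, variance = var_z3))
```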
title <- c("x","x","x","x","y","y","y","y","z","z","z","z")
day <- c(0,1,2,3,0,1,2,3,0,1,2,3)
avg_score <- c(0,0,0,30,0,80,80,80,0,0,80,65)
variance <- c(0,0,0,0,0,0,0,0,0,0,0,450)
score_or_not <- c(0,0,0,1,0,1,1,1,0,0,1,1)
score_1 <- c(0,0,0,30,NA,NA,NA,NA,0,0,0,50)
score_or_not1 <- c(0,0,0,1,NA,NA,NA,NA,0,0,0,1)
score_day1 <- c(3,3,3,3,NA,NA,NA,NA,3,3,3,3)
score_2 <- c(NA,NA,NA,NA,0,80,80,80,0,0,80,80)
score_or_not2 <- c(NA,NA,NA,NA,0,1,1,1,0,0,1,1)
score_day2 <- c(NA,NA,NA,NA,1,1,1,1,2,2,2,2)
score_3 <- c(0,0,0,0,NA,NA,NA,NA,90,90,90,90)
score_or_not3 <- c(0,0,0,0,NA,NA,NA,NA,1,1,1,1)
score_day3 <- c(-2,-2,-2,-2,NA,NA,NA,NA,0,0,0,0)
Output <- data.frame(title,day,avg_score,variance,score_or_not,score_1,score_or_not1,score_day1,score_2,score_or_not2,score_day2,score_3,score_or_not3,score_day3)
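I hope you can help with this specific problem.
It is probably easiest to reshape, do the calculation across all 3 people according to your conditions, and then join the result back to the original data frame.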
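library(dplyr)
library(tidyr)

# Assumption: the all-zero placeholder columns are dropped first, so that every
# remaining column matches the name pattern and the join adds the new columns
# at the end instead of creating .x/.y duplicates.
df_scores <- select(df1, -avg_score, -variance, -score_or_not)

left_join(df_scores,
          pivot_longer(df_scores, cols = -c(title, day),
                       names_to = c(".value", "person"),
                       names_pattern = "(.*)(\\d)") %>%  # backslash must be escaped in R strings
            # keep only scores that were given and have a positive score_day
            filter(score_day > 0 & score_or_not == 1) %>%
            group_by(title, day) %>%
            summarise(avg_score = mean(score_, na.rm = TRUE),
                      variance = var(score_, na.rm = TRUE),
                      score_or_not = +(avg_score > 0),
                      .groups = "drop"),
          by = c("title", "day")) %>%
  # rows without a qualifying score, and single-score variances (NA), become 0
  mutate(avg_score = replace_na(avg_score, 0),
         variance = replace_na(variance, 0),
         score_or_not = replace_na(score_or_not, 0))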
Result:
...
   avg_score variance score_or_not
1          0        0            0
2          0        0            0
3          0        0            0
4         30        0            1
5          0        0            0
6         80        0            1
7         80        0            1
8         80        0            1
9          0        0            0
10         0        0            0
11        80        0            1
12        65      450            1
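For comparison, the same rules can be sketched in base R without reshaping. This is an alternative under stated assumptions, not the method above: summarise_scores is a hypothetical helper, and score_or_not is set from whether any score qualifies rather than from avg_score > 0 (the two agree on this data). The block rebuilds the score columns of df1 so it runs on its own.

```r
# Score columns of df1 from the question (placeholder columns omitted).
df1 <- data.frame(
  title = rep(c("x", "y", "z"), each = 4),
  day = rep(0:3, times = 3),
  score_1 = c(0, 0, 0, 30, NA, NA, NA, NA, 0, 0, 0, 50),
  score_or_not1 = c(0, 0, 0, 1, NA, NA, NA, NA, 0, 0, 0, 1),
  score_day1 = c(3, 3, 3, 3, NA, NA, NA, NA, 3, 3, 3, 3),
  score_2 = c(NA, NA, NA, NA, 0, 80, 80, 80, 0, 0, 80, 80),
  score_or_not2 = c(NA, NA, NA, NA, 0, 1, 1, 1, 0, 0, 1, 1),
  score_day2 = c(NA, NA, NA, NA, 1, 1, 1, 1, 2, 2, 2, 2),
  score_3 = c(0, 0, 0, 0, NA, NA, NA, NA, 90, 90, 90, 90),
  score_or_not3 = c(0, 0, 0, 0, NA, NA, NA, NA, 1, 1, 1, 1),
  score_day3 = c(-2, -2, -2, -2, NA, NA, NA, NA, 0, 0, 0, 0)
)

# Hypothetical helper: summarise one row's scores, keeping only scores that
# were given (flag == 1) and whose score_day is positive; NAs never qualify.
summarise_scores <- function(score, given, score_day) {
  keep <- !is.na(score) & !is.na(given) & !is.na(score_day) &
    given == 1 & score_day > 0
  kept <- score[keep]
  c(avg_score = if (length(kept) >= 1) mean(kept) else 0,
    variance = if (length(kept) >= 2) var(kept) else 0,
    score_or_not = as.numeric(length(kept) >= 1))
}

people <- 1:3

# Apply the helper to every row; sapply() returns a 3 x n matrix, so transpose.
res <- t(sapply(seq_len(nrow(df1)), function(i) {
  summarise_scores(
    score = unlist(df1[i, paste0("score_", people)]),
    given = unlist(df1[i, paste0("score_or_not", people)]),
    score_day = unlist(df1[i, paste0("score_day", people)])
  )
}))

# Append the three new columns; the original score columns are left untouched.
out <- cbind(df1, res)
print(out[, c("title", "day", "avg_score", "variance", "score_or_not")])
```

The row-wise loop is slower than the reshape-and-join approach on large data, but it keeps each rule visible in one place.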