Imputing data with median by date in R

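I need to replace the missing values in the field "steps" with the median of "steps" calculated over that particular day (grouped by "date"), with NA values removed from the calculation. I have already referred to this thread, but my NA values are not being replaced. Can somebody help me find out where I am going wrong? I would prefer using the base package, data.table, or plyr. The dataset looks approximately like this: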

        steps      date interval
    1:    NA 2012-10-01        0
    2:    NA 2012-10-01        5
    3:    NA 2012-10-01       10
    4:    NA 2012-10-01       15
    5:    NA 2012-10-01       20
   ---                          
17564:    NA 2012-11-30     2335
17565:    NA 2012-11-30     2340
17566:    NA 2012-11-30     2345
17567:    NA 2012-11-30     2350
17568:    NA 2012-11-30     2355

The structure and summary of the dataset (activity) are shown below:

 #str(activity)  
 Classes ‘data.table’ and 'data.frame': 17568 obs. of  3 variables:
     $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
     $ date    : Date, format: "2012-10-01" "2012-10-01" "2012-10-01" ...
     $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

#summary(activity)
         steps             date               interval     
     Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
     1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
     Median :  0.00   Median :2012-10-31   Median :1177.5  
     Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
     3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
     Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
     NA's   :2304     

Things I have tried:

data.table method:

activityrepNA<-activity[,steps := ifelse(is.na(steps), median(steps, na.rm=TRUE), steps), by=date]
summary(activityrepNA)
     steps             date               interval     
 Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
 Median :  0.00   Median :2012-10-31   Median :1177.5  
 Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
 3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
 Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
 NA's   :2304 

Using ave:

activity$steps[is.na(activity$steps)] <- with(activity, ave(steps,date, FUN = function(x) median(x, na.rm = TRUE)))[is.na(activity$steps)]
> summary(activity)
     steps             date               interval     
 Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
 Median :  0.00   Median :2012-10-31   Median :1177.5  
 Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
 3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
 Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
 NA's   :2304

Trying ddply:

cleandatapls<-ddply(activity, 
+       .(as.character(date)), 
+       transform, 
+       steps=ifelse(is.na(steps), median(steps, na.rm=TRUE), steps))
> summary(cleandatapls)
 as.character(date)     steps             date               interval     
 Length:17568       Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
 Class :character   1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
 Mode  :character   Median :  0.00   Median :2012-10-31   Median :1177.5  
                    Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
                    3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
                    Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
                    NA's   :2304   

Using aggregate to calculate the median:

whynoclean<-aggregate(activity,by=list(activity$date),FUN=median,na.rm=TRUE)
> summary(whynoclean)
    Group.1               steps        date               interval   
 Min.   :2012-10-01   Min.   :0   Min.   :2012-10-01   Min.   :1178  
 1st Qu.:2012-10-16   1st Qu.:0   1st Qu.:2012-10-16   1st Qu.:1178  
 Median :2012-10-31   Median :0   Median :2012-10-31   Median :1178  
 Mean   :2012-10-31   Mean   :0   Mean   :2012-10-31   Mean   :1178  
 3rd Qu.:2012-11-15   3rd Qu.:0   3rd Qu.:2012-11-15   3rd Qu.:1178  
 Max.   :2012-11-30   Max.   :0   Max.   :2012-11-30   Max.   :1178  
                      NA's   :8                     

Using mutate:

Edit: code output
activity %>% group_by(date) %>% mutate(steps = replace(steps, is.na(steps), median(steps, na.rm = T)))
Source: local data table [17,568 x 3]

   steps       date interval
1     NA 2012-10-01        0
2     NA 2012-10-01        5
3     NA 2012-10-01       10
4     NA 2012-10-01       15
5     NA 2012-10-01       20
6     NA 2012-10-01       25
7     NA 2012-10-01       30
8     NA 2012-10-01       35
9     NA 2012-10-01       40
10    NA 2012-10-01       45
..   ...        ...      ... 

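Update:

Steven Beaupre helped me realize that my imputation approach was flawed: some dates contain only NA values, and since the median of an all-NA vector is itself NA, nothing was replaced on those dates. I used another suggested approach.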
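The point about all-NA days can be checked directly. The following is a minimal base R sketch on a small made-up data frame (`toy` is hypothetical, not the real `activity` data): it shows that the median of an all-NA vector stays NA even with `na.rm = TRUE`, and one way to list the days on which every value is missing. In the output above, the 2304 NAs in `steps` and the 8 NA medians from `aggregate` are consistent with 8 whole days being missing (8 days x 288 five-minute intervals = 2304).

```r
# Toy data (hypothetical, not the real `activity` set): the first day is
# entirely NA, the second day has a single missing value.
toy <- data.frame(
  steps    = c(NA, NA, NA, 4L, NA, 10L),
  date     = as.Date(rep(c("2012-10-01", "2012-10-02"), each = 3)),
  interval = rep(c(0L, 5L, 10L), times = 2)
)

# With na.rm = TRUE an all-NA vector becomes empty, and the median of an
# empty vector is NA, so there is nothing to impute with on that day.
first_day_median <- median(toy$steps[toy$date == as.Date("2012-10-01")],
                           na.rm = TRUE)
first_day_median

# Flag the days on which every value of `steps` is missing.
all_na_by_day <- vapply(split(toy$steps, format(toy$date)),
                        function(x) all(is.na(x)),
                        logical(1))
all_na_days <- names(all_na_by_day)[all_na_by_day]
all_na_days
```

Running the same check on the real data should show which dates need a different fill value, since a per-day median cannot supply one.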

Try:

library(dplyr)
# Within each date, replace missing steps with that date's median
df %>% 
  group_by(date) %>% 
  mutate(steps = ifelse(is.na(steps), median(steps, na.rm = TRUE), steps))

If, for a given day, all of the steps are NA, you could replace them with 0:

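df %>% 
  group_by(date) %>% 
  # Use if/else for the per-group test: ifelse() with a length-1 test
  # returns only the first element of its `no` argument
  mutate(steps = if (all(is.na(steps))) 0
                 else as.numeric(ifelse(is.na(steps),
                                        median(steps, na.rm = TRUE), steps)))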
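Since the question asked for a base R, data.table, or plyr option, here is a minimal base R sketch of the same idea using `ave()`. The toy data frame and the helper name `impute_median` are made up for illustration, and the fallback of 0 for all-NA days simply mirrors the suggestion above; whether 0 is a sensible fill value depends on the data.

```r
# Toy data (hypothetical): the first day is entirely NA, the second day has
# one missing value.
toy <- data.frame(
  steps = c(NA, NA, NA, 4, NA, 10),
  date  = as.Date(rep(c("2012-10-01", "2012-10-02"), each = 3))
)

# Hypothetical helper: fill NAs with the group's median, or with `fallback`
# when the whole group is missing.
impute_median <- function(x, fallback = 0) {
  if (all(is.na(x))) {
    return(rep(fallback, length(x)))
  }
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# ave() applies the helper within each date and returns a vector in the
# original row order.
toy$steps <- ave(toy$steps, format(toy$date), FUN = impute_median)
toy$steps
# 0 0 0 4 7 10
```

Note that if `steps` is stored as integer, as in the `activity` data, the imputed column may come back as numeric, because a median can be a non-integer value.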