R - lm, cooks.distance & 离群值(按组)
R - lm, cooks.distance & Outliers by Group
与外组合作的代码:
url <- "https://raw.githubusercontent.com/selva86/datasets/master/ozone.csv"
ozone <- read.csv(url)
ozone <- head(ozone,20)
mod <- lm(ozone_reading ~ ., data=ozone)
cooksd <- cooks.distance(mod)
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))]) # influential row numbers
(ozone[influential, ]) # influential observations.
根据我的新要求,我必须添加一个组,并且需要为每个组找到异常值。我的代码示例如下所示。我如何得到 cooks.distance 和小组的异常值?请帮助
url <- "https://raw.githubusercontent.com/selva86/datasets/master/ozone.csv"
ozone <- read.csv(url)
ozone <- head(ozone,20)
ozone$season <- c('summer','summer','summer','summer','summer','summer','summer','summer','summer','summer',
'winter','winter','winter','winter','winter','winter','winter','winter','winter','winter')
这里我需要按组计算mod、cooksd和influential。
简单地概括您的过程并使用 by
(tapply
的面向对象包装器)调用它,它通过一个或多个因素对数据帧进行子集化,并将子集传递到函数中以 return 等于不同组数的数据帧列表:
proc_cooks_outlier <- function(df) {
mod <- lm(ozone_reading ~ ., data=transform(df, season=NULL))
cooksd <- cooks.distance(mod)
# influential row numbers
influential <- as.integer(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=TRUE))])
return(df[complete.cases(df[influential,]),])
}
outlier_df_list <- by(ozone, ozone$season, FUN=proc_cooks_outlier)
# REFERENCE INDIVIDUAL DFs
outlier_df_list$summer
outlier_df_list$winter
...
# COMBINE ALL INTO ONE DF
master_outlier_df <- do.call(rbind, unname(outlier_df_list))
与外组合作的代码:
url <- "https://raw.githubusercontent.com/selva86/datasets/master/ozone.csv"
ozone <- read.csv(url)
ozone <- head(ozone,20)
mod <- lm(ozone_reading ~ ., data=ozone)
cooksd <- cooks.distance(mod)
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))]) # influential row numbers
(ozone[influential, ]) # influential observations.
根据我的新要求,我必须添加一个组,并且需要为每个组找到异常值。我的代码示例如下所示。我如何得到 cooks.distance 和小组的异常值?请帮助
url <- "https://raw.githubusercontent.com/selva86/datasets/master/ozone.csv"
ozone <- read.csv(url)
ozone <- head(ozone,20)
ozone$season <- c('summer','summer','summer','summer','summer','summer','summer','summer','summer','summer',
'winter','winter','winter','winter','winter','winter','winter','winter','winter','winter')
这里我需要按组计算mod、cooksd和influential。
简单地概括您的过程并使用 by
(tapply
的面向对象包装器)调用它,它通过一个或多个因素对数据帧进行子集化,并将子集传递到函数中以 return 等于不同组数的数据帧列表:
proc_cooks_outlier <- function(df) {
mod <- lm(ozone_reading ~ ., data=transform(df, season=NULL))
cooksd <- cooks.distance(mod)
# influential row numbers
influential <- as.integer(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=TRUE))])
return(df[complete.cases(df[influential,]),])
}
outlier_df_list <- by(ozone, ozone$season, FUN=proc_cooks_outlier)
# REFERENCE INDIVIDUAL DFs
outlier_df_list$summer
outlier_df_list$winter
...
# COMBINE ALL INTO ONE DF
master_outlier_df <- do.call(rbind, unname(outlier_df_list))