Removing negative values when a variable is an atomic vector
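I have a large survey dataset (originally a Stata (.dta) file). I would like to use the code below to convert the negative values in the dataset to NA. If more than 99% of a variable's values are NA, the code should drop that variable.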
#mixed data
WVS <- data.frame(file)
dat <- WVS[,sapply(df, function(x) {class(x)== "numeric" | class(x) == "integer"})]
# NEGATIVES -> NA
foo <- function(dat, p){
  ind <- colSums(is.na(dat))/nrow(dat)
  dat[dat < 0] <- NA
  dat[, ind < p]
}
# process numeric part of the data separately
ii <- sapply(WVS, class) == "numeric"
WVS.num <- foo(as.matrix(WVS[, ii]), 0.99)
# then stick the two parts back together again
WVS <- data.frame(WVS[, !ii], WVS.num)
However, this did not work, apparently because:
> is("S004")
[1] "character" "vector" "data.frameRowLabels" "SuperClassMethod" "index"
[6] "atomicVector"
str(WVS):
$ S004 :Class 'labelled' atomic [1:50] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
.. ..- attr(*, "label")= chr "Set"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:7] -5 -4 -3 -2 -1 1 2
.. .. ..- attr(*, "names")= chr [1:7] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
How can I adapt my code to handle this?
Update:
I have adapted the answer below and tried to run it in a loop (because my dataset is too large for the solution below).
# Creating a column with the same length as the original dataset
WVSc <- data.frame(x = 1:341271, y = c(NA))
# Loop for every column
for(i in 1:ncol(WVS))
# Replace all negatives with NA if possible
{try(WVS[,i] <- NA^(WVS[,i]<0) * WVS[,i])
# Select columns to keep and create a new dataframe from these columns
col_to_keep <- sapply(WVSx[,i], function(x) sum(is.na(x)/length(x))
col_to_keep <- names(col_to_keep[col_to_keep <= 0.99])
WVSc < - cbind(WVS,col_to_keep)}
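So the above does not work. Also, I would prefer the loop to delete the columns with more than 99% NA, rather than creating a new df and binding the columns with fewer NAs to it.
Since you did not provide an example, here is my best-guess solution. Hopefully it gives you a head start: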
cleanFun <- function(df){
  # select numeric columns (is.numeric covers both integer and double)
  num_cols <- names(df)[sapply(df, is.numeric)]
  # set negative values to NA, touching only the numeric columns
  for (col in num_cols) {
    x <- df[[col]]
    x[which(x < 0)] <- NA
    df[[col]] <- x
  }
  # get names of numeric columns with 99% or more NA values
  col_to_remove <- num_cols[colMeans(is.na(df[num_cols])) >= 0.99]
  # drop those columns
  df[setdiff(colnames(df), col_to_remove)]
}
your_df <- cleanFun(your_df)
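For the loop-based variant asked about in the update, here is a minimal sketch on made-up data. The `toy` data frame, its values, and the helper name `dropNegatives` are all hypothetical. It assumes, as the `str()` output in the question suggests, that the `labelled` columns are numeric underneath. The first two checks illustrate why comparing `class(x)` against `"numeric"` can skip such a column while `is.numeric(x)` does not.

```r
# Toy stand-in for a Stata import; S004 mimics a 'labelled' column (hypothetical data)
toy <- data.frame(id = 1:200, country = rep(c("A", "B"), 100),
                  stringsAsFactors = FALSE)
toy$S004 <- structure(c(rep(-4, 199), 1), class = "labelled")  # 99.5% negative
toy$S005 <- c(-1, rep(2, 199))                                   # 0.5% negative

# class() returns "labelled" here, so an equality test against "numeric" misses it
class(toy$S004) == "numeric"   # FALSE
is.numeric(toy$S004)           # TRUE

# Hypothetical helper: clean one column at a time and drop sparse columns in place
dropNegatives <- function(df, p = 0.99) {
  for (col in names(df)) {
    x <- df[[col]]
    if (!is.numeric(x)) next          # leave character/factor columns alone
    x[which(x < 0)] <- NA             # which() skips values that are already NA
    if (mean(is.na(x)) > p) {
      df[[col]] <- NULL               # more than p NA: remove the column
    } else {
      df[[col]] <- x                  # otherwise store the cleaned column
    }
  }
  df
}

cleaned <- dropNegatives(toy)
names(cleaned)   # "id" "country" "S005"
```

Working column by column avoids building a logical matrix the size of the whole data frame, which is what `df < 0` does, so it may be gentler on memory for a dataset of this size. Whether that is enough for the full file would need to be checked on the real data.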