Replace NA values in a full dataset using R

Question

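I am working with a dataset in which some missing values are marked as "?", and I have to replace them with the most common value (the mode) of their column. However, I want to write code that does this for the entire dataset at once.

So far I have: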

df <- read.csv("mushroom.txt", na.strings = "?",header=FALSE)

Now I am trying to replace all NA values in the file with the mode of their column. Please help.

Answer 1

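replaceQuestions <- function(vector) {
  # Treat both NA and a literal "?" as missing
  missing <- is.na(vector) | vector == "?"
  # Most frequent value among the observed entries
  mostCommon <- names(sort(table(vector[!missing]), decreasing = TRUE))[1]
  vector[missing] <- mostCommon
  vector
}

# apply() returns a character matrix, so convert back to a data frame
df <- as.data.frame(apply(df, 2, replaceQuestions))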

The question isn't reproducible, so I'm not sure this is what you are looking for, but it solves the problem as I understand it.
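For anyone who wants to try the idea on something reproducible, here is a minimal sketch on a made-up data frame. The `toy` data and the `fill_mode` helper are hypothetical names, not from the question. It uses lapply with `toy[] <-` so the result stays a data.frame, and it treats both NA and a literal "?" as missing.

```r
# Hypothetical toy data: missing entries appear as NA or as a literal "?"
toy <- data.frame(
  cap  = c("x", "b", "x", NA, "x"),
  odor = c("n", "?", "a", "a", "a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill missing entries with the column's most frequent value
fill_mode <- function(v) {
  missing <- is.na(v) | v == "?"
  counts <- table(v[!missing])          # frequencies of the observed values only
  v[missing] <- names(counts)[which.max(counts)]
  v
}

toy[] <- lapply(toy, fill_mode)         # toy[] keeps the data.frame structure
print(toy)
```

Ties are broken by whichever value table() lists first, so the choice is arbitrary when two values are equally common.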

Answer 2

Since you want to replace with the column's mode, you want to operate column-wise with apply and use is.na to identify the entries to replace.

apply(df, 2, function(x){ 
    x[is.na(x)] <- names(which.max(table(x)))
    return(x) })

Note that apply returns a matrix, so if you want a data.frame you need to convert the result with as.data.frame.
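To illustrate the point about the return type, here is a small sketch. The `toy` data frame is made up for illustration.

```r
# Made-up data with one NA per column
toy <- data.frame(a = c("x", NA, "x"), b = c("p", "q", NA),
                  stringsAsFactors = FALSE)

filled <- apply(toy, 2, function(x) {
  x[is.na(x)] <- names(which.max(table(x)))  # table() skips NA by default
  x
})

is.matrix(filled)                            # apply() gives back a matrix
filled_df <- as.data.frame(filled, stringsAsFactors = FALSE)
is.data.frame(filled_df)
```

Because the matrix is of type character, every column of `filled_df` is character too, even if the original column was numeric.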

Answer 3

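Your question mentions missing values marked as "?". If they are still present as "?" strings in the data (that is, the file was read without na.strings = "?"), I think this may help: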

apply(df, 2, function(x) gsub("\\?", names(sort(-table(x, exclude = "?")))[1], x))

The exclude part avoids picking "?" in case it happens to be the most frequent value. The \\ escapes the special character ? for gsub.
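A quick illustration of the escaping point, on a made-up vector: in R source code the backslash itself has to be doubled inside the string, or the regex can be bypassed entirely with fixed = TRUE.

```r
x <- c("a", "?", "b", "a")   # made-up column with one "?" entry

# Regex version: "\\?" matches a literal question mark
r1 <- gsub("\\?", "a", x)

# Plain-text version: fixed = TRUE disables regex interpretation
r2 <- gsub("?", "a", x, fixed = TRUE)

# exclude = "?" keeps "?" out of the frequency count
m <- names(sort(-table(x, exclude = "?")))[1]

print(r1)
print(m)
```

One thing to be aware of: passing exclude = "?" overrides table()'s default exclusion of NA, so this variant is best used on data where the missing values really are "?" strings rather than NA.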

====== Edit ======

gsub converts everything to text, so you need to convert it back to numeric:

a <- apply(df, 2, function(x) gsub("\\?", names(sort(-table(x, exclude = "?")))[1], x))
# as.numeric only makes sense for numeric columns; text values become NA
new_df <- as.data.frame(apply(a, 2, as.numeric))

The last line produces a new data frame.
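If the data mixes numeric and text columns, calling as.numeric on every column would turn the text columns into NA. One possible alternative, sketched here on made-up values, is to let type.convert guess a type for each column. The matrix `a` below is a hypothetical stand-in for the gsub result.

```r
# Character matrix like the one apply()/gsub() produce (made-up values)
a <- cbind(n = c("1", "2", "2"), s = c("x", "y", "x"))

new_df <- as.data.frame(a, stringsAsFactors = FALSE)
# Convert each column separately; as.is = TRUE keeps text as character
new_df[] <- lapply(new_df, type.convert, as.is = TRUE)

str(new_df)
```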

Answer 4

Alternatively:

apply(df, 2, function(x) {
  x[is.na(x)] <- Mode(x[complete.cases(x)])
  x})

This uses the well-known Mode function from SO. Link to the function: Is there a built-in function for finding the mode?

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
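A few made-up calls show how Mode behaves, including why the NAs are dropped before it is called. The helper is repeated here only so the snippet runs on its own.

```r
# Same helper as in the answer, repeated so this snippet is self-contained
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

m_num <- Mode(c(3, 1, 3, 2))        # works on numbers
m_chr <- Mode(c("b", "a", "b"))     # and on character vectors

x <- c("a", NA, NA)
m_na    <- Mode(x)                  # NA is counted like any other value
m_clean <- Mode(x[!is.na(x)])       # dropping NA first gives the observed mode

print(list(m_num, m_chr, m_na, m_clean))
```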

Answer 5

Use:

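# Mode() is the helper from Answer 4; base R's mode() returns the storage type instead
for (i in seq_len(ncol(dataframename))) {
  col <- dataframename[[i]]
  col[is.na(col)] <- Mode(col[!is.na(col)])
  dataframename[[i]] <- col
}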
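One caveat worth checking when adapting this loop: base R's mode() reports how an object is stored, not its most frequent value. A small made-up example shows the difference.

```r
x <- c("a", "b", "b", NA)                 # made-up column

storage  <- mode(x)                       # storage mode of the vector
frequent <- names(which.max(table(x)))    # most frequent non-NA value

print(storage)
print(frequent)
```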

Tags: r, data-manipulation