R data.table: replace NA with the mean for numeric columns and the most frequent value for nominal columns
I have the following data.table:
library(data.table)
x <- data.table(id1 = c("a", "a", "a", "b", "b", NA), id2 = c(2, 3, NA, 3, 4, 5))
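I am trying to replace the NA values in each column using a different strategy per column type. For numeric columns I want to replace them with the mean; for factor or character columns I want to replace them with the most frequent value. I tried the following, but it does nothing.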
for (j in 1:ncol(x)){
  if(is.numeric(unlist(x[,j,with=FALSE]))){
    m = mean(unlist(x[,j,with=FALSE]))
    set(x,which(is.na(x[[j]])),j,m)
  }else{
    m = sort(table(x),decreasing=TRUE)[[1]]
    set(x,which(is.na(x[[j]])),j,m)
  }
}
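Two things in the attempt above probably explain why nothing changes. mean() is called without na.rm = TRUE, so it returns NA and the NA is written straight back. In the else branch, table(x) tabulates the whole table instead of column j, and [[1]] pulls out a count, not a value. A minimal base R sketch using the question's column values (the variable names id1, id2 and counts are only for illustration):

```r
# Column values copied from the question's table
id1 <- c("a", "a", "a", "b", "b", NA)
id2 <- c(2, 3, NA, 3, 4, 5)

# mean() propagates NA unless na.rm = TRUE is given
mean_with_na <- mean(id2)                  # NA
mean_without_na <- mean(id2, na.rm = TRUE) # 3.4

# table() on a single column counts each value; NA is dropped by default
counts <- table(id1)
top_count <- sort(counts, decreasing = TRUE)[[1]]  # 3, a count rather than a value
top_value <- names(which.max(counts))              # "a", the value itself
```

So the replacement value has to come from mean(..., na.rm = TRUE) for numeric columns and from the names of the per-column table for the others.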
Using a base R approach, you could write a function like this:
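myFun <- function(x) {
  if (is.numeric(x)) {
    # Numeric column: fill NA with the mean of the non-missing values
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  } else {
    # Factor or character column: fill NA with the most frequent value
    x[is.na(x)] <- names(which.max(table(x)))
    x
  }
}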
... and apply it:
x[, lapply(.SD, myFun)]
#    id1 id2
# 1:   a 2.0
# 2:   a 3.0
# 3:   a 3.4
# 4:   b 3.0
# 5:   b 4.0
# 6:   a 5.0
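The call x[, lapply(.SD, myFun)] returns a new data.table and leaves x itself unchanged. If the goal is to update x in place, one option is to assign the result back with :=. This is a minimal sketch, assuming the same table as in the question; the helper is repeated so the block runs on its own, and the names cols and col are only illustrative.

```r
library(data.table)

x <- data.table(id1 = c("a", "a", "a", "b", "b", NA),
                id2 = c(2, 3, NA, 3, 4, 5))

# Same per-column helper as above, repeated so this block is self-contained
myFun <- function(col) {
  if (is.numeric(col)) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
  } else {
    col[is.na(col)] <- names(which.max(table(col)))
  }
  col
}

# Overwrite every column of x by reference with its imputed version
cols <- names(x)
x[, (cols) := lapply(.SD, myFun), .SDcols = cols]
```

After this, x itself holds the filled values, so no reassignment is needed.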
Note that if there is a tie, which.max takes the first maximum.
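To make the tie behaviour concrete, here is a small sketch with a made-up vector (v is hypothetical, not from the question). Because table() orders its entries by sorted value, "first maximum" means first in that sorted order, not first seen in the data.

```r
# Hypothetical vector with a tie between "a" and "b"
v <- c("b", "a", "b", "a", NA)
counts <- table(v)   # entries are ordered by value: a = 2, b = 2

# which.max() returns the first maximum in table order
first_mode <- names(which.max(counts))

# All tied values, in case the tie should be handled explicitly
all_modes <- names(counts)[as.vector(counts) == max(counts)]
```

Here first_mode is "a" even though "b" appears first in v, and all_modes contains both "a" and "b".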
I suppose it could also be written like this:
myFun <- function(inDT) {
  for (i in seq_len(ncol(inDT))) {
    # Pull column i out as a plain vector
    temp <- inDT[[i]]
    # Overwrite only the NA rows of column i, by reference
    set(inDT, which(is.na(temp)), i,
        if (is.numeric(temp)) {
          mean(temp, na.rm = TRUE)
        } else {
          names(which.max(table(temp)))
        })
  }
  inDT
}
y <- copy(x)
myFun(y)
#    id1 id2
# 1:   a 2.0
# 2:   a 3.0
# 3:   a 3.4
# 4:   b 3.0
# 5:   b 4.0
# 6:   a 5.0