Assigning categorical values to NAs randomly or proportionally
I have a dataset:
df <- structure(list(gender = c("female", "male", NA, NA, "male", "male",
"male"), Division = c("South Atlantic", "East North Central",
"Pacific", "East North Central", "South Atlantic", "South Atlantic",
"Pacific"), Median = c(57036.6262, 39917, 94060.208, 89822.1538,
107683.9118, 56149.3217, 46237.265), first_name = c("Marilyn",
"Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
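I need to run an analysis that requires the gender variable to contain no NA values. The other columns are too few and have no known predictive value, so it is not really possible to impute the values from them.
I could run the analysis by dropping the incomplete observations entirely (they make up about 4% of the dataset), but I would like to see the results when female or male is randomly assigned to the missing cases.
Apart from writing some very ugly code that filters the incomplete cases, splits them into two halves, and replaces the NAs in each half with female or male, is there an elegant way to assign values to the NAs randomly or proportionally?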
We can use ifelse and is.na to check whether a value is NA, and then use sample to randomly pick female or male.
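# Draw one value per row; sample(..., 1) would be recycled, giving every NA the same value
df$gender <- ifelse(is.na(df$gender), sample(c("female", "male"), nrow(df), replace = TRUE), df$gender)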
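The question also asks about proportional assignment. One way to read that is to draw each replacement with probabilities equal to the shares observed among the non-missing values. The sketch below works on a plain vector copied from the question's gender column; the seed is arbitrary and is only there to make the draw reproducible.

```r
# The gender column from the question, as a plain character vector
gender <- c("female", "male", NA, NA, "male", "male", "male")

set.seed(42)  # arbitrary seed, only for reproducibility

# Shares of each category among the non-missing values
# (table() drops NA by default)
observed <- prop.table(table(gender))

missing <- is.na(gender)

# One draw per missing entry, weighted by the observed shares
gender[missing] <- sample(names(observed), sum(missing),
                          replace = TRUE, prob = as.numeric(observed))

gender
```

With only five observed values the estimated shares are very rough, so on a dataset this small the result mostly illustrates the mechanics.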
How about this:
> df <- structure(list(gender = c("female", "male", NA, NA, "male", "male",
+ "male"),
+ Division = c("South Atlantic", "East North Central",
+ "Pacific", "East North Central", "South Atlantic", "South Atlantic",
+ "Pacific"),
+ Median = c(57036.6262, 39917, 94060.208, 89822.1538,
+ 107683.9118, 56149.3217, 46237.265),
+ first_name = c("Marilyn", "Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")),
+ row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
>
> Gender <- rbinom(length(df$gender), 1, 0.52)
> Gender <- factor(Gender, levels = c(0, 1), labels = c("female", "male"))
>
> df$gender[is.na(df$gender)] <- as.character(Gender[is.na(df$gender)])
>
> df$gender
[1] "female" "male" "female" "female" "male" "male" "male"
>
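This is random with a given probability (here 0.52 for male). You could also consider imputing the values with nearest-neighbour, hot-deck, or similar methods.
Hope this helps.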
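For reference, a random hot-deck fill could look roughly like the sketch below: each missing value is copied from a randomly chosen observed "donor" in the same group. The function name impute_hot_deck and the choice of Division as the grouping variable are assumptions made for illustration only; the question itself notes that the other columns may carry no predictive value, in which case grouping adds nothing.

```r
# Hypothetical helper: random hot-deck imputation within groups.
# Assumes `group` has no missing values and `x` has at least one observed value.
impute_hot_deck <- function(x, group) {
  pool <- x[!is.na(x)]  # all observed values, used as a fallback donor pool
  for (g in unique(group)) {
    in_group <- group == g
    holes <- which(in_group & is.na(x))
    if (length(holes) == 0) next
    donors <- x[in_group & !is.na(x)]
    # If the group has no observed value, fall back to the overall pool
    if (length(donors) == 0) donors <- pool
    # Index with sample.int() so a single donor is handled safely
    x[holes] <- donors[sample.int(length(donors), length(holes), replace = TRUE)]
  }
  x
}

gender <- c("female", "male", NA, NA, "male", "male", "male")
division <- c("South Atlantic", "East North Central", "Pacific",
              "East North Central", "South Atlantic", "South Atlantic",
              "Pacific")

imputed <- impute_hot_deck(gender, division)
imputed
```

In this tiny example each division with a missing value has exactly one donor, so the fill is not really random; on the full dataset the donors within a division would vary.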
Just assign:
df$gender[is.na(df$gender)] <- sample(c("female", "male"), nrow(df), replace = TRUE)[is.na(df$gender)]
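Because the fill is random, any single run is just one possible outcome. A rough way to see how much the results depend on the draw is to repeat the assignment several times and look at the spread of whatever statistic matters. In the sketch below, fill_na_random is a hypothetical helper wrapping the same idea as the one-liner above, and the share of "male" stands in for the real analysis.

```r
gender <- c("female", "male", NA, NA, "male", "male", "male")

# Hypothetical helper: fill NAs with equal-probability random draws
fill_na_random <- function(x, values = c("female", "male")) {
  missing <- is.na(x)
  x[missing] <- sample(values, sum(missing), replace = TRUE)
  x
}

set.seed(1)  # arbitrary seed, only for reproducibility

# Repeat the imputation 100 times and record the share of "male" each time
male_share <- replicate(100, mean(fill_na_random(gender) == "male"))

# Spread of the statistic across the repeated imputations
summary(male_share)
```

If the spread is narrow relative to the effect being studied, the random assignment is unlikely to drive the conclusions.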