Imputing missing values in a grouped dataframe
I am imputing missing values in a grouped data frame.
In DF, the missing values in Var1 and Var2 occur at random.
The data frame is grouped by Factory:MachineNum.
Within each group, imputation follows the order of Odometer.
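The code runs perfectly about 5-10% of the time. The other 90-95% of the time it fails with:
"Error: Column Impute must be length 50 (the group size) or one, not 49".
I suspect it is related to the randomness of the missing values, perhaps when at least one row has missing values in both variables.
How can I make this code more robust?
If you run the whole code several times, you will see that it succeeds in roughly 5-10% of attempts and eventually produces the Results data frame.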
library(dplyr)
library(tidyr)
# Create dataframe with some missing values in Var1 and Var2
DF <- data.frame(Factory = c(replicate(150,"Factory_A"), replicate(150,"Factory_B")),
                 MachineNum = c(replicate(100,"Machine01"), replicate(100,"Machine02"), replicate(100,"Machine03")),
                 Odometer = c(replicate(1,sample(1:1000,100,rep=FALSE)), replicate(1,sample(5000:7000,100,rep=FALSE)), replicate(1,sample(10000:11500,100,rep=FALSE))),
                 Var1 = c(replicate(1, sample(c(2:10, NA), 100, rep = TRUE)), replicate(1, sample(c(15:20, NA), 100, rep = TRUE)), replicate(1, sample(c(18:24, NA), 100, rep = TRUE))),
                 Var2 = c(replicate(1, sample(c(110:130, NA), 100, rep = TRUE)), replicate(1, sample(c(160:170, NA), 100, rep = TRUE)), replicate(1, sample(c(220:230, NA), 100, rep = TRUE)))
)
# Variables with missing values that need imputing
cols <- grep('Var', names(DF), value = TRUE)
# Group-wise imputation of missing values
library(stinepack)
Models <- DF %>%
  pivot_longer(cols = starts_with('Var')) %>%
  arrange(Factory, MachineNum, name, Odometer) %>%
  group_by(Factory, MachineNum, name) %>%
  mutate(Impute = na.stinterp(value, along = time(Odometer), na.rm = TRUE))
# Convert results from long to wide to visually inspect
Results <- Models %>%
  group_by(Factory, MachineNum, name) %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_wider(names_from = name, values_from = c(value, Impute))
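The error occurs when a group has leading or trailing NA values: because you have na.rm = TRUE, na.stinterp removes them, so the result is shorter than the group.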
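The length mismatch can be reproduced on a single vector. This is only a minimal sketch: `x` is a made-up vector (not taken from the question's data) with one leading, one interior, and one trailing NA.

```r
library(stinepack)

# Hypothetical vector: leading NA, interior NA, trailing NA
x <- c(NA, 2, 4, NA, 8, 10, NA)

# na.rm = TRUE: the interior NA is interpolated, the edge NAs are dropped
trimmed <- na.stinterp(x, na.rm = TRUE)

# na.rm = FALSE: the interior NA is interpolated, the edge NAs are kept
kept <- na.stinterp(x, na.rm = FALSE)

length(x)        # 7
length(trimmed)  # shorter than x, which is what mutate() rejects
length(kept)     # same length as x
```

Inside a grouped mutate(), the result must have either the group's length or length one, so the shortened vector triggers the error.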
If you set na.rm to FALSE, those NA values stay NA and the code runs without error.
library(dplyr)
library(tidyr)      # needed for pivot_longer()
library(stinepack)
DF %>%
  pivot_longer(cols = starts_with('Var')) %>%
  arrange(Factory, MachineNum, name, Odometer) %>%
  group_by(Factory, MachineNum, name) %>%
  # na.rm = FALSE keeps leading/trailing NAs, so each group keeps its length
  mutate(Impute = na.stinterp(value, along = time(Odometer), na.rm = FALSE))
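With na.rm = FALSE, the leading and trailing NA values remain in Impute. If those also need a value, one possible follow-up is to carry the nearest imputed value outward with tidyr::fill(). The sketch below uses a hypothetical toy data frame (`toy`, `grp`, `odo` are illustration names, not from the question), and constant extension at the edges is an assumption that may not suit every dataset.

```r
library(dplyr)
library(tidyr)
library(stinepack)

# Hypothetical toy data: two groups, each with NAs at the edges
toy <- data.frame(
  grp   = rep(c("a", "b"), each = 6),
  odo   = rep(1:6, times = 2),
  value = c(NA, 2, NA, 5, 7, NA,
            10, NA, 14, 15, NA, NA)
)

filled <- toy %>%
  arrange(grp, odo) %>%
  group_by(grp) %>%
  # Interior NAs are interpolated; edge NAs are kept as NA
  mutate(Impute = na.stinterp(value, along = time(odo), na.rm = FALSE)) %>%
  # Assumption: extend the nearest imputed value over the remaining edge NAs
  fill(Impute, .direction = "downup") %>%
  ungroup()

anyNA(filled$Impute)  # FALSE
```

Whether the edge values should be filled at all, or left as NA, depends on how the imputed series is used afterwards.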