使用 R 中的线性回归按组填充缺失值

Question

我有一个包含大约 50 列（我从世界银行获得的所有指标）、国家代码和年份的数据集。这 50 列并不都是完整的，我想根据适合该特定国家/地区的列的 lm 填充缺失值。例如：

在执行以下步骤时，对单个国家/地区和单个列执行此操作绝对没问题：

但是，我有 180 多个不同的国家/地区要这样做。我希望这适用于每个国家/地区的每个指标（总共 50 列）所以在某种程度上，每个国家/地区和每一列都有自己的线性回归模型来填充缺失值。

这是我执行上述步骤后的样子：这是一列的预期输出。我想针对各个国家/地区组的每个列执行此操作。

然而，数据是这样的：

就像上面的 post 一样，我想在许多国家和列上执行此操作。

这是我正在进行的数据挖掘/统计项目 class。任何帮助将不胜感激，并提前致谢！

编辑

我试过这个：

# Reading CSV File
WorldBank=read.csv(file.choose(),header=T, stringsAsFactors=TRUE)
colnames(WorldBank)[1] <- "Country_Code"
colnames(WorldBank)[2] <- "Year"



Countries <- t.data.frame(distinct((WorldBank)[1])) # to create an 
array of distinct countries I can loop over
CountryCount <- ncol(Countries) #this is what I intend to use
CountryCountDemo <- 1 # this is what I used for this example to only look at one country.
# Making new data
Worldbankone <- WorldBank

#CountryVarList <- names(Worldbankone[,3:57]) # Column 3 to 57. Not including year and country code.
df_total = data.frame() # Final data that will hold each iteration

for (i in 1:CountryCountDemo) {
  for (j in names(Worldbankone[,3:57])){

      country_temp <- Countries[i]
      Worldbanktemp <- Worldbankone %>% filter(Country_Code == 
      country_temp)
# If entire column has a null, make it all 0 (this needs to happen to perform linear regression on each column with all NAs)

if (all(is.na(c(Worldbanktemp[[j]]))) == TRUE) {
  
  Worldbanktemp[[j]] <- 0 
  
} else {
  
 


# Staging data to keep all values that do not have an NA value for that column for linear regression model
Worldbanktemp2 <- Worldbanktemp %>% filter(!is.na(.[[j]]))

# Creating fit. The only way to incorporate a variable like 'j' is by pasting it in.
fit <- lm(formula = paste0("`",j,"` ~ Year"), data = Worldbanktemp2)
# making a list that holds all the predicted values
pred = predict(fit,Worldbanktemp)



# Here is where things go wrong...
# I cannot used mutate here. It has no impact on the Worldbanktemp3 when I mutate Worldbanktemp. I tried both if_else and ifelse. I also tried as.name(j) and unquote(j) and !!as.name(j). 
Worldbanktemp3 <- Worldbanktemp %>% 
  # Replace NA with pred in var1
  mutate(!!j := if_else(is.na(!!as.name(j)), pred, !!as.name(j)))}
#df_total = rbind(df_total,Worldbanktemp3)}
#Fitting code before this
#Worldbankresult <-rbind(Worldbankresult,Worldbanktemp2)


}}

编辑 2

# Reproducible example
sampledata2 = structure(list(Country_Code = structure(c(3L, 3L, 3L, 3L, 3L, 
                                                3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
                                                5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
                                                5L, 5L, 5L, 5L), .Label = c("", "ABW", "AFG", "AGO", "ALB", "AND", 
                                                                            "ARE", "ARG", "ARM", "ASM", "ATG", "AUS", "AUT", "AZE", "BDI", 
                                                                            "BEL", "BEN", "BFA", "BGD", "BGR", "BHR", "BHS", "BIH", "BLR", 
                                                                            "BLZ", "BMU", "BOL", "BRA", "BRB", "BRN", "BTN", "BWA", "CAF", 
                                                                            "CAN", "CHE", "CHI", "CHL", "CHN", "CIV", "CMR", "COD", "COG", 
                                                                            "COL", "COM", "CPV", "CRI", "CUB", "CUW", "CYM", "CYP", "CZE", 
                                                                            "DEU", "DJI", "DMA", "DNK", "DOM", "DZA", "ECU", "EGY", "ERI", 
                                                                            "ESP", "EST", "ETH", "FIN", "FJI", "FRA", "FRO", "FSM", "GAB", 
                                                                            "GBR", "GEO", "GHA", "GIB", "GIN", "GMB", "GNB", "GNQ", "GRC", 
                                                                            "GRD", "GRL", "GTM", "GUM", "GUY", "HKG", "HND", "HRV", "HTI", 
                                                                            "HUN", "IDN", "IMN", "IND", "IRL", "IRN", "IRQ", "ISL", "ISR", 
                                                                            "ITA", "JAM", "JOR", "JPN", "KAZ", "KEN", "KGZ", "KHM", "KIR", 
                                                                            "KNA", "KOR", "KWT", "LAO", "LBN", "LBR", "LBY", "LCA", "LIE", 
                                                                            "LKA", "LSO", "LTU", "LUX", "LVA", "MAC", "MAF", "MAR", "MCO", 
                                                                            "MDA", "MDG", "MDV", "MEX", "MHL", "MKD", "MLI", "MLT", "MMR", 
                                                                            "MNE", "MNG", "MNP", "MOZ", "MRT", "MUS", "MWI", "MYS", "NAM", 
                                                                            "NCL", "NER", "NGA", "NIC", "NLD", "NOR", "NPL", "NRU", "NZL", 
                                                                            "OMN", "PAK", "PAN", "PER", "PHL", "PLW", "PNG", "POL", "PRI", 
                                                                            "PRK", "PRT", "PRY", "PSE", "PYF", "QAT", "ROU", "RUS", "RWA", 
                                                                            "SAU", "SDN", "SEN", "SGP", "SLB", "SLE", "SLV", "SMR", "SOM", 
                                                                            "SRB", "SSD", "STP", "SUR", "SVK", "SVN", "SWE", "SWZ", "SXM", 
                                                                            "SYC", "SYR", "TCA", "TCD", "TGO", "THA", "TJK", "TKM", "TLS", 
                                                                            "TON", "TTO", "TUN", "TUR", "TUV", "TZA", "UGA", "UKR", "URY", 
                                                                            "USA", "UZB", "VCT", "VEN", "VGB", "VIR", "VNM", "VUT", "WSM", 
                                                                            "XKX", "YEM", "ZAF", "ZMB", "ZWE"), class = "factor"), Year = c(2000L, 
                                                                                                                                            2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 
                                                                                                                                            2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 
                                                                                                                                            2019L, 2020L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 
                                                                                                                                            2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 
                                                                                                                                            2016L, 2017L, 2018L, 2019L), EG.ELC.ACCS.ZS = c(NA, NA, NA, NA, 
                                                                                                                                                                                            NA, 22.29526901, 28.09996223, 33.9016037, 42.4, 45.52068329, 
                                                                                                                                                                                            42.7, 43.22201891, 69.1, 68.98294067, 89.5, 71.5, 97.7, 97.7, 
                                                                                                                                                                                            98.71562195, 97.7, NA, 100, 100, 100, 100, 100, 100, 100, 100, 
                                                                                                                                                                                            100, 100, 100, 100, 99.9, 100, 99.95, 99.98, 99.89, 99.89, 100, 
                                                                                                                                                                                            100), NY.GDP.COAL.RT.ZS = c(NA, NA, 0.004340991, 0.007421791, 
                                                                                                                                                                                                                        0.016454885, 0.010903631, 0.011152411, 0.076415204, 0.224465173, 
                                                                                                                                                                                                                        0.126422321, 0.212188085, 0.495356516, 0.270285101, 0.237255122, 
                                                                                                                                                                                                                        0.226599986, 0.162117663, 0.24730234, 0.397626964, 0.535542773, 
                                                                                                                                                                                                                        0.373839671, NA, 0.000620564, 0.001277136, 0.000274595, 0.000890331, 
                                                                                                                                                                                                                        0.006822021, 0.003077833, 0.002931413, 0.004030065, 0.013985094, 
                                                                                                                                                                                                                        0.000729611, 0.001394767, 0.000801741, 0.000360916, 0.000108033, 
                                                                                                                                                                                                                        0, 0.005924115, 0.00038768, 0.015576529, 0.033583408, 0.024377128
                                                                                                                                                                                            ), NY.GDP.FRST.RT.ZS = c(NA, NA, 0.958003595, 0.664331186, 0.387787089, 
                                                                                                                                                                                                                     0.332204753, 0.454077242, 0.342748815, 0.353698168, 0.274778262, 
                                                                                                                                                                                                                     0.358436235, 0.309317612, 0.263852446, 0.259193525, 0.267631558, 
                                                                                                                                                                                                                     0.290360037, 0.397862173, 0.26727029, 0.335310319, 0.367809087, 
                                                                                                                                                                                                                     NA, 0.126636828, 0.062644576, 0.066551134, 0.062998763, 0.053996825, 
                                                                                                                                                                                                                     0.04788751, 0.054154854, 0.074771754, 0.069714105, 0.074808569, 
                                                                                                                                                                                                                     0.065976686, 0.210223146, 0.182521796, 0.165152821, 0.169255817, 
                                                                                                                                                                                                                     0.191365261, 0.15742073, 0.176432723, 0.169991392, 0.16648548
                                                                                                                                                                                            )), row.names = c(NA, 41L), class = "data.frame")

Answer 1

由于您已经知道如何针对一个国家/地区的一个数据框执行此操作，因此您已经非常接近您的解决方案了。但要让这对您自己变得轻松，您需要做一些事情。

使用 dput 创建一个 reproducible example。 janitor 库具有 clean_names() 函数来修复列名。
编写您自己的插值函数，以一个国家/地区的数据框作为输入，returns一个国家/地区的插值数据框。
Pivot_longer 将所有数据列合并到一个参数化列中。
使用 dplyr 函数 group_split 获取您的大型多国数据框，并将其分解为一个数据框列表，每个国家和参数一个。
使用purrr函数map将列表中的每个数据帧映射到新的插值数据帧列表。
使用 dplyr 的 bind_rows to convert the list interpolated dataframes back into one dataframe, and pivot_wider 恢复原始数据形状。


library(tidyverse)
library(purrr)
library(janitor)

my_country_interpolater<-function(single_country_df){
  
  data_to_build_model<-single_country_df %>%
    filter(!is.na(value)) %>%
    select(year,value)
  
  years_to_interpolate<-single_country_df %>%
    filter(is.na(value)) %>%
    select(year)
  
  fit<-lm(value ~ year, data = data_to_build_model)
  value = predict(fit,years_to_interpolate)
  
  
  interpolated_data<-tibble(years_to_interpolate, value)
  
  single_country_interpolated_df<-bind_rows(data_to_build_model,interpolated_data) %>% 
    mutate(country_code=single_country_df$country_code[1]) %>%
    mutate(parameter=single_country_df$parameter[1]) %>%  # added this for the additional parameters
    select(country_code, year, parameter, value) %>%
    arrange(year) 
  
  return (single_country_interpolated_df)
}

interpolated_df <-sampledata2 %>%
  clean_names() %>% 
  pivot_longer(cols=c(3:5),names_to = "parameter", values_to="value") %>%
  group_by(country_code,parameter) %>%
  group_split() %>% 
 # map(preprocess_data) %>% if you need a preprocessing step
  map(my_country_interpolater) %>%
  bind_rows() %>%
  pivot_wider(names_from = parameter, values_from=value, names_glue = "{parameter}_interp")

使用 R 中的线性回归按组填充缺失值

Fill missing values by group using linear regression in R

r

linear-regression

missing-data

na