如何运行在不同时间开始的列的多元线性回归模型，在 R 中以新的数据帧作为输出？

Question

我正在尝试运行多元线性回归模型，数据框中的列都在不同时间开始。

df = structure(list(Date_Time_GMT_3 = 
                      structure(c(1622552400, 1622553300,1622554200, 1622555100, 1622556000, 1622556900), 
                                class = c("POSIXct","POSIXt"), 
                                tzone = "EST"),
                    X20819830_R1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 16.808, 16.713, 17.753), 
                    X20819742_R1AR_S_Stationary = c(16.903, 16.828, 16.808, NA_real_, NA_real_, NA_real_), 
                    X20822215_R3AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846), 
                    X20822215_R3AR_S_Stationary = c(13.942, 13.972, 13.842, NA_real_, NA_real_, NA_real_), 
                    X20874235_R4AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 14.134, 14.534, 14.404), 
                    X20874235_R4AR_S_Stationary = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_), 
                    X20874311_F1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 15.187, 15.327, 15.567), 
                    X20874311_F1AR_S_Stationary = c(15.282, 15.387, 15.587, NA_real_, NA_real_, NA_real_), 
                    X20817727_F8AR_U = c(15.421, 14.441, 14.631, 14.781, 15.521, 15.821), 
                    X20819742_X1AR_U = c(14.996, 15.996, 14.776, 14.920, 14.870, 14.235), 
                    X20819742_R2AR_U = c(14.781, 15.521, 15.821, NA_real_, NA_real_, NA_real_), 
                    X20817727_R5AR_U = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846), 
                    X20817727_R7AR = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_)), 
               row.names = c(NA, 6L), class = "data.frame")

我正在尝试对列名称中不带 stationary 的所有列执行线性回归模型到列名称中带 stationary 的列（即 x 轴上的 stationary ).但是，正如您在示例数据框中看到的那样，所有列的开始和结束时间都不同，因此当 stationary 和非固定列具有值时，我还需要线性回归模型仅运行同时

总的来说，当运行针对每个 stationary 记录器时，我希望代码的输出为每个“非固定”记录器提供所有值。例如，table 如下所示...

X20819830_R1AR_U_Stationary
Logger_ID        Reg_equation R_Squared Estimate_Std. Std_Error  Pr_t..
  <chr>            <int>               <int>     <int>        <int>     <int>   
1 X20676887_F8AR_U NA                  NA        NA            NA         NA      
2 X20819831_X1AR_U NA                  NA        NA            NA         NA      
3 X20822214_R2AR_U NA                  NA        NA            NA         NA  
1 X20676887_R7AR_U NA                  NA        NA            NA         NA      
2 X20819831_R5AR_U NA                  NA        NA            NA         NA      
    

X20822215_R3AR_U_Stationary
Logger_ID        Reg_equation R_Squared Estimate_Std. Std_Error  Pr_t..
  <chr>            <int>               <int>     <int>        <int>     <int>   
1 X20676887_F8AR_U NA                  NA        NA            NA         NA      
2 X20819831_X1AR_U NA                  NA        NA            NA         NA      
3 X20822214_R2AR_U NA                  NA        NA            NA         NA  
1 X20676887_R7AR_U NA                  NA        NA            NA         NA      
2 X20819831_R5AR_U NA                  NA        NA            NA         NA

如果我只有 1 个 stationary 列，并且如果所有列在整个时间内都有值……我有代码可以做到这一点……看起来像这样

library(tidyverse)
library(broom)
df1 %>% 
  pivot_longer(
    cols = starts_with("X")
  ) %>% 
  mutate(name = factor(name)) %>% 
  group_by(name) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(X20819830_R3AR_U_Stationary ~ value, data = df) %>% 
      glance() %>% 
      # tidy() %>%  
      add_column(name = unique(df$name), .before=1)
  })

但如您所见，它只考虑了 1 stationary 列，当我运行它与我提供的示例数据框时，我得到了这个错误

Error in eval(predvars, data, env) : 
object 'X20819830_R3AR_U_Stationary' not found

有什么想法吗？

Answer 1

既然您已经清楚想要实现的目标，事实证明通过应用 pivot_longer 两次非常容易，一次用于固定式，然后一次用于 non-stationary 记录器。

旁白：请看解释结果时如何修正多重假设检验。这是很多回归。

library("broom")
library("tidyverse")

extract_statistics <- function(fit) {
  bind_cols(
    # Extract statistics about model coefficients
    tidy(fit) %>% filter(term != "(Intercept)"),
    # Extract statistics about model fit
    glance(fit) %>% select(matches("r.squared"))
  )
}

results <-
  as_tibble(df) %>%
  pivot_longer(
    ends_with("STATIONARY"),
    names_to = "response",
    values_to = "y"
  ) %>%
  # Every column other than the time index & the newly minted response & y columns
  # corresponds to a non-stationary logger.
  pivot_longer(
    -c(Date_Time_GMT_3, response, y),
    names_to = "predictor",
    values_to = "x"
  ) %>%
  # It's not strictly necessary; `lm` drops data points with missing values
  drop_na() %>%
  group_by(
    response, predictor
  ) %>%
  group_modify(
    # ~ tidy(lm(y ~ x, data = .))
    # ~ glance(lm(y ~ x, data = .))
    ~ extract_statistics(lm(y ~ x, data = .))
  ) %>%
  ungroup()
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
results
#> # A tibble: 28 × 9
#>    response       predictor term  estimate std.error statistic p.value r.squared
#>    <chr>          <chr>     <chr>    <dbl>     <dbl>     <dbl>   <dbl>     <dbl>
#>  1 X20819742_R1A… X2081772… x       0.0893   0.0362      2.47   0.245     0.859 
#>  2 X20819742_R1A… X2081772… x       0.599    0.677       0.885  0.539     0.439 
#>  3 X20819742_R1A… X2081974… x      -0.0932   0.00776   -12.0    0.0528    0.993 
#>  4 X20819742_R1A… X2081974… x      -0.0117   0.0761     -0.154  0.903     0.0231
#>  5 X20819830_R1A… X2081772… x       0.712    0.804       0.886  0.539     0.440 
#>  6 X20819830_R1A… X2081772… x     -10.3      0.857     -12.1    0.0527    0.993 
#>  7 X20819830_R1A… X2081974… x      -1.49     0.222      -6.70   0.0944    0.978 
#>  8 X20822215_R3A… X2081772… x       0.0154   0.130       0.118  0.925     0.0138
#>  9 X20822215_R3A… X2081772… x       1.20     0.271       4.43   0.141     0.951 
#> 10 X20822215_R3A… X2081974… x      -0.0703   0.106      -0.663  0.627     0.305 
#> # … with 18 more rows, and 1 more variable: adj.r.squared <dbl>

让我们看看导致完美匹配警告的记录器对：

results %>%
  slice_min(
    std.error,
    n = 2
  )
#> # A tibble: 2 × 9
#>   response       predictor term  estimate std.error statistic  p.value r.squared
#>   <chr>          <chr>     <chr>    <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
#> 1 X20822215_R3A… X2081772… x            1  1.77e-16   5.65e15 1.13e-16         1
#> 2 X20874235_R4A… X2081772… x            1  1.77e-16   5.65e15 1.13e-16         1
#> # … with 1 more variable: adj.r.squared <dbl>

是的，非常合适。

df %>%
  select(X20822215_R3AR_U_Stationary, X20817727_R5AR_U)
#>   X20822215_R3AR_U_Stationary X20817727_R5AR_U
#> 1                          NA               NA
#> 2                          NA               NA
#> 3                          NA               NA
#> 4                      13.942           13.942
#> 5                      13.942           13.942
#> 6                      13.846           13.846

df %>%
  select(X20874235_R4AR_S_Stationary, X20817727_R7AR)
#>   X20874235_R4AR_S_Stationary X20817727_R7AR
#> 1                      14.230         14.230
#> 2                      14.230         14.230
#> 3                      14.134         14.134
#> 4                          NA             NA
#> 5                          NA             NA
#> 6                          NA             NA

^{由 reprex package (v2.0.1)}

创建于 2022-03-21

我提前为代码重复道歉。这是对 OP 进一步问题的回应。

这里是如何切换 response 和 predictor 的角色。

df <- structure(list(
  Date_Time_GMT_3 =
    structure(c(1622552400, 1622553300, 1622554200, 1622555100, 1622556000, 1622556900),
      class = c("POSIXct", "POSIXt"),
      tzone = "EST"
    ),
  X20819830_R1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 16.808, 16.713, 17.753),
  X20819742_R1AR_S_Stationary = c(16.903, 16.828, 16.808, NA_real_, NA_real_, NA_real_),
  X20822215_R3AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846),
  X20822215_R3AR_S_Stationary = c(13.942, 13.972, 13.842, NA_real_, NA_real_, NA_real_),
  X20874235_R4AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 14.134, 14.534, 14.404),
  X20874235_R4AR_S_Stationary = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_),
  X20874311_F1AR_U_Stationary = c(NA_real_, NA_real_, NA_real_, 15.187, 15.327, 15.567),
  X20874311_F1AR_S_Stationary = c(15.282, 15.387, 15.587, NA_real_, NA_real_, NA_real_),
  X20817727_F8AR_U = c(15.421, 14.441, 14.631, 14.781, 15.521, 15.821),
  X20819742_X1AR_U = c(14.996, 15.996, 14.776, 14.920, 14.870, 14.235),
  X20819742_R2AR_U = c(14.781, 15.521, 15.821, NA_real_, NA_real_, NA_real_),
  X20817727_R5AR_U = c(NA_real_, NA_real_, NA_real_, 13.942, 13.942, 13.846),
  X20817727_R7AR = c(14.23, 14.23, 14.134, NA_real_, NA_real_, NA_real_)
),
row.names = c(NA, 6L), class = "data.frame"
)

library("broom")
library("tidyverse")

extract_statistics <- function(fit) {
  bind_cols(
    # Extract statistics about model coefficients
    tidy(fit) %>% filter(term != "(Intercept)"),
    # Extract statistics about model fit
    glance(fit) %>% select(matches("r.squared"))
  )
}

results <-
  as_tibble(df) %>%
  # Pivot non-stationary loggers
  pivot_longer(
    -c(Date_Time_GMT_3, ends_with("STATIONARY")),
    names_to = "response",
    values_to = "y"
  ) %>%
  # Every column other than the time index & the newly minted response & y columns
  # corresponds to a non-stationary logger.
  pivot_longer(
    -c(Date_Time_GMT_3, response, y),
    names_to = "predictor",
    values_to = "x"
  ) %>%
  # It's not strictly necessary; `lm` drops data points with missing values
  drop_na() %>%
  group_by(
    response, predictor
  ) %>%
  group_modify(
    # ~ tidy(lm(y ~ x, data = .))
    # ~ glance(lm(y ~ x, data = .))
    ~ extract_statistics(lm(y ~ x, data = .))
  ) %>%
  ungroup()
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
results
#> # A tibble: 28 × 9
#>    response      predictor term  estimate std.error statistic  p.value r.squared
#>    <chr>         <chr>     <chr>    <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
#>  1 X20817727_F8… X2081974… x       9.62    3.90e+ 0  2.47e+ 0 2.45e- 1    0.859 
#>  2 X20817727_F8… X2081983… x       0.617   6.97e- 1  8.86e- 1 5.39e- 1    0.440 
#>  3 X20817727_F8… X2082221… x       0.896   7.58e+ 0  1.18e- 1 9.25e- 1    0.0138
#>  4 X20817727_F8… X2082221… x      -6.98    6.68e+ 0 -1.05e+ 0 4.86e- 1    0.522 
#>  5 X20817727_F8… X2087423… x       3.13    8.84e+ 0  3.53e- 1 7.84e- 1    0.111 
#>  6 X20817727_F8… X2087423… x       2.15    1.50e+ 0  1.44e+ 0 3.87e- 1    0.673 
#>  7 X20817727_F8… X2087431… x      -2.12    2.60e+ 0 -8.18e- 1 5.64e- 1    0.401 
#>  8 X20817727_F8… X2087431… x       2.58    1.06e+ 0  2.43e+ 0 2.49e- 1    0.855 
#>  9 X20817727_R5… X2081983… x      -0.0961  7.96e- 3 -1.21e+ 1 5.27e- 2    0.993 
#> 10 X20817727_R5… X2082221… x       1       1.77e-16  5.65e+15 1.13e-16    1     
#> # … with 18 more rows, and 1 more variable: adj.r.squared <dbl>

^{由 reprex package (v2.0.1)}

创建于 2022-03-31

如何运行在不同时间开始的列的多元线性回归模型，在 R 中以新的数据帧作为输出？

How to run multiple linear regression models for columns that start at different times, with a new dataframe as output, in R?

r

linear-regression

dataframe

如何 运行 在不同时间开始的列的多元线性回归模型，在 R 中以新的数据帧作为输出？

How to run multiple linear regression models for columns that start at different times, with a new dataframe as output, in R?

r

linear-regression

dataframe

如何运行在不同时间开始的列的多元线性回归模型，在 R 中以新的数据帧作为输出？