R 与 SPSS 中虚拟变量的标准化回归系数

Standardized regression coefficients with dummy variables in R vs. SPSS

我发现标准化 (beta) 系数与使用虚拟编码变量的 R 和 SPSS 计算的线性回归模型存在令人费解的差异。我已经使用了hsb2数据集并创建了一个对比(虚拟编码),所以第三类是参考。这是 R 代码:

# Read the data
hsb2 <- read.table('https://stats.idre.ucla.edu/stat/data/hsb2.csv', header = TRUE, sep = ",")

# Create a factor variable with respondents' race
hsb2$race.f <- factor(hsb2$race, labels = c("Hispanic", "Asian", "African-Am", "Caucasian"))

# Add a contrast (dummy coding) to the new race variable, so that the third category is the reference.
contrasts(hsb2$race.f) <- contr.treatment(n = 4, base = 3)

# Scale the writing achievement score (mean of 0 and SD of 1), it will be the dependent variable
hsb2$write <- scale(hsb2$write)

# Fit the model and print the summary
summary(lm(write ~ race.f, hsb2))

我得到的输出:

Call:
lm(formula = write ~ race.f, data = hsb2)

Residuals:
                 Min                   1Q               Median                   3Q                  Max 
-2.43234300577889240 -0.57585945002954031  0.10259059641484436  0.73850677561040290  1.98341819735365221 

Coefficients:
                        Estimate           Std. Error              t value  Pr(>|t|)   
(Intercept) -0.48266692834536767  0.21290900103341129 -2.26700999999999997 0.0244812 * 
race.f1     -0.18374751916973245  0.28828015018135283 -0.63739000000000001 0.5246133   
race.f2      1.03390948585456388  0.35741973343705952  2.89270000000000005 0.0042513 **
race.f4      0.61772635713618673  0.22711822910747051  2.71984000000000004 0.0071181 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.050000000000000003 ‘.’ 0.10000000000000001 ‘ ’ 1

Residual standard error: 0.95215799866456285 on 196 degrees of freedom
Multiple R-squared:  0.1070625554447362515, Adjusted R-squared:  0.09339514557909434078 
F-statistic: 7.833419535758452845 on 3 and 196 DF,  p-value: 0.000057845156841983661

然而,当我 运行 使用 SPSS 进行相同的分析时,我得到了完全不同的 beta 回归系数,这里是代码:

* Create the dummy variables.
RECODE race (1 = 1) (ELSE = 0) INTO race.f1.
RECODE race (2 = 1) (ELSE = 0) INTO race.f2.
RECODE race (3 = 1) (ELSE = 0) INTO race.f3.
RECODE race (4 = 1) (ELSE = 0) INTO race.f4.

EXECUTE.

* Execute the analysis, so that the third category is the reference.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT write
  /METHOD=ENTER race.f1 race.f2 race.f4.

这是我得到的 SPSS 输出:

真正让我感到困惑的是,其他一切都是一样的(模型统计 - R2、调整后的 R2、自由度、F 统计;以及 beta 回归系数的 t 值和 p 值),但标准化的 beta 回归系数甚至都不接近。如果我 运行 没有标准化,则非标准化回归系数和所有其他统计数据在 R 和 SPSS 之间匹配。

有人可以帮忙吗?我错过了什么吗?

编辑 按照aosmith提供的源码(再次感谢),我手工做了虚拟编码,缩放了单独的虚拟机:

hsb2 <- read.table('https://stats.idre.ucla.edu/stat/data/hsb2.csv', header = TRUE, sep = ",")

hsb2$write <- scale(hsb2$write)

hsb2$race.f1 <- scale(hsb2$race == 1)
hsb2$race.f2 <- scale(hsb2$race == 2)
hsb2$race.f3 <- scale(hsb2$race == 3)
hsb2$race.f4 <- scale(hsb2$race == 4)

summary(lm(write ~ race.f1 + race.f2 + race.f4, hsb2))

我得到了与 SPSS 中完全相同的结果:

Call:
lm(formula = write ~ race.f1 + race.f2 + race.f4, data = hsb2)

Residuals:
                Min                  1Q              Median                  3Q                 Max 
-2.4323430057788924 -0.5758594500295402  0.1025905964148444  0.7385067756104029  1.9834181973536520 

Coefficients:
                                        Estimate                           Std. Error              t value  Pr(>|t|)   
(Intercept)  0.000000000000000030665367318040625  0.067327737761672404315227424831392  0.00000000000000000 1.0000000   
race.f1     -0.059860715422078700220787084163021  0.093915042280922900186368451613816 -0.63739000000000001 0.5246133   
race.f2      0.236302452210854940783946176452446  0.081689123308428354675037041943142  2.89270000000000005 0.0042513 **
race.f4      0.276515793804944842726456499804044  0.101666015515960786452787090183847  2.71984000000000004 0.0071181 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.050000000000000003 ‘.’ 0.10000000000000001 ‘ ’ 1

Residual standard error: 0.95215799866456285 on 196 degrees of freedom
Multiple R-squared:  0.1070625554447362238, Adjusted R-squared:  0.09339514557909434078 
F-statistic: 7.833419535758451957 on 3 and 196 DF,  p-value: 0.000057845156841983668

但是,在自定义函数中使用这种方法并不是很方便。我想知道是否有办法仍然使用 contrasts 函数来分配假人。

正如@aosmith 指出的那样,SPSS 默认值是 "interesting"。但是,如果我们有一个 contr.SAS,我们就可以有一个 contr.spss,这似乎很公平。因此,在其他人的及时帮助下,这里是您的一个选择。

我在下面放了一个可复制的 hsb2 片段。您的原始设置和@aosmith 的见解。

# hsb2 <- read.table("hsb2.csv", header = TRUE, sep = ",")
hsb2$write <- scale(hsb2$write)
hsb2$race.f <- factor(hsb2$race, labels = c("Hispanic",
                                            "Asian",
                                            "African-Am",
                                            "Caucasian"))
# Courtesy @aosmith
hsb2$race.f1 <- scale(hsb2$race == 1)
hsb2$race.f2 <- scale(hsb2$race == 2)
hsb2$race.f3 <- scale(hsb2$race == 3)
hsb2$race.f4 <- scale(hsb2$race == 4)

由于我添加了一些错误检查,该函数比严格需要的要长。它只接受 factors 并且你给它因子名称和 base 是什么。

# Many thanks to @akrun
contr.spss <- function (variable, base = 1)
{
   if (is.factor(variable)) {
      column_names <- as.character(sort(unique(as.integer(variable))))
   } else {
      stop("the variable must be a factor to define contrasts")
   }
   if (nlevels(variable) > 2L) {
      n <- nlevels(variable)
      lvls <- levels(variable)
   } else {
      stop("not enough factor levels to define contrasts")
   }
   if (base < 1L | base > n) {
      stop("baseline group number out of range")
   }

   m1 <- matrix(ncol = n, nrow = n, dimnames = list(lvls, column_names))
   for(i in seq_along(lvls)) {
      which_lvl <- unique(variable == lvls[i])
      tmp <- unique(scale(variable == lvls[i]))[,1]
      m1[i,i] <- ifelse(isTRUE(which_lvl[[1]]), tmp[1], tmp[2])
      m1[-i,i] <- ifelse(isFALSE(which_lvl[[1]]), tmp[1], tmp[2])
   }

   m1 <-m1[, -base]
   return(m1)
}

默认r对比

contrasts(hsb2$race.f) # default
#>            Asian African-Am Caucasian
#> Hispanic       0          0         0
#> Asian          1          0         0
#> African-Am     0          1         0
#> Caucasian      0          0         1

使用函数并应用新的对比。

spss.contrasts <- contr.spss(hsb2$race.f, base = 3)
spss.contrasts

# Next two are equivalent
contrasts(hsb2$race.f) <- spss.contrasts
contrasts(hsb2$race.f) <- contr.spss(hsb2$race.f, base = 3)
# All set
contrasts(hsb2$race.f)
#>                     1          2          4
#> Hispanic    2.7012343 -0.2406451 -1.6196240
#> Asian      -0.3683501  4.1347200 -1.6196240
#> African-Am -0.3683501 -0.2406451 -1.6196240
#> Caucasian  -0.3683501 -0.2406451  0.6143401

瞧,同样的结果

summary(lm(write ~ race.f1 + race.f2 + race.f4, hsb2))
#> 
#> Call:
#> lm(formula = write ~ race.f1 + race.f2 + race.f4, data = hsb2)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.4323 -0.5759  0.1026  0.7385  1.9834 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)  3.067e-17  6.733e-02   0.000  1.00000   
#> race.f1     -5.986e-02  9.392e-02  -0.637  0.52461   
#> race.f2      2.363e-01  8.169e-02   2.893  0.00425 **
#> race.f4      2.765e-01  1.017e-01   2.720  0.00712 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9522 on 196 degrees of freedom
#> Multiple R-squared:  0.1071, Adjusted R-squared:  0.0934 
#> F-statistic: 7.833 on 3 and 196 DF,  p-value: 5.785e-05
summary(lm(write ~ race.f, hsb2))
#> 
#> Call:
#> lm(formula = write ~ race.f, data = hsb2)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.4323 -0.5759  0.1026  0.7385  1.9834 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)  3.067e-17  6.733e-02   0.000  1.00000   
#> race.f1     -5.986e-02  9.392e-02  -0.637  0.52461   
#> race.f2      2.363e-01  8.169e-02   2.893  0.00425 **
#> race.f4      2.765e-01  1.017e-01   2.720  0.00712 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9522 on 196 degrees of freedom
#> Multiple R-squared:  0.1071, Adjusted R-squared:  0.0934 
#> F-statistic: 7.833 on 3 and 196 DF,  p-value: 5.785e-05

您的数据被转载...


hsb2 <- structure(list(id = c(70L, 121L, 86L, 141L, 172L, 113L, 50L,
                              11L, 84L, 48L, 75L, 60L, 95L, 104L, 38L, 115L, 76L, 195L, 114L,
                              85L, 167L, 143L, 41L, 20L, 12L, 53L, 154L, 178L, 196L, 29L, 126L,
                              103L, 192L, 150L, 199L, 144L, 200L, 80L, 16L, 153L, 176L, 177L,
                              168L, 40L, 62L, 169L, 49L, 136L, 189L, 7L, 27L, 128L, 21L, 183L,
                              132L, 15L, 67L, 22L, 185L, 9L, 181L, 170L, 134L, 108L, 197L,
                              140L, 171L, 107L, 81L, 18L, 155L, 97L, 68L, 157L, 56L, 5L, 159L,
                              123L, 164L, 14L, 127L, 165L, 174L, 3L, 58L, 146L, 102L, 117L,
                              133L, 94L, 24L, 149L, 82L, 8L, 129L, 173L, 57L, 100L, 1L, 194L,
                              88L, 99L, 47L, 120L, 166L, 65L, 101L, 89L, 54L, 180L, 162L, 4L,
                              131L, 125L, 34L, 106L, 130L, 93L, 163L, 37L, 35L, 87L, 73L, 151L,
                              44L, 152L, 105L, 28L, 91L, 45L, 116L, 33L, 66L, 72L, 77L, 61L,
                              190L, 42L, 2L, 55L, 19L, 90L, 142L, 17L, 122L, 191L, 83L, 182L,
                              6L, 46L, 43L, 96L, 138L, 10L, 71L, 139L, 110L, 148L, 109L, 39L,
                              147L, 74L, 198L, 161L, 112L, 69L, 156L, 111L, 186L, 98L, 119L,
                              13L, 51L, 26L, 36L, 135L, 59L, 78L, 64L, 63L, 79L, 193L, 92L,
                              160L, 32L, 23L, 158L, 25L, 188L, 52L, 124L, 175L, 184L, 30L,
                              179L, 31L, 145L, 187L, 118L, 137L), female = c(0L, 1L, 0L, 0L,
                                                                             0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                                                                             0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                                                                             0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                                                                             0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                                                                             0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                                                                             0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                                                                             1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                                                                             1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                                                                             1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                                                                             1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                                                                             1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                                                                             1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                                                                             1L, 1L, 1L, 1L), race = c(4L, 4L, 4L, 4L, 4L, 4L, 3L, 1L, 4L,
                                                                                                       3L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 1L, 1L,
                                                                                                       3L, 4L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 4L, 4L,
                                                                                                       4L, 4L, 3L, 4L, 4L, 3L, 4L, 4L, 1L, 2L, 4L, 1L, 4L, 4L, 1L, 4L,
                                                                                                       1L, 4L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 4L, 4L, 4L,
                                                                                                       4L, 4L, 1L, 4L, 4L, 4L, 1L, 4L, 4L, 4L, 1L, 4L, 4L, 4L, 4L, 4L,
                                                                                                       4L, 2L, 4L, 4L, 1L, 4L, 4L, 4L, 4L, 1L, 4L, 4L, 4L, 3L, 4L, 4L,
                                                                                                       4L, 4L, 4L, 3L, 4L, 4L, 1L, 4L, 4L, 1L, 4L, 4L, 4L, 4L, 3L, 1L,
                                                                                                       4L, 4L, 4L, 3L, 4L, 4L, 2L, 4L, 3L, 4L, 2L, 4L, 4L, 4L, 4L, 4L,
                                                                                                       3L, 1L, 3L, 1L, 4L, 4L, 1L, 4L, 4L, 4L, 4L, 1L, 3L, 3L, 4L, 4L,
                                                                                                       1L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
                                                                                                       4L, 4L, 1L, 3L, 2L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L,
                                                                                                       2L, 4L, 2L, 4L, 3L, 4L, 4L, 4L, 2L, 4L, 2L, 4L, 4L, 4L, 4L),
                       write = c(52L, 59L, 33L, 44L, 52L, 52L, 59L, 46L, 57L, 55L,
                                 46L, 65L, 60L, 63L, 57L, 49L, 52L, 57L, 65L, 39L, 49L, 63L,
                                 40L, 52L, 44L, 37L, 65L, 57L, 38L, 44L, 31L, 52L, 67L, 41L,
                                 59L, 65L, 54L, 62L, 31L, 31L, 47L, 59L, 54L, 41L, 65L, 59L,
                                 40L, 59L, 59L, 54L, 61L, 33L, 44L, 59L, 62L, 39L, 37L, 39L,
                                 57L, 49L, 46L, 62L, 44L, 33L, 42L, 41L, 54L, 39L, 43L, 33L,
                                 44L, 54L, 67L, 59L, 45L, 40L, 61L, 59L, 36L, 41L, 59L, 49L,
                                 59L, 65L, 41L, 62L, 41L, 49L, 31L, 49L, 62L, 49L, 62L, 44L,
                                 44L, 62L, 65L, 65L, 44L, 63L, 60L, 59L, 46L, 52L, 59L, 54L,
                                 62L, 35L, 54L, 65L, 52L, 50L, 59L, 65L, 61L, 44L, 54L, 67L,
                                 57L, 47L, 54L, 52L, 52L, 46L, 62L, 57L, 41L, 53L, 49L, 35L,
                                 59L, 65L, 62L, 54L, 59L, 63L, 59L, 52L, 41L, 49L, 46L, 54L,
                                 42L, 57L, 59L, 52L, 62L, 52L, 41L, 55L, 37L, 54L, 57L, 54L,
                                 62L, 59L, 55L, 57L, 39L, 67L, 62L, 50L, 61L, 62L, 59L, 44L,
                                 59L, 54L, 62L, 60L, 57L, 46L, 36L, 59L, 49L, 60L, 67L, 54L,
                                 52L, 65L, 62L, 49L, 67L, 65L, 67L, 65L, 54L, 44L, 62L, 46L,
                                 54L, 57L, 52L, 59L, 65L, 59L, 46L, 41L, 62L, 65L)), class = "data.frame", row.names = c(NA,
                                                                                                                         -200L))