Logistic regression: the odds ratio is essentially just the ratio of rates, so what's the point?

I am trying to understand when logistic regression is useful. I have the following data:

    Gender  Age    No.transcation  Transaction
    female  18-24          138485         4047
    male    18-24          144301         3766
    female  25-34          248362         7559
    male    25-34          295800         8126
    female  35-44          265514         7171
    male    35-44          379872         9047
    female  45-54          295002         7072
    male    45-54          421432         9648
    female  55-64          382198         7529
    male    55-64          456308         9016
    female  65+            352501         4856
    male    65+            465253         6889

Running a logistic regression in R, I get the following summary output:

    > mod2 <- glm(cbind(Transaction, No.transcation) ~ Gender + Age, data = csvd,
                  family = binomial())
    > summary(mod2)

    Call:
    glm(formula = cbind(Transaction, No.transcation) ~ Gender + Age, 
        family = binomial(), data = csvd)

    Deviance Residuals: 
          1        2        3        4        5        6  
     1.8732  -1.9018   2.2654  -2.1473   3.4810  -3.0228  
             7        8        9       10       11    12  
     -0.2772   0.2377  -2.5500   2.3717  -4.9638   4.3408  

    Coefficients:
                 Estimate Std. Error  z value Pr(>|z|)    
    (Intercept) -3.562800   0.011984 -297.290  < 2e-16 ***
    Gendermale  -0.051852   0.006993   -7.415 1.22e-13 ***
    Age25-34     0.044091   0.014042    3.140  0.00169 ** 
    Age35-44    -0.090757   0.013966   -6.499 8.11e-11 ***
    Age45-54    -0.164705   0.013894  -11.855  < 2e-16 ***
    Age55-64    -0.334841   0.013900  -24.088  < 2e-16 ***
    Age65+      -0.651142   0.014767  -44.094  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 4490.792  on 11  degrees of freedom
    Residual deviance:   93.866  on  5  degrees of freedom
    AIC: 235.5

    Number of Fisher Scoring iterations: 3

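Exponentiating the coefficients gives the odds ratios, and I find that they are almost identical to the ratios of the transaction rates: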

    > exp(summary(mod2)$coefficients)
                  Estimate Std. Error       z value Pr(>|z|)
    (Intercept) 0.02835931   1.012056 7.735499e-130 1.000000
    Gendermale  0.94946976   1.007018  6.022806e-04 1.000000
    Age25-34    1.04507762   1.014141  2.310243e+01 1.001691
    Age35-44    0.91323954   1.014064  1.505641e-03 1.000000
    Age45-54    0.84814413   1.013991  7.106341e-06 1.000000
    Age55-64    0.71545181   1.013998  3.455562e-11 1.000000
    Age65+      0.52145005   1.014877  7.084264e-20 1.000000
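
A side note on the table above: only the `Estimate` column has a meaningful interpretation after exponentiation. The exponentiated standard errors, z values and p-values are not interpretable quantities. A minimal sketch of one common alternative, Wald intervals computed on the log-odds scale and then exponentiated, is shown here. The numbers are copied from the `summary(mod2)` output for three of the coefficients; the names `est`, `se` and `or_table` are illustrative and not from the original post.

```r
# Estimates and standard errors copied from the summary(mod2) output above
# (three coefficients only, to keep the sketch short).
est <- c("Gendermale" = -0.051852, "Age25-34" = 0.044091, "Age65+" = -0.651142)
se  <- c("Gendermale" =  0.006993, "Age25-34" = 0.014042, "Age65+" =  0.014767)

# Normal quantile for a 95% Wald interval (about 1.96).
z <- qnorm(0.975)

# Build the interval on the log-odds scale, then exponentiate the endpoints.
or_table <- cbind(
  odds_ratio = exp(est),
  lower_95   = exp(est - z * se),
  upper_95   = exp(est + z * se)
)
print(round(or_table, 4))
```

With the fitted model object available, `exp(coef(mod2))` gives the same odds ratios directly.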

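Comparing the odds ratios with the relative transaction rates (transacting users divided by the total number of users in each group, expressed relative to the baseline groups, female and 18-24), I get almost the same numbers: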

    female  (baseline)
    male    94.68%

    18-24   (baseline)
    25-34   104.21%
    35-44   91.17%
    45-54   84.82%
    55-64   71.97%
    65+     52.66%
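
The comparison can be reproduced with a short sketch like the one below. It rebuilds the table from the question, computes the marginal rate ratios, and refits the model. Two things are worth keeping in mind when reading the output. First, the transaction rates are small (roughly 1.4% to 2.8%), so the odds p/(1-p) are close to p and an odds ratio is close to a rate ratio. Second, the marginal ratios ignore the other variable, whereas the model coefficients adjust for it. The helper `marginal_rate` and the variable names other than `csvd` and `mod2` are illustrative, not from the original post.

```r
# Rebuild the table from the question (column names copied as posted).
csvd <- data.frame(
  Gender = rep(c("female", "male"), times = 6),
  Age = rep(c("18-24", "25-34", "35-44", "45-54", "55-64", "65+"), each = 2),
  No.transcation = c(138485, 144301, 248362, 295800, 265514, 379872,
                     295002, 421432, 382198, 456308, 352501, 465253),
  Transaction = c(4047, 3766, 7559, 8126, 7171, 9047,
                  7072, 9648, 7529, 9016, 4856, 6889)
)

# Marginal transaction rate per group: transactions / all users,
# pooling over the other variable.
marginal_rate <- function(group) {
  trans <- sapply(split(csvd$Transaction, group), sum)
  total <- sapply(split(csvd$Transaction + csvd$No.transcation, group), sum)
  trans / total
}
gender_rate <- marginal_rate(csvd$Gender)
age_rate <- marginal_rate(csvd$Age)

# Rate ratios relative to the baseline groups (female, 18-24).
rate_ratio_gender <- gender_rate / gender_rate[["female"]]
rate_ratio_age <- age_rate / age_rate[["18-24"]]

# Adjusted odds ratios from the additive logistic model in the question.
mod2 <- glm(cbind(Transaction, No.transcation) ~ Gender + Age,
            data = csvd, family = binomial())
odds_ratio <- exp(coef(mod2))

print(round(rate_ratio_gender, 4))
print(round(rate_ratio_age, 4))
print(round(odds_ratio, 4))
```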

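So what is the point of running a logistic regression? This dataset has only 2 features, but it could be extended to 50. In that case, what does LR offer over simply looking at the rate in each group? Is it because all the variables are nominal that it adds so little?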

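You would expect the estimated odds ratios to be close to the observed ratios, as they are here. You are estimating the probability Pr(Y=1|X=x): the probability of a transaction given age and gender. With categorical predictors like these, an intuitive estimate is simply the proportion of outcomes in the data. Logistic regression becomes more interesting when the predictors are continuous and you want to predict the probability of the outcome at predictor values you have not observed. In those cases, LR lets you map an unbounded linear function of the predictors to a probability, which by definition must lie between 0 and 1.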
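
The mapping from the linear predictor to a probability can be illustrated with a small sketch. It uses the coefficients reported in the question and base R's `plogis()`, the inverse logit. The names `b`, `eta`, `p_male_65`, `eta_grid` and `p_grid` are illustrative, and the grid values are arbitrary.

```r
# Coefficients copied from the summary(mod2) output in the question.
b <- c(intercept = -3.562800, male = -0.051852, age65 = -0.651142)

# Linear predictor for a male user aged 65+ (log-odds scale, unbounded).
eta <- b[["intercept"]] + b[["male"]] + b[["age65"]]

# plogis() is the inverse logit, 1 / (1 + exp(-eta)), always inside (0, 1).
p_male_65 <- plogis(eta)
print(p_male_65)

# Any real-valued linear predictor maps to a valid probability.
eta_grid <- c(-10, -2, 0, 2, 10)
p_grid <- plogis(eta_grid)
print(round(p_grid, 5))
```

The fitted value for this cell (about 1.4%) is close to, but not identical to, the observed proportion 6889 / (6889 + 465253), because the additive model does not include a Gender by Age interaction.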