R glm.fit does not return probabilities?
First post here, and I'm new to R, so apologies if I haven't got this post quite right :).
I am trying to fit a model with glm() and then call predict on the fitted model.
fit_GLM <- glm(y ~., data = traintemp, family = "binomial")
pred_GLM <- predict(fit_GLM, newdata = testtemp)
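My training data has about 430,000 observations, with 6 predictors and a binary outcome. I have tried coding the outcome both as 0/1 and as FALSE/TRUE.
My test data has about 215,000 observations.
The model fits without errors, but the values returned by predict look odd (to me). I was expecting probabilities, but the function returns: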
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-0.0433000 -0.0006504  0.0004760  0.0103800  0.0024810  1.0020000
Am I missing something obvious?
Also, if I run lm() instead, the results are very similar but it runs much faster. What is going on there?
Edit: a sample of my data:
  TripType VisitNumber Weekday         Upc ScanCount DepartmentDescription FinelineNumber
1        0           7  Friday 60538815980         1                 SHOES           8931
2        0           7  Friday  7410811099         1         PERSONAL CARE           4504
3        0           8  Friday  2006613744         2 PAINT AND ACCESSORIES           1017
4        0           8  Friday  2006618783         2 PAINT AND ACCESSORIES           1017
5        0           8  Friday  7004802737         1 PAINT AND ACCESSORIES           2802
6        0           8  Friday  2238495318         1 PAINT AND ACCESSORIES           4501
Thank you, and happy Thanksgiving!
Edit 2: train:
   TripType Weekday         Upc ScanCount    DepartmentDescription FinelineNumber
 1        0  Friday 60538815980         1                    SHOES           8931
 2        0  Friday  7410811099         1            PERSONAL CARE           4504
 3        0  Friday  2006613744         2    PAINT AND ACCESSORIES           1017
 4        0  Friday  2006618783         2    PAINT AND ACCESSORIES           1017
 5        0  Friday  7004802737         1    PAINT AND ACCESSORIES           2802
 6        0  Friday  2238495318         1    PAINT AND ACCESSORIES           4501
 7        0  Friday  5200010239         1              DSD GROCERY           4606
 8        0  Friday 88679300501         2    PAINT AND ACCESSORIES           3504
 9        0  Friday  2238400200         2    PAINT AND ACCESSORIES           3565
10        0  Friday 72450408840         1    PAINT AND ACCESSORIES           1028
11        0  Friday 25541500000         2                    DAIRY           1305
12        0  Friday 72450403700         2    PAINT AND ACCESSORIES           1018
13        0  Friday  7874204967         1 HOUSEHOLD CHEMICALS/SUPP            707
14        0  Friday  3270011053         3        PETS AND SUPPLIES           1001
15        0  Friday  1070080727         1      IMPULSE MERCHANDISE            115
16        0  Friday        3107         1                  PRODUCE            103
17        0  Friday        4011         1                  PRODUCE           5501
18        0  Friday  6414410235         1              DSD GROCERY           2008
19        0  Friday  4178900743         1        GROCERY DRY GOODS           3114
20        0  Friday  7800002374         1              DSD GROCERY           3467
Test:
   TripType Weekday         Upc ScanCount    DepartmentDescription FinelineNumber
 1        0  Friday 68113152929        -1       FINANCIAL SERVICES           1000
 2        0  Friday  2238403510         2    PAINT AND ACCESSORIES           3565
 3        0  Friday  2006613743         1    PAINT AND ACCESSORIES           1017
 4        0  Friday  2238400200        -1    PAINT AND ACCESSORIES           3565
 5        0  Friday 22006000000         1    MEAT - FRESH & FROZEN           6009
 6        0  Friday  2236760452         1    PAINT AND ACCESSORIES              7
 7        0  Friday 88679300501        -1    PAINT AND ACCESSORIES           3504
 8        0  Friday  3019294203         1    PAINT AND ACCESSORIES           2801
 9        0  Friday  2310010776         1        PETS AND SUPPLIES           3300
10        0  Friday  5114139038         1    PAINT AND ACCESSORIES           4415
11        0  Friday  5114197561         1    PAINT AND ACCESSORIES           4415
12        0  Friday  2800053970         1  CANDY, TOBACCO, COOKIES            115
13        0  Friday  7794800902         1              DSD GROCERY           7950
14        0  Friday  7920018317         1      IMPULSE MERCHANDISE            110
15        0  Friday  3500076633         1            PERSONAL CARE            203
16        0  Friday  5460010568         1 HOUSEHOLD CHEMICALS/SUPP             52
17        0  Friday  2899521479         1       FABRICS AND CRAFTS           1059
18        0  Friday  2899521979         1       FABRICS AND CRAFTS           1062
19        0  Friday  1200004300         1              DSD GROCERY           9501
20        0  Friday 88743955560         1                MENS WEAR            144
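From ?predict.glm, on the type argument:
the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities. The "terms" option returns a matrix giving the fitted values of each term in the model formula on the linear predictor scale.
So in your case: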
pred_GLM <- predict(fit_GLM, newdata = testtemp, type = "response")
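To see the difference between the two scales, here is a small sketch on simulated data. The data and the names dat, train, test, fit, eta and p are made up for illustration and have nothing to do with your dataset. It fits a binomial glm, predicts on both scales, and checks that the probabilities are the inverse logit (plogis) of the default link-scale predictions.

```r
# Simulated example; all object names and coefficients here are hypothetical.
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
dat <- data.frame(x = x, y = y)
train <- dat[1:700, ]
test <- dat[701:1000, ]

fit <- glm(y ~ x, data = train, family = binomial)

# Default is type = "link": log-odds, which are not restricted to [0, 1]
eta <- predict(fit, newdata = test)

# type = "response": predicted probabilities
p <- predict(fit, newdata = test, type = "response")

# The two scales are related by the inverse logit
same <- isTRUE(all.equal(unname(plogis(eta)), unname(p)))
print(same)
print(range(eta))
print(range(p))
```

If you already have link-scale predictions stored, applying plogis() to them should give the same probabilities without refitting or re-predicting.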