How to create dummy variable with range

Question

I am trying to create price ranges and fit an lm model on the price range dummy variable. So I did:

> #price range 
> airbnblisting$PriceRange[price <= 500] <- 0 
> airbnblisting$PriceRange[price > 500 & price <= 1000] <- 1
> airbnblisting$PriceRange[price > 1000] <- 2

Then I ran:

> r1 <- lm(review_scores_rating ~ PriceRange, data=airbnblisting,)
> summary(r1)

But the result shows NA for PriceRange. Any idea how I can get PriceRange to work?

    Min      1Q  Median      3Q     Max 
-4.7619 -0.0319  0.1281  0.2381  0.2381 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.761914   0.003115    1529   <2e-16 ***
PriceRange        NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Sample of the price column:

Answer 1

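The dollar sign $ in the prices indicates that the column holds character strings rather than numbers. You need to clean the data first.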

Currently you are doing

dat$PriceRange[dat$price <= 500] <- 0 
dat$PriceRange[dat$price > 500 & dat$price <= 1000] <- 1
dat$PriceRange[dat$price > 1000] <- 2

which yields all zeros:

dat$PriceRange
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Hence:

lm(review ~ PriceRange, dat)$coe
# (Intercept)  PriceRange 
#   2.538462          NA
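
The all-zero result comes from how R compares a character vector with a number: the number is coerced to a string, and the comparison is then done in string (collation) order. A constant regressor is in turn indistinguishable from the intercept, which is why lm reports NA. A minimal sketch, using made-up price strings rather than the question's data:

```r
# Hypothetical price strings, as they might look when read from a CSV
price_chr <- c("$62.00", "$750.00", "$1,258.00")

is.character(price_chr)   # the column is text, not numeric

# Comparing text with a number coerces the number to "500" / "1000"
# and compares strings, so the result depends on character order,
# not on the numeric value of the price.
price_chr <= 500
price_chr > 1000

# A regressor that is constant carries no information beyond the
# intercept, so lm() reports its coefficient as NA.
d <- data.frame(y = c(4, 1, 3), x = c(0, 0, 0))
coef(lm(y ~ x, data = d))
```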

Now we clean price using gsub, removing the $ (which needs to be escaped) | (or) the , thousands separator.

dat <- transform(dat, price=as.numeric(gsub('\\$|,', '', price)))
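
To see what the pattern does in isolation, here is a small sketch on made-up price strings (illustrative values, not the question's data). In an R string literal the backslash has to be doubled, so the regex \$ is written '\\$'.

```r
# Hypothetical raw values; the pattern "\\$|,"
# matches a literal dollar sign or a comma
raw <- c("$62.00", "$750.00", "$1,258.00")

clean <- as.numeric(gsub("\\$|,", "", raw))
clean
# [1]   62  750 1258

# as.numeric() returns NA (with a warning) for anything that still
# contains non-numeric characters, so a quick check is worthwhile
stopifnot(is.numeric(clean), !anyNA(clean))
```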

Now price is correctly recognized as numeric:

dat$PriceRange[dat$price <= 500] <- 0 
dat$PriceRange[dat$price > 500 & dat$price <= 1000] <- 1
dat$PriceRange[dat$price > 1000] <- 2

dat$PriceRange
# [1] 0 0 2 0 1 2 0 0 2 0 0 0 2 0

and lm should work:

lm(review ~ PriceRange, dat)$coe
# (Intercept)  PriceRange 
#   2.5350318  -0.1656051

You can create the dummy variable more easily with cut (assuming the data are already clean).

dat <- transform(dat,
                 PriceRange=as.numeric(cut(price, c(0, 500, 1000, Inf), 
                                           labels=0:2)))
lm(review ~ PriceRange, dat)$coe
# (Intercept)  PriceRange 
#   2.7006369  -0.1656051
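
The intercept differs from the earlier fit because as.numeric() applied to a factor returns the internal level codes 1, 2, 3 rather than the labels 0, 1, 2; the slope is unaffected by this shift. If the 0/1/2 coding is wanted, one option is to convert via the labels, sketched here on made-up prices:

```r
price <- c(62, 750, 1258, 500, 1000)   # hypothetical, already numeric

# Default intervals are right-closed: (0,500], (500,1000], (1000,Inf]
bins <- cut(price, c(0, 500, 1000, Inf), labels = 0:2)

as.numeric(bins)                 # level codes
# [1] 1 2 3 1 2
as.numeric(as.character(bins))   # the labels themselves
# [1] 0 1 2 0 1
```

With these breaks a price of exactly 0 falls outside the first interval and becomes NA unless include.lowest = TRUE is passed to cut.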

Note that you are trying to encode a categorical variable as a continuous one, which can be statistically problematic!
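
One way to respect the categorical nature of the ranges is to keep the cut result as a factor, so that lm estimates a separate coefficient for each range relative to a baseline. A minimal sketch with invented data (the values and the labels "low", "mid", "high" are illustrative assumptions only):

```r
# Hypothetical cleaned data
d <- data.frame(
  review = c(4, 4, 1, 3, 2, 2, 3, 5, 2, 3),
  price  = c(62, 99, 1400, 54, 790, 1900, 89, 120, 650, 300)
)

# Keep the ranges as a factor instead of converting to numbers
d$PriceRange <- cut(d$price, c(0, 500, 1000, Inf),
                    labels = c("low", "mid", "high"))

# model.matrix() shows the 0/1 dummy columns lm() builds internally
head(model.matrix(~ PriceRange, data = d))

fit <- lm(review ~ PriceRange, data = d)
coef(fit)   # intercept = mean of "low"; others = differences from "low"
```

Coded this way, the model does not assume that moving from the first range to the second has the same effect as moving from the second to the third.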

Data:

dat <- structure(list(review = c(4L, 4L, 1L, 3L, 2L, 2L, 3L, 0L, 2L, 
3L, 2L, 3L, 4L, 1L), price = c("2.00", "9.00", "40.00", 
"4.00", "9.00", "90.00", "9.00", ".10", "00.00", 
"0.00", "3.00", ".00", ",258.00", "0.00")), class = "data.frame", row.names = c(NA, 
-14L))
