Why are my variables linearly dependent? Regression, diff-in-diff, interaction terms and dummies

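I created a small data frame to test difference-in-differences, to get some intuition for the method and the theory. I think I have two questions.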

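  1. Why is the correlation between free_cookies and free_cookies*teenager equal to 1?
  2. Is there a way to fix the data so that the regression lm(cookies_eaten ~ teenager + free_cookies + teenager*free_cookies, data) does not drop the interaction term (free_cookies*teenager)?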

It should be possible to run a regression of the form

outcome ~ dummy1 + dummy2 + dummy1*dummy2

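and get coefficient estimates for all the independent variables; I have seen this done elsewhere. To be clear: teenager and free_cookies are dummy variables. I am guessing I did something silly when constructing the example data.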

# cookie eating data
data <- read.table(text = "

year    cookies_eaten   teenager    free_cookies
2000    110 1   0
2001    110 1   0
2002    120 1   0
2003    120 1   0
2004    125 1   0
2005    125 1   0
2006    125 1   0
2007    145 1   1
2008    155 1   1
2009    160 1   1
2010    160 1   1
2000    100 0   0
2001    100 0   0
2002    110 0   0
2003    110 0   0
2004    115 0   0
2005    115 0   0
2006    115 0   0
2007    115 0   0
2008    115 0   0
2009    120 0   0
2010    120 0   0", header=TRUE)


# Regressions
one <- lm(cookies_eaten ~ teenager, data)
summary(one)

two <- lm(cookies_eaten ~ teenager + free_cookies, data)
summary(two)

three <- lm(cookies_eaten ~ teenager + free_cookies + teenager*free_cookies, data)
summary(three) # Coefficients: (1 not defined because of singularities)

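# four: written without free_cookies, but teenager*free_cookies expands to
# teenager + free_cookies + teenager:free_cookies, so this is the same model as three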
four <- lm(cookies_eaten ~ teenager + teenager*free_cookies, data)
summary(four) # Coefficients: (1 not defined because of singularities)

# Correlation testing (with() avoids attach()/detach(); Pearson is the default method)
with(data, cor(free_cookies, free_cookies * teenager, method = "pearson"))
# = 1
with(data, cor(cookies_eaten, free_cookies * teenager, method = "pearson"))
# = 0.9090648

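Looking at the data, it is easy to see that whenever teenager == 0, free_cookies == 0 as well, so the two dummies are perfectly aligned. When teenager == 1, every value of free_cookies is multiplied by 1 and is therefore unchanged; when teenager == 0, free_cookies is already 0, so multiplying by 0 changes nothing either. That is why teenager * free_cookies is always identical to free_cookies, and hence the correlation is 1. With these data you cannot investigate the interaction. You need to sample some data at teenager == 0 and free_cookies == 1.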
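The empty cell can also be checked directly. The following is a minimal sketch, not part of the original question: it rebuilds the same 22 rows compactly with data.frame(), and the names xtab and X are introduced here only for illustration. It cross-tabulates the two dummies and compares the number of columns in the design matrix with its rank.

```r
# Same 22 observations as in the question, built compactly
data <- data.frame(
  year          = rep(2000:2010, times = 2),
  cookies_eaten = c(110, 110, 120, 120, 125, 125, 125, 145, 155, 160, 160,
                    100, 100, 110, 110, 115, 115, 115, 115, 115, 120, 120),
  teenager      = rep(c(1, 0), each = 11),
  free_cookies  = c(rep(0, 7), rep(1, 4), rep(0, 11))
)

# Cross-tabulate the two dummies: the teenager == 0, free_cookies == 1 cell is empty
xtab <- table(teenager = data$teenager, free_cookies = data$free_cookies)
print(xtab)

# Design matrix lm() would use for teenager * free_cookies
X <- model.matrix(~ teenager * free_cookies, data = data)

# Four columns but one fewer independent direction, so one coefficient cannot be estimated
ncol(X)
qr(X)$rank
```

If the rank is smaller than the number of columns, lm() reports the surplus coefficient as NA, which matches the "1 not defined because of singularities" note in the summaries.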
data <- read.table(text = "
year    cookies_eaten   teenager    free_cookies
2000    110 1   0
2001    110 1   0
2002    120 1   0
2003    120 1   0
2004    125 1   0
2005    125 1   0
2006    125 1   0
2007    145 1   1
2008    155 1   1
2009    160 1   1
2010    160 1   1
2000    100 0   0
2001    100 0   0
2002    110 0   0
2003    110 0   0
2004    115 0   0
2005    115 0   0
2006    115 0   0
2007    115 0   0
2008    115 0   0
2009    120 0   0
2010    120 0   0", header=TRUE)

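# Build the interaction column by hand
data$interaction <- data$teenager * data$free_cookies

# Compare the two columns side by side
print(data[, c("free_cookies", "interaction")])

# FALSE: the interaction column never differs from free_cookies
any(data$free_cookies != data$interaction)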
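If the aim is a difference-in-differences setup, one possible way to get all four cells without inventing new observations is to code the second dummy as a period indicator that applies to both groups. The sketch below rests on an assumption of mine, read off the data rather than stated in the question: that the free-cookie policy starts in 2007. The column name post and the object did are hypothetical names introduced for this illustration.

```r
# Same 22 observations as in the question, built compactly
data <- data.frame(
  year          = rep(2000:2010, times = 2),
  cookies_eaten = c(110, 110, 120, 120, 125, 125, 125, 145, 155, 160, 160,
                    100, 100, 110, 110, 115, 115, 115, 115, 115, 120, 120),
  teenager      = rep(c(1, 0), each = 11),
  free_cookies  = c(rep(0, 7), rep(1, 4), rep(0, 11))
)

# ASSUMPTION: the policy starts in 2007, so post = 1 for BOTH groups from 2007 on
data$post <- as.integer(data$year >= 2007)

# All four teenager x post cells are now populated
table(teenager = data$teenager, post = data$post)

# With this coding, free_cookies equals teenager * post
all(data$free_cookies == data$teenager * data$post)

# Diff-in-diff regression: teenager * post expands to teenager + post + teenager:post
did <- lm(cookies_eaten ~ teenager * post, data = data)
coef(did)
```

Under this coding none of the coefficients is dropped, and the teenager:post coefficient is the difference between the teenagers' before/after change in mean cookies_eaten and the non-teenagers' change, which works out to 27.5 for these numbers.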