Why are my variables linearly dependent? Regression, diff-in-diff, interaction terms and dummies
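I created a small data frame to test difference-in-differences, in order to build some intuition for the method and the theory. I think I have two questions.
- Why is the correlation between free_cookies and free_cookies*teenager equal to 1?
- Is there a way to fix the data so that the regression lm(cookies_eaten ~ teenager + free_cookies + teenager*free_cookies, data) does not drop the interaction term (free_cookies*teenager)?
It should be possible to run a regression of the form
outcome ~ dummy1 + dummy2 + dummy1*dummy2
and get coefficient estimates for all of the independent variables; I have seen this done elsewhere. To be clear: teenager and free_cookies are dummy variables. I guess I did something silly when constructing the example data.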
# cookie eating data
data <- read.table(text = "
year cookies_eaten teenager free_cookies
2000 110 1 0
2001 110 1 0
2002 120 1 0
2003 120 1 0
2004 125 1 0
2005 125 1 0
2006 125 1 0
2007 145 1 1
2008 155 1 1
2009 160 1 1
2010 160 1 1
2000 100 0 0
2001 100 0 0
2002 110 0 0
2003 110 0 0
2004 115 0 0
2005 115 0 0
2006 115 0 0
2007 115 0 0
2008 115 0 0
2009 120 0 0
2010 120 0 0", header=TRUE)
# Regressions
one <- lm(cookies_eaten ~ teenager, data)
summary(one)
two <- lm(cookies_eaten ~ teenager + free_cookies, data)
summary(two)
three <- lm(cookies_eaten ~ teenager + free_cookies + teenager*free_cookies, data)
summary(three) # Coefficients: (1 not defined because of singularities)
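# four: free_cookies is not listed on its own, but `*` expands to both main
# effects plus the interaction, so this is the same model as `three`
four <- lm(cookies_eaten ~ teenager + teenager*free_cookies, data)
summary(four) # Coefficients: (1 not defined because of singularities)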
# Correlation testing
with(data, cor(free_cookies, free_cookies * teenager, method = "pearson"))
# = 1
with(data, cor(cookies_eaten, free_cookies * teenager, method = "pearson"))
# = 0.9090648
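Looking at the data, it is easy to see that whenever teenager == 0, free_cookies == 0 as well, so the two dummies are perfectly aligned. When teenager == 1, every value of free_cookies is multiplied by 1 and is left unchanged; when teenager == 0, free_cookies is already 0, so multiplying by 0 changes nothing either. That is why free_cookies and teenager * free_cookies always have the same value, and therefore the correlation is 1. With these data you cannot investigate the interaction. You need to sample some data where teenager == 0 and free_cookies == 1.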
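One way to see the problem before fitting anything is to cross-tabulate the two dummies and check the rank of the design matrix. The following is only a sketch: it rebuilds the question's data in a compact form, and the names cell_counts, X, n_columns and design_rank are introduced here for illustration.

```r
# Same values as the data frame in the question, built compactly.
data <- data.frame(
  year = rep(2000:2010, times = 2),
  cookies_eaten = c(110, 110, 120, 120, 125, 125, 125, 145, 155, 160, 160,
                    100, 100, 110, 110, 115, 115, 115, 115, 115, 120, 120),
  teenager = rep(c(1, 0), each = 11),
  free_cookies = c(rep(0, 7), rep(1, 4), rep(0, 11))
)

# Cross-tabulate the dummies: the cell teenager = 0, free_cookies = 1 is empty.
cell_counts <- table(teenager = data$teenager, free_cookies = data$free_cookies)
print(cell_counts)

# Design matrix of the full model: intercept, teenager, free_cookies
# and teenager:free_cookies.
X <- model.matrix(~ teenager * free_cookies, data)
n_columns <- ncol(X)
design_rank <- qr(X)$rank

# Fewer independent columns than columns means lm() has to drop a term.
cat("columns:", n_columns, " rank:", design_rank, "\n")
```

An empty cell in the cross-tabulation means one of the four group means is unobserved, so only three of the four coefficients can be identified.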
data <- read.table(text = "
year cookies_eaten teenager free_cookies
2000 110 1 0
2001 110 1 0
2002 120 1 0
2003 120 1 0
2004 125 1 0
2005 125 1 0
2006 125 1 0
2007 145 1 1
2008 155 1 1
2009 160 1 1
2010 160 1 1
2000 100 0 0
2001 100 0 0
2002 110 0 0
2003 110 0 0
2004 115 0 0
2005 115 0 0
2006 115 0 0
2007 115 0 0
2008 115 0 0
2009 120 0 0
2010 120 0 0", header=TRUE)
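# Build the interaction column explicitly and compare it with free_cookies
data$interaction <- data$teenager * data$free_cookies
print(data[, c("free_cookies", "interaction")])
# FALSE: the two columns are identical in every row
any(data$free_cookies != data$interaction)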
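If the intent is a standard difference-in-differences setup, one possible fix that needs no new observations is to let the second dummy mark the period rather than the treated observations. The sketch below assumes that the free cookies start in 2007 (the first year with free_cookies == 1 in the data) and introduces a hypothetical column post that is 1 from 2007 onwards for both groups; this is an interpretation of the example, not something the original post states.

```r
# Same values as the data frame in the question, built compactly.
data <- data.frame(
  year = rep(2000:2010, times = 2),
  cookies_eaten = c(110, 110, 120, 120, 125, 125, 125, 145, 155, 160, 160,
                    100, 100, 110, 110, 115, 115, 115, 115, 115, 120, 120),
  teenager = rep(c(1, 0), each = 11)
)

# Assumption: the policy starts in 2007, so `post` is 1 for BOTH groups
# from 2007 onwards. All four teenager x post cells are now populated.
data$post <- as.integer(data$year >= 2007)

# teenager * post expands to teenager + post + teenager:post;
# the teenager:post coefficient is the diff-in-diff estimate.
did <- lm(cookies_eaten ~ teenager * post, data = data)
print(coef(did))

# Cross-check: the interaction equals the difference between the two
# before/after changes in the group means.
group_means <- with(data, tapply(cookies_eaten, list(teenager, post), mean))
did_by_hand <- (group_means["1", "1"] - group_means["1", "0"]) -
  (group_means["0", "1"] - group_means["0", "0"])
print(did_by_hand)
```

With this coding, teenager:post plays the role that free_cookies * teenager was meant to play, and no coefficient is dropped because the post column differs from the interaction column for the non-teenager rows.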