基于 lm 规范在数据集中自动创建变量

Question

我有一个如下所示的数据集：

outcome <- c(1,2, 2, 1, 2, 2, 2, 1)
city <- rep(c("A", "B"),4)
year <- c(2000, 2001, 2000, 2000, 2000, 2001, 2001, 2000)
educ <- c("high", "low", "low", "low", "high", "high", "high", "low")
age <- c(25,35, 30, 29, 31, 40, 20, 23)
area <- c("city", "sub", "sub", "sub", "city", "city", "city", "city")
peopleinhouse <- c(2,4, 3, 3, 5, 2, 1, 2)
adataset <- data.frame(outcome, city, year, educ, age, area, peopleinhouse, stringsAsFactors = FALSE)

我有一个向量如下：

avector <- c("outcome", "as.factor(city)", "year", "as.factor(educ):age", "log(age)", "as.factor(area):peopleinhouse")

我想自动化创建新（交互）变量的过程，将 avector 和 adataset 作为输入，将 anotherdataset 作为输出。

我是这样开始的：

avector <- unlist(strsplit(avector , split=':', fixed=TRUE))

for (i in 1:length(avector)) {
  if (substr(avector[i], start = 1, stop = 9) == "as.factor") {
    adataset[,i] <- as.factor(adataset[,i])
  } else if(substr(avector[i], start = 1, stop = 3) == "log") {
      adataset[,i] <- log(adataset[,i])
  } else if (substr(avector [i], start = 1, stop = 1) == "I") {
      # Assumes a quadratic
      adataset[,i] <- adataset[,i]*adataset[,i]
  }
}

（这应该适用于单个变量（适用于我的实际数据），但我在示例中收到错误 Error in Math.factor(adataset[, i]) : ‘log’ not meaningful for factors）

但我不确定如何自动处理交互项。

我需要交互项是一个变量，因为 ivmodel 来自库 ivmodel。

所需的输出将是一个以交互项作为变量的数据集：

as.factor(educ):age 和 as.factor(area):peopleinhouse 作为一个变量。

有什么想法吗？

编辑：

我目前正朝着以下方向前进：

# For every entry in the vector
for (i in 1:length(avector)) {
# if there is a ":" in the entry
    if (grepl(":", avector[i], fixed = FALSE)) {
# split the string into two string
        tmp <- unlist(strsplit(avector[i], split=':', fixed=FALSE))
# make a new variable
        adataset[,(length(avector)+i)] <- adataset[,tmp[1]]*adataset[,tmp[2]]
# Change the name of the new variable
        names(adataset[,(length(avector)+i)]) <- paste0(names(tmp[1]), names(tmp[2]))
    }
}

但是我无法正确理解语法..

编辑二：

@Edo 在评论中指出 ivmodelFormula 也采用公式。这消除了我的部分问题。但是 ?ivmodelFormula 表明对于多个工具，这些工具仍然由 (a) vector(s) 提供：

# Multiple instruments
Z = card.data[,c("nearc4","nearc2")]
card.model2IV = ivmodelFormula(lwage ~ educ + exper + expersq + black + 
                                south + smsa + reg661 + 
                                reg662 + reg663 + reg664 + 
                                reg665 + reg666 + reg667 + 
                                reg668 + smsa66 | nearc4 + nearc2 +
                                exper + expersq + black + 
                                south + smsa + reg661 + 
                                reg662 + reg663 + reg664 + 
                                reg665 + reg666 + reg667 + 
                                reg668 + smsa66,data=card.data)
card.model2IV

例如，假设我想运行:

require(ivmodel)
Z <- adataset[, c("area", "educ", "peopleinhouse") ]
card.model2IV = ivmodelFormula(outcome ~ log(age) + as.factor(city) + year | area + educ + as.factor(city) + year)
card.model2IV

有效..

但是如果我想运行 area:educ 作为一种交互怎么办，比如：

require(ivmodel)
Z <- adataset[, c("area", "educ", "peopleinhouse")]
another_card.model2IV = ivmodelFormula(outcome ~ log(age) + as.factor(city) + year | area:educ+ as.factor(city) + year)
another_card.model2IV

它运行s，但它给我的输出与 card.model2IV 相同，所以我仍然需要与数据中的仪器进行交互。

Answer 1

我不确定这是否有帮助，但在这种情况下，以自动方式“构建”/“编写”线性模型的公式可能更容易，而不是循环遍历您的列来执行所需的数据转换步骤。

您可以在 paste 函数中使用 collapse 参数将变量向量重写为加法模型
确保使用波浪号 (~) 来识别因变量
使用有用的 as.formula 函数将公式的字符串版本转换为 R 可以计算的内容

这是您的示例，包括作为变量的交互项（注意，我添加了更多随机数据点以在模型中添加一些自由度；如果有足够的数据点，lm 也应该显示标准开发者）：

outcome <- c(1,2, 2, 1, 2, 2, 2, 1)
city <- rep(c("A", "B"),4)
year <- c(2000, 2001, 2000, 2000, 2000, 2001, 2001, 2000)
educ <- c("high", "low", "low", "low", "high", "high", "high", "low")
age <- c(25,35, 30, 29, 31, 40, 20, 23)
area <- c("city", "sub", "sub", "sub", "city", "city", "city", "city")
peopleinhouse <- c(2,4, 3, 3, 5, 2, 1, 2)
adataset <- data.frame(outcome, city, year, educ, age, area, peopleinhouse, stringsAsFactors = FALSE)

avector <- c("outcome", "as.factor(city)", "year", "as.factor(educ):age", "log(age)", "as.factor(area):peopleinhouse")

tmp <- lm(data = adataset, formula = as.formula(paste0("outcome~", paste(avector[2:length(avector)], collapse = "+"))))
summary(tmp)

Call:
lm(formula = as.formula(paste0("outcome~", paste(avector[2:length(avector)], 
    collapse = "+"))), data = adataset)

Residuals:
ALL 8 residuals are 0: no residual degrees of freedom!

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)
(Intercept)                       1348.2888         NA      NA       NA
as.factor(city)B                    -0.9241         NA      NA       NA
year                                -0.6239         NA      NA       NA
log(age)                           -42.9461         NA      NA       NA
as.factor(educ)high:age              1.5155         NA      NA       NA
as.factor(educ)low:age               1.5318         NA      NA       NA
as.factor(area)city:peopleinhouse    0.3817         NA      NA       NA
as.factor(area)sub:peopleinhouse     0.5092         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 7 and 0 DF,  p-value: NA

Answer 2

为了总结我们在评论中的讨论，我将给出最终答案。希望对查找此内容的其他用户有用。

答案分为三部分：第一部分与第一个问题的解决相关，第二部分与@Tom面临的实际问题相关，第三部分与其他的使用相关包来实现@Tom 的实际目标。

第一部分

最初，问题是在给定初始数据集和字符串向量的情况下自动创建具有特定交互和转换的数据集。

解决问题的好方法是按照@krfurlong 的建议使用model.matrix。

比如这样。

给定以下数据：

outcome <- c(1,2, 2, 1, 2, 2, 2, 1)
city <- rep(c("A", "B"),4)
year <- c(2000, 2001, 2000, 2000, 2000, 2001, 2001, 2000)
educ <- c("high", "low", "low", "low", "high", "high", "high", "low")
age <- c(25,35, 30, 29, 31, 40, 20, 23)
area <- c("city", "sub", "sub", "sub", "city", "city", "city", "city")
peopleinhouse <- c(2,4, 3, 3, 5, 2, 1, 2)
adataset <- data.frame(outcome, city, year, educ, age, area, peopleinhouse, stringsAsFactors = FALSE)


avector <- c("outcome", "as.factor(city)", "year", "as.factor(educ):age", "log(age)", "as.factor(area):peopleinhouse")

您可以通过这种方式实现您的目标：

make_formula <- function(vec){
    
    x <- paste(vec[-1], collapse = " + ")
    frm <- paste(vec[1], x, sep = " ~ ")
    as.formula(frm)
    
}

as.data.frame(model.matrix(make_formula(avector), adataset))

  (Intercept) as.factor(city)B year log(age) as.factor(educ)high:age as.factor(educ)low:age as.factor(area)city:peopleinhouse as.factor(area)sub:peopleinhouse
1           1                0 2000 3.218876                      25                      0                                 2                                0
2           1                1 2001 3.555348                       0                     35                                 0                                4
3           1                0 2000 3.401197                       0                     30                                 0                                3
4           1                1 2000 3.367296                       0                     29                                 0                                3
5           1                0 2000 3.433987                      31                      0                                 5                                0
6           1                1 2001 3.688879                      40                      0                                 2                                0
7           1                0 2001 2.995732                      20                      0                                 1                                0
8           1                1 2000 3.135494                       0                     23                                 2                                0

第二部分

最终，我们发现@Tom 的实际目标是运行具有 ivmodel 包的特定模型。

解决此问题的最佳方法是使用名为 ivmodelFormula 的 ivmodel 的本机函数，它允许您设置处理转换和交互的公式。

比如这样。

library(ivmodel)

card.model2IV <- ivmodelFormula(outcome ~ log(age) + as.factor(city) + year | area + educ + as.factor(city) + year)

another_card.model2IV <- ivmodelFormula(outcome ~ log(age) + as.factor(city) + year | area:educ+ as.factor(city) + year)

尽管在这种特定情况下打印输出可能看起来相同，但可以注意到 all.equal(card.model2IV, another_card.model2IV) 不是 TRUE，具体来说，card.model2IV$Z 和 another_card.model2IV$Z 看起来符合预期。

card.model2IV$Z

#> 8 x 2 sparse Matrix of class "dgCMatrix"
#>   areasub educlow
#> 1       .       .
#> 2       1       1
#> 3       1       1
#> 4       1       1
#> 5       .       .
#> 6       .       .
#> 7       .       .
#> 8       .       1


another_card.model2IV$Z

#> 8 x 4 sparse Matrix of class "dgCMatrix"
#>   areacity.educhigh areasub.educhigh areacity.educlow areasub.educlow
#> 1                 1                .                .               .
#> 2                 .                .                .               1
#> 3                 .                .                .               1
#> 4                 .                .                .               1
#> 5                 1                .                .               .
#> 6                 1                .                .               .
#> 7                 1                .                .               .
#> 8                 .                .                1               .

第三部分

最后，我们发现该包无法处理多个检测变量。

鉴于我不是这方面的专家，如果其他用户对此有更多了解，他们可能会随时更新此答案以使其更有用。

还有其他包可以处理这种模型。

查看：

ivreg 来自 AER 包裹
felm 来自 lfe 包裹

基于 lm 规范在数据集中自动创建变量

Automating variable creation in the dataset based on lm specification

r

formula

subset