How to find common variables in a list of datasets & reshape them in R?

Question

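    library(readstata13)  # provides read.dta13
    setwd("C:/Users/DATA")  # forward slashes: "\U" is an invalid escape in an R string
    temp = list.files(pattern="*.dta")
    for (i in 1:length(temp)) assign(temp[i], read.dta13(temp[i], nonint.factors = TRUE))
    grep(pattern="_m", temp, value=TRUE)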

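Here I create a list of dataset file names and read the files into R. I then try to use grep to find all variable names containing the pattern _m. Obviously this does not work, because it only returns the file names that contain _m. What I want is for my code to loop through the list of datasets, look for variables ending in _m, and return a list of the datasets that contain such variables.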

I am not sure how to do this; I am quite new to coding and to R.

Besides knowing which datasets contain these variables, I also need to be able to make changes to (reshape) those variables.

Answer 1

Here is one way to determine which files have variable names ending in "_m":

    # setup
    library(readstata13)  # provides read.dta13
    setwd("C:/Users/DATA")  # forward slashes: "\U" is an invalid escape in an R string
    temp <- list.files(pattern = "\\.dta$")  # pattern is a regular expression, not a glob
    # logical vector to be filled in
    inFileVec <- logical(length(temp))

    # loop through each file
    for (i in seq_along(temp)) {
      # read file
      fileTemp <- read.dta13(temp[i], nonint.factors = TRUE)

      # fill in vector with TRUE if any variable ends in "_m"
      inFileVec[i] <- any(grepl("_m$", names(fileTemp)))
    }

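In the last line of the loop, names returns the variable names, grepl returns a logical vector indicating whether each variable name matches the pattern, and any returns a logical vector of length 1 indicating whether grepl returned at least one TRUE.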
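To make the three steps concrete, here is a small walkthrough on a made-up data frame (the columns id, height_m and x_mean are hypothetical and only stand in for one imported file). It also shows why the trailing $ matters: without it, a name that merely contains "_m" is matched as well.

```r
# hypothetical data frame standing in for one imported file
fileTemp <- data.frame(id = 1:3,
                       height_m = c(1.6, 1.7, 1.8),
                       x_mean = c(10, 20, 30))

varNames <- names(fileTemp)          # character vector of column names
matches  <- grepl("_m$", varNames)   # one TRUE/FALSE per name; TRUE only if it ends in "_m"
anyMatch <- any(matches)             # single TRUE/FALSE for the whole data frame

# without the "$" anchor, "x_mean" would match too
unanchored <- grepl("_m", varNames)
```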

    # print out these file names
    temp[inFileVec]
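If the imported data frames are kept in a named list instead of being discarded on each iteration, the same check can be applied to all of them and the matching ones can be changed afterwards. The following is only a sketch under assumptions: the toy data frames replace the real .dta files, the names datasets, hasM and toLong and the id column are made up for illustration, and stacking the "_m" columns into long format is just one possible reading of "reshape".

```r
# toy stand-ins for the imported files (hypothetical data)
datasets <- list(
  "a.dta" = data.frame(id = 1:2, height_m = c(1.7, 1.8), weight_m = c(60, 75)),
  "b.dta" = data.frame(id = 1:2, age = c(30, 41)),
  "c.dta" = data.frame(id = 1:3, income_m = c(10, 20, 30))
)

# TRUE for each dataset with at least one column name ending in "_m"
hasM <- vapply(datasets, function(d) any(grepl("_m$", names(d))), logical(1))

# names of the datasets that contain such variables
matchingFiles <- names(datasets)[hasM]

# stack the "_m" columns of one data frame into long format
# (assumes every data frame has an "id" column)
toLong <- function(d) {
  mCols <- grep("_m$", names(d), value = TRUE)
  long  <- stack(d[mCols])  # data frame with columns "values" and "ind"
  data.frame(id = rep(d$id, times = length(mCols)),
             variable = as.character(long$ind),
             value = long$values)
}

# apply the reshape only to the matching datasets
longList <- lapply(datasets[hasM], toLong)
```

Keeping everything in one list means later changes can be made with lapply rather than by looking up separately named objects.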

Answer 2

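First, assign will not work the way you think, because it expects a string (or character, as strings are called in R). It will use the first element as the variable name (see the documentation for assign for details).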

What you can do depends on the structure of your data. read.dta13 will load each file as a data.frame.
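One way to avoid assign altogether is to read every file into a single named list. The sketch below uses temporary CSV files and read.csv as stand-ins so that it runs without any Stata files; with real data the reader would be read.dta13 and the pattern something like "\\.dta$". The names dataDir, files and datasets are illustrative only.

```r
# write two small CSV files to a temporary directory (stand-ins for .dta files)
dataDir <- tempfile("data")
dir.create(dataDir)
write.csv(data.frame(id = 1:2, score_m = c(3, 4)),
          file.path(dataDir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:2, age = c(30, 41)),
          file.path(dataDir, "b.csv"), row.names = FALSE)

# list the files; full.names = TRUE avoids the need for setwd()
files <- list.files(dataDir, pattern = "\\.csv$", full.names = TRUE)

# read every file into one list, named by file name
datasets <- lapply(files, read.csv)
names(datasets) <- basename(files)

# each element is a data.frame that can be looked up by name
str(datasets[["a.csv"]])
```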

If you are looking for column names, you can do this:

    myList <- character()
    for (i in seq_along(temp)) {

        # save the content of your file in a data frame
        df <- read.dta13(temp[i], nonint.factors = TRUE)

        # identify the names of the columns ending in "_m"
        varMatch <- grep(pattern = "_m$", colnames(df), value = TRUE)

        # check if at least one of the columns matches the pattern
        if (length(varMatch) > 0) {
            myList <- c(myList, temp[i]) # save the file name if there is a match
        }

    }

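If you want to look at the contents of the columns, have a look at the dplyr package, which is very useful for manipulating data frames.

A good introduction to dplyr is provided in the package vignette.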

Note that in R, appending to a vector can become very slow (see this SO question for details).
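As a rough illustration of the alternative, the loop can record one flag per file in a preallocated vector and subset once at the end, instead of growing the result with c(). The file names and the hasMatch results below are hypothetical placeholders for the real per-file checks.

```r
fileNames <- c("a.dta", "b.dta", "c.dta", "d.dta")  # hypothetical file names
hasMatch  <- c(TRUE, FALSE, TRUE, TRUE)             # hypothetical per-file results

# growing: c() builds a new, longer vector on every append
grown <- character()
for (i in seq_along(fileNames)) {
  if (hasMatch[i]) grown <- c(grown, fileNames[i])
}

# preallocating: fill a fixed-length logical vector, then subset once
keep <- logical(length(fileNames))
for (i in seq_along(fileNames)) {
  keep[i] <- hasMatch[i]
}
selected <- fileNames[keep]

# both approaches give the same file names
identical(grown, selected)
```

For a handful of files the difference is negligible; it only starts to matter when the loop runs many thousands of times.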

