Loading tables in R with space separation

How do I load a space-separated table when some of the fields themselves contain spaces?

Simple example data:

Grade Area School Goals
4 Rural Elm Popular
4 Rural Elm Sports
4 Rural Elm Grades
4 Rural Elm Popular
3 Rural Brentwood Elementary Sports
3 Suburban Ridge Popular

Note that one of the rows has a space inside the school name ("Brentwood Elementary", as opposed to a single word like "Elm").

The following call fails with the error "line x did not have y elements":

dat = read.table("dat.txt",header=TRUE)
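One way to see which lines trigger the error is to count the fields per line with count.fields(). The following is a minimal sketch, assuming a few of the sample rows above are supplied inline rather than read from dat.txt:

```r
# A few rows from the sample data, written inline so the sketch is self-contained.
txt <- c("Grade Area School Goals",
         "4 Rural Elm Popular",
         "4 Rural Elm Sports",
         "3 Rural Brentwood Elementary Sports",
         "3 Suburban Ridge Popular")

# count.fields() reports how many whitespace-separated fields each line has.
tc <- textConnection(txt)
n <- count.fields(tc)
close(tc)
n

# Lines whose field count differs from the header's are the ones read.table() rejects.
bad <- which(n != n[1])
txt[bad]
```

Here the "Brentwood Elementary" row is the only one with five fields instead of four.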

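Edit: The data points are all factors, each taking one of a fixed set of values.

Edit: The full data set is available at http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html (thanks, @AmandaMahto).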

Actually, if you can use the data source Ananda found, this is easy, because the <pre> area is tab-delimited:

library(rvest)

# read_html() replaces html(), which is deprecated in current versions of rvest
pg <- read_html("http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html")
dat <- pg %>% html_nodes("pre") %>% html_text() 
dat <-  read.table(text=dat, sep="\t", header=TRUE, stringsAsFactors=FALSE)

dat[245:249,]

##     Gender Grade Age  Race Urban.Rural       School   Goals Grades Sports Looks Money
## 245   girl     4   9 White       Rural         Sand  Grades      1      3     2     4
## 246   girl     4   9 White       Rural         Sand  Sports      3      2     1     4
## 247   girl     4   9 White       Rural         Sand  Sports      3      2     1     4
## 248   girl     4   9 White       Rural         Sand  Grades      2      1     3     4
## 249   girl     6  12 White       Rural Brown Middle Popular      4      2     1     3
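To see why the tab-delimited source avoids the problem, here is a small sketch that does not need network access. It assumes a cut-down, three-column version of two rows from the output above, with tabs between the fields:

```r
# Two rows taken from the output above, reduced to three columns and tab-separated.
txt <- "School\tGoals\tGrades\nSand\tGrades\t1\nBrown Middle\tPopular\t4"

# With sep = "\t" only tabs split fields, so the space in "Brown Middle" is kept.
demo <- read.table(text = txt, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
demo$School
```

Because the separator is an explicit tab, spaces inside a field are treated as ordinary characters.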

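To actually answer your question (this is somewhat like Ananda's answer), you need to know which column causes the problem and work around it. This approach uses gsubfn and a predefined set of values for that column to join each multi-word name into a single token, then splits the names again after reading: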

library(gsubfn)

# awful.txt is here https://gist.github.com/hrbrmstr/13cee15c91fdadb10fbc

lines <- readLines("awful.txt")

schools <- c("Brentwood Elementary", "Brentwood Middle", "Brown Middle", "Westdale Middle")
expr <- paste("(", paste(schools, collapse="|"), ")", sep="")
lines <- gsubfn(expr, function(x) { gsub(" ", "_", x) }, lines)

dat <- read.table(text=paste(lines, sep="", collapse="\n"), 
                  header=TRUE, stringsAsFactors=FALSE)

dat$School <- gsub("_", " ", dat$School)

dat[c(1,34,94,198,255,324,377,433),]

##     Gender Grade Age  Race Urban.Rural               School   Goals Grades Sports Looks Money
## 1      boy     5  11 White       Rural                  Elm  Sports      1      2     4     3
## 34     boy     4  10 White    Suburban Brentwood Elementary  Grades      2      1     3     4
## 94    girl     6  11 White    Suburban     Brentwood Middle  Grades      3      4     1     2
## 198    boy     5  10 White       Rural                Ridge  Sports      4      2     1     3
## 255   girl     6  12 Other       Rural         Brown Middle  Grades      3      2     1     4
## 324    boy     4   9 Other       Urban                 Main  Grades      4      1     3     2
## 377    boy     4   9 White       Urban              Portage Popular      4      1     2     3
## 433   girl     6  11 White       Urban      Westdale Middle Popular      4      2     1     3
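If you would rather not depend on gsubfn, the same join-then-split idea can be sketched in base R. This is only an illustration: the lines vector below is a hypothetical four-column stand-in for awful.txt, built from three rows of the output above.

```r
schools <- c("Brentwood Elementary", "Brentwood Middle", "Brown Middle", "Westdale Middle")

# Hypothetical stand-in for a few lines of awful.txt (columns trimmed for brevity).
lines <- c("Gender Grade School Goals",
           "boy 5 Elm Sports",
           "boy 4 Brentwood Elementary Grades",
           "girl 6 Westdale Middle Popular")

# Replace each known multi-word name with its underscored form, one name at a time.
for (s in schools) {
  lines <- gsub(s, gsub(" ", "_", s, fixed = TRUE), lines, fixed = TRUE)
}

dat <- read.table(text = lines, header = TRUE, stringsAsFactors = FALSE)

# Put the spaces back once the table has been parsed.
dat$School <- gsub("_", " ", dat$School, fixed = TRUE)
dat$School
```

Using fixed = TRUE keeps the school names from being interpreted as regular expressions.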

Unfortunately, the answer to this question is pretty much "It depends on how much you know about the data set."

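For example, the description of the data set specifies the possible values for each variable. Here, we know that only a few schools have multi-word names, and that those names follow a predictable pattern: they end in "Elementary" or "Middle".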

So you can read the data in with readLines and figure out the least obtrusive way to insert a delimiter before re-reading the data with read.table.

Here's an example:

Sample data:

cat("Grade Area School Goals Value",
    "4 Rural Elm Popular 1",
    "4 Rural Elm Sports 2",
    "4 Rural Elm Grades 1",
    "4 Rural Elm Popular 3",
    "3 Rural Brentwood Elementary Sports 4",
    "3 Rural Brentwood Middle Grades 3",
    "3 Suburban Ridge Popular 3", sep = "\n", file = "test.txt")

Read it in as a character vector:

x <- readLines("test.txt")

Use gsub to force the multi-word school names into single words (joined by an underscore). Then use read.table to get your data.frame.

read.table(text = gsub(" (Elementary|Middle)", "_\\1", x), header = TRUE)
#   Grade     Area               School   Goals Value
# 1     4    Rural                  Elm Popular     1
# 2     4    Rural                  Elm  Sports     2
# 3     4    Rural                  Elm  Grades     1
# 4     4    Rural                  Elm Popular     3
# 5     3    Rural Brentwood_Elementary  Sports     4
# 6     3    Rural     Brentwood_Middle  Grades     3
# 7     3 Suburban                Ridge Popular     3
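If you do not know the multi-word names in advance but do know that School is the only column that can contain spaces, a position-based approach is another option. The sketch below is an assumption-laden illustration, not part of the answer above: rebuild() is a hypothetical helper, and it assumes exactly two single-word fields before School and two after it.

```r
# A subset of the lines from test.txt, supplied inline.
x <- c("Grade Area School Goals Value",
       "4 Rural Elm Popular 1",
       "3 Rural Brentwood Elementary Sports 4",
       "3 Rural Brentwood Middle Grades 3",
       "3 Suburban Ridge Popular 3")

# Split every line into whitespace-separated tokens.
tokens <- strsplit(x, " +")

# Assumption: n_before single-word fields precede School and n_after follow it,
# so whatever is left in the middle belongs to School.
rebuild <- function(tok, n_before = 2, n_after = 2) {
  n <- length(tok)
  school <- paste(tok[(n_before + 1):(n - n_after)], collapse = " ")
  c(tok[seq_len(n_before)], school, tok[(n - n_after + 1):n])
}

# One row per data line; the first element of tokens is the header.
fields <- t(sapply(tokens[-1], rebuild))
dat <- as.data.frame(fields, stringsAsFactors = FALSE)
names(dat) <- tokens[[1]]

# Convert the columns that hold numbers from character to integer.
dat[] <- lapply(dat, type.convert, as.is = TRUE)
dat$School
```

This keeps the spaces in the school names without any placeholder character, at the cost of hard-coding the column positions.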