Loading tables in R with space separation
How can I load a space-separated table that also contains spaces inside a field?
Simple example data:
Grade Area School Goals
4 Rural Elm Popular
4 Rural Elm Sports
4 Rural Elm Grades
4 Rural Elm Popular
3 Rural Brentwood Elementary Sports
3 Suburban Ridge Popular
Note how one of the rows uses a space inside the school name ("Brentwood Elementary" as opposed to "Elm").
The following call fails with: "line x did not have y elements"
dat = read.table("dat.txt",header=TRUE)
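Edit: The data points are all factors, each taking one of a fixed set of values.
Edit: The complete data is available at http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html
Thanks to @AmandaMahto
Actually, if you can use the data source Ananda found, this is easy, because the <pre> section is tab-delimited: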
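library(rvest)
# read_html() is the current replacement for the older html() call
pg <- read_html("http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html")
# the data sits in the page's <pre> block
dat <- pg %>% html_nodes("pre") %>% html_text()
# tab is the separator, so spaces inside a field are left alone
dat <- read.table(text=dat, sep="\t", header=TRUE, stringsAsFactors=FALSE)
dat[245:249,]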
## Gender Grade Age Race Urban.Rural School Goals Grades Sports Looks Money
## 245 girl 4 9 White Rural Sand Grades 1 3 2 4
## 246 girl 4 9 White Rural Sand Sports 3 2 1 4
## 247 girl 4 9 White Rural Sand Sports 3 2 1 4
## 248 girl 4 9 White Rural Sand Grades 2 1 3 4
## 249 girl 6 12 White Rural Brown Middle Popular 4 2 1 3
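The page above may not always be reachable, so here is a minimal offline sketch of the same idea. The inline tab_text string and the dat_tab name are made up for illustration; only read.table with sep="\t" comes from the code above.

```r
# Hypothetical inline sample: fields are separated by tabs,
# so the space inside "Brentwood Elementary" is plain data.
tab_text <- paste(
  "Grade\tArea\tSchool\tGoals",
  "4\tRural\tElm\tPopular",
  "3\tRural\tBrentwood Elementary\tSports",
  sep = "\n"
)

# sep = "\t" tells read.table to split on tabs only
dat_tab <- read.table(text = tab_text, sep = "\t", header = TRUE,
                      stringsAsFactors = FALSE)

print(dat_tab$School)
```

With a tab separator the school name stays in one column, which is why no extra cleanup is needed for this source.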
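To actually answer your question (this is somewhat like Ananda's answer), you need to know where the problem column is and work around it. This one uses gsubfn and the predefined values for that column to join each name into a single token, read the table, and then split the names again: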
library(gsubfn)
# awful.txt is here https://gist.github.com/hrbrmstr/13cee15c91fdadb10fbc
lines <- readLines("awful.txt")
# the known multi-word values of the School column
schools <- c("Brentwood Elementary", "Brentwood Middle", "Brown Middle", "Westdale Middle")
# build one alternation pattern that matches any of them
expr <- paste("(", paste(schools, collapse="|"), ")", sep="")
# replace the space inside each matched name with an underscore
lines <- gsubfn(expr, function(x) { gsub(" ", "_", x) }, lines)
dat <- read.table(text=paste(lines, sep="", collapse="\n"),
                  header=TRUE, stringsAsFactors=FALSE)
# restore the spaces now that the columns have been parsed
dat$School <- gsub("_", " ", dat$School)
dat[c(1,34,94,198,255,324,377,433),]
## Gender Grade Age Race Urban.Rural School Goals Grades Sports Looks Money
## 1 boy 5 11 White Rural Elm Sports 1 2 4 3
## 34 boy 4 10 White Suburban Brentwood Elementary Grades 2 1 3 4
## 94 girl 6 11 White Suburban Brentwood Middle Grades 3 4 1 2
## 198 boy 5 10 White Rural Ridge Sports 4 2 1 3
## 255 girl 6 12 Other Rural Brown Middle Grades 3 2 1 4
## 324 boy 4 9 Other Urban Main Grades 4 1 3 2
## 377 boy 4 9 White Urban Portage Popular 4 1 2 3
## 433 girl 6 11 White Urban Westdale Middle Popular 4 2 1 3
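If installing gsubfn is not an option, the same protect-then-restore idea can be sketched with base R only. This is an illustration under assumptions, not the answer's own code: raw_lines is a made-up three-column sample, and the schools vector is shortened to the names that appear in it.

```r
# Hypothetical sample lines; the real file has more columns.
raw_lines <- c(
  "Gender School Goals",
  "boy Elm Sports",
  "girl Brown Middle Grades",
  "boy Brentwood Elementary Popular"
)

# Known multi-word school names (assumed to be complete for this sample)
schools <- c("Brentwood Elementary", "Brown Middle")

# Protect: swap the space inside each known name for an underscore.
# fixed = TRUE treats the name as a literal string, not a regex.
for (s in schools) {
  raw_lines <- gsub(s, gsub(" ", "_", s, fixed = TRUE), raw_lines, fixed = TRUE)
}

# Parse: every line now has the same number of whitespace-separated fields
dat_base <- read.table(text = raw_lines, header = TRUE,
                       stringsAsFactors = FALSE)

# Restore: put the spaces back into the School column
dat_base$School <- gsub("_", " ", dat_base$School, fixed = TRUE)

print(dat_base)
```

The loop trades the single gsubfn call for one literal replacement per name, which is fine for a handful of schools.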
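Unfortunately, the answer to this question is pretty much "It depends on how much you know about the data set."
For example, the description of the data set specifies the possible values of each variable. Here, we know that only a few schools have multi-word names, and that those names follow a predictable pattern: they end in "Elementary" or "Middle".
As such, you can read the data in using readLines, and figure out the least obtrusive way to insert a delimiter before re-reading the data using read.table.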
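Before patching anything, it can help to see which lines actually have too many fields. A minimal sketch, assuming the question's sample is held in a character vector (sample_lines is a made-up name), using count.fields from the utils package:

```r
# The question's sample data as a character vector
sample_lines <- c(
  "Grade Area School Goals",
  "4 Rural Elm Popular",
  "4 Rural Elm Sports",
  "4 Rural Elm Grades",
  "4 Rural Elm Popular",
  "3 Rural Brentwood Elementary Sports",
  "3 Suburban Ridge Popular"
)

# count.fields() reports how many whitespace-separated fields each line has
con <- textConnection(sample_lines)
n_fields <- count.fields(con)
close(con)

# Lines whose field count differs from the header are the ones
# that trigger "line x did not have y elements"
bad_lines <- which(n_fields != n_fields[1])

print(n_fields)
print(sample_lines[bad_lines])
```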
Here's an example:
Sample data:
cat("Grade Area School Goals Value",
"4 Rural Elm Popular 1",
"4 Rural Elm Sports 2",
"4 Rural Elm Grades 1",
"4 Rural Elm Popular 3",
"3 Rural Brentwood Elementary Sports 4",
"3 Rural Brentwood Middle Grades 3",
"3 Suburban Ridge Popular 3", sep = "\n", file = "test.txt")
Read it in as a character vector:
x <- readLines("test.txt")
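Use gsub to force the multi-word school names into a single word (separated by an underscore). Then use read.table to get your data.frame. Note that the backreference has to be written as "\\1" inside an R string; a single backslash would be read as an escape sequence instead.
read.table(text = gsub(" (Elementary|Middle)", "_\\1", x), header = TRUE)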
# Grade Area School Goals Value
# 1 4 Rural Elm Popular 1
# 2 4 Rural Elm Sports 2
# 3 4 Rural Elm Grades 1
# 4 4 Rural Elm Popular 3
# 5 3 Rural Brentwood_Elementary Sports 4
# 6 3 Rural Brentwood_Middle Grades 3
# 7 3 Suburban Ridge Popular 3
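When the naming pattern is not known in advance, a different route is to rely on column position instead. The sketch below is an assumption-laden illustration, not part of the answers above: it assumes School is the only column that can contain spaces, with two fixed fields before it and two after it, and x_pos and dat_pos are made-up names.

```r
# A shortened copy of the sample data (hypothetical subset)
x_pos <- c(
  "Grade Area School Goals Value",
  "4 Rural Elm Popular 1",
  "3 Rural Brentwood Elementary Sports 4",
  "3 Suburban Ridge Popular 3"
)

# Split every data line (header dropped) on runs of spaces
parts <- strsplit(x_pos[-1], " +")

# The first two and last two tokens are fixed columns;
# whatever remains in the middle is the school name.
rows <- lapply(parts, function(p) {
  n <- length(p)
  data.frame(
    Grade  = as.integer(p[1]),
    Area   = p[2],
    School = paste(p[3:(n - 2)], collapse = " "),
    Goals  = p[n - 1],
    Value  = as.integer(p[n]),
    stringsAsFactors = FALSE
  )
})

dat_pos <- do.call(rbind, rows)
print(dat_pos)
```

This only works while exactly one column is allowed to hold spaces; with two such columns the split point becomes ambiguous and a list of known values is needed again.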