Is there a simple way to use a Stata .do script to import fixed width data into R?
I have a Stata .do script file for importing data from a fixed-width TXT file. The .do file looks like this:
#delimit ;
**************************************************************************
Label : CDS 2014 ID Map
Rows : 4353
Columns : 7
ASCII File Date : December 11, 2017
*************************************************************************;
infix
CHLDID14 1 - 5 CHLDSN14 6 - 7 PCGID14 8 - 12
PCGSN14 13 - 14 CDSHID14 15 - 18 CHLDINST14 19 - 20
PCGINST14 21 - 22
using [path]\IDMAP14.txt, clear
;
label variable CHLDID14 "CHILD 2013 PSID FAMILY IW (ID) NUMBER" ;
label variable CHLDSN14 "CHILD 2013 INDIVIDUAL SEQUENCE NUMBER" ;
label variable PCGID14 "PCG 2013 PSID FAMILY IW (ID) NUMBER" ;
label variable PCGSN14 "PCG 2013 INDIVIDUAL SEQUENCE NUMBER" ;
label variable CDSHID14 "CDS 2014 HOUSEHOLD INTERVIEW NUMBER" ;
label variable CHLDINST14 "CDS 2014 HH ROSTER CHILD SEQUENCE NUM" ;
label variable PCGINST14 "CDS 2014 HH ROSTER PCG SEQUENCE NUM" ;
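Is there a quick way to use this .do file to import the data into R automatically, or do I have to adapt the script by hand using the column ranges?
I ask because I only have access to R (not Stata), but the Stata .do file looks like the easiest and quickest way to get the data into R correctly.
Thanks!
Link to the files: Fixed-width text file and Stata .do script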
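Here is an attempt, but since we don't have a file to validate against, you're on your own. I've made several assumptions about the format that may need verifying, namely:
- the column definitions sit between the line containing the literal infix and the line containing the literal using
- every column definition has the form columnname from hyphen to, separated by spaces (even a single-character column is written as somename 5 - 5)
- the file name immediately follows the literal using; a trailing comma may be followed by clear or other non-comma characters, which are not part of the file name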
do2fwf <- function(txt) {
  # locate the single 'infix' line and the single 'using' line
  # (backslashes must be doubled inside R string literals)
  infix <- grep("^infix\\s*", txt)
  if (length(infix) != 1L) stop("need exactly one 'infix' line")
  using <- grep("^\\s*using\\b", txt)
  if (length(using) != 1L) stop("need exactly one 'using' line")
  if (using < infix) stop("'infix' must occur before 'using'")
  hdrtxt <- txt[ (infix+1):(using-1) ]
  # " CHLDID14 1 - 5 CHLDSN14 6 - 7 PCGID14 8 - 12 "
  re <- gregexpr("\\S+", hdrtxt)
  m <- regmatches(hdrtxt, re)
  # [[1]]
  # [1] "CHLDID14" "1" "-" "5" "CHLDSN14" "6" "-" "7" "PCGID14" "8"
  # [11] "-" "12"
  if (!all(lengths(m) %% 4 == 0))
    stop("not all variables are the right format of 'name i - j'")
  if (any(lengths(m) == 0)) {
    warning("found empty lines, confusing")
    m <- Filter(length, m)
  }
  # need to convert 4x lists into 1x lists
  # SIMPLIFY = FALSE keeps mapply from collapsing the result when every
  # line happens to hold the same number of variables
  m2 <- do.call("c", mapply(split, m,
                            lapply(lengths(m), function(a) (seq_len(a) - 1) %/% 4),
                            SIMPLIFY = FALSE))
  nms <- sapply(m2, `[[`, 1)
  froms <- as.integer(sapply(m2, `[[`, 2))
  tos <- as.integer(sapply(m2, `[[`, 4))
  widths <- tos - froms + 1
  filename <- gsub("^\\s*using\\s*", "", txt[using])
  # this works here, but I don't know if it is generic and rule-following
  filename <- gsub("\\s*,[^,]*$", "", filename)
  list(filename = filename, names = nms, widths = widths)
  # x <- read.fwf(filename, widths=widths, ...) # header=FALSE???
  # colnames(x) <- nms
}
If you use the data at the bottom (in practice it would be txt <- readLines("somefile.do")), you get this:
do2fwf(txt)
# $filename
# [1] "[path]\\IDMAP14.txt"
# $names
# 0 1 2 0 1 2 0
# "CHLDID14" "CHLDSN14" "PCGID14" "PCGSN14" "CDSHID14" "CHLDINST14" "PCGINST14"
# $widths
# [1] 5 2 5 2 4 2 2
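One more assumption hides in those widths: computing each one as to - from + 1 only describes the file correctly if the ranges start at column 1 and leave no gaps between variables. infix can read a subset of columns, and read.fwf expresses a skipped stretch as a negative width. Below is a minimal sketch of a helper that inserts those negative widths. The name ranges_to_widths is hypothetical (it is not part of the function above), and it has not been tried on the real file.

```r
# Hypothetical helper: convert start/end column positions into read.fwf()
# widths, adding a negative width wherever columns are skipped.
ranges_to_widths <- function(froms, tos) {
  stopifnot(length(froms) == length(tos), all(tos >= froms))
  prev_end <- c(0, tos[-length(tos)])   # end column of the preceding field
  gaps <- froms - prev_end - 1          # number of skipped columns before each field
  if (any(gaps < 0)) stop("column ranges overlap or are out of order")
  # interleave (negative gap, field width) pairs, then drop the zero-length gaps
  out <- as.vector(rbind(-gaps, tos - froms + 1))
  out[out != 0]
}

ranges_to_widths(c(1, 6, 8), c(5, 7, 12))  # contiguous: 5 2 5
ranges_to_widths(c(1, 8), c(5, 12))        # columns 6-7 skipped: 5 -2 5
```

When negative widths are present, read.fwf returns only the non-skipped columns, so the vector of names still lines up with the variables one to one.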
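You can take it from here (as per the comments in the function). I don't know about a header row or whatever other arguments read.fwf might need. Good luck!
Data: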
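As a rough sketch of that last step, the names and widths can be passed straight to read.fwf. The spec list below simply repeats the values shown in the output above, and the two records are made up purely to show the mechanics (they are not taken from IDMAP14.txt). With the real file you would pass the file path instead of the textConnection.

```r
# Names and widths as returned by do2fwf() for the example .do file.
spec <- list(
  names  = c("CHLDID14", "CHLDSN14", "PCGID14", "PCGSN14",
             "CDSHID14", "CHLDINST14", "PCGINST14"),
  widths = c(5, 2, 5, 2, 4, 2, 2)
)

# Two invented 22-character records, for illustration only.
sample_lines <- c("0000101000020100010102",
                  "0000302000040200020304")

# A fixed-width data file normally has no header row, hence header = FALSE;
# col.names is passed through to read.table().
dat <- read.fwf(textConnection(sample_lines),
                widths = spec$widths,
                col.names = spec$names,
                header = FALSE)
str(dat)
```

By default the fields are converted to numbers, so leading zeros are dropped. If the IDs need to stay as text, adding colClasses = "character" to the call keeps them verbatim.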
txt <- readLines(textConnection('**************************************************************************
Label : CDS 2014 ID Map
Rows : 4353
Columns : 7
ASCII File Date : December 11, 2017
*************************************************************************;
infix
CHLDID14 1 - 5 CHLDSN14 6 - 7 PCGID14 8 - 12
PCGSN14 13 - 14 CDSHID14 15 - 18 CHLDINST14 19 - 20
PCGINST14 21 - 22
using [path]\\IDMAP14.txt, clear
;
label variable CHLDID14 "CHILD 2013 PSID FAMILY IW (ID) NUMBER" ;
label variable CHLDSN14 "CHILD 2013 INDIVIDUAL SEQUENCE NUMBER" ;
label variable PCGID14 "PCG 2013 PSID FAMILY IW (ID) NUMBER" ;
label variable PCGSN14 "PCG 2013 INDIVIDUAL SEQUENCE NUMBER" ;
label variable CDSHID14 "CDS 2014 HOUSEHOLD INTERVIEW NUMBER" ;
label variable CHLDINST14 "CDS 2014 HH ROSTER CHILD SEQUENCE NUM" ;
label variable PCGINST14 "CDS 2014 HH ROSTER PCG SEQUENCE NUM" ; '))
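The label variable lines in the .do file can be picked up in a similar way. The sketch below assumes one label per line, with the label text in double quotes and an optional trailing semicolon. The names do_lines, pat and var_labels are illustrative only, and the two input lines are copied from the .do file in the question.

```r
# Two 'label variable' lines copied from the .do file.
do_lines <- c(
  'label variable CHLDID14 "CHILD 2013 PSID FAMILY IW (ID) NUMBER" ;',
  'label variable CHLDSN14 "CHILD 2013 INDIVIDUAL SEQUENCE NUMBER" ;'
)

# Capture group 1 is the variable name, group 2 the quoted label text.
pat <- '^\\s*label variable\\s+(\\S+)\\s+"(.*)"\\s*;?\\s*$'
hits <- grepl(pat, do_lines)

# Named character vector: names are the variables, values are the labels.
var_labels <- setNames(sub(pat, "\\2", do_lines[hits]),
                       sub(pat, "\\1", do_lines[hits]))
var_labels[["CHLDID14"]]
# [1] "CHILD 2013 PSID FAMILY IW (ID) NUMBER"
```

Lines that do not match the pattern are ignored, so the same code can be run over the whole result of readLines on the .do file.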