Converting text file to data frame R

Question

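I know similar questions have been asked here, but I still think my task is more complicated.

I have a text file named MX.txt containing information from the geonames.org project, with the data arranged as follows: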

MX  20158   Villas del Cobano   Aguascalientes  AGU Aguascalientes  
001      Aguascalientes 01  21.8495 -102.3052   1
MX  20158   Hacienda el Cobano  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01 21.8495  -102.3052   1
MX  20159   Alianza Ferrocarrilera  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.8495 -102.3052   1
MX  20159   Bosques del Prado Oriente   Aguascalientes  AGU Aguascalientes
001 Aguascalientes  01  21.8495 -102.3052   1
MX  20160   Francisco Guel Jimenez  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.7561 -102.305    1
MX  20160   Las Viñas INFONAVIT Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.7561 -102.305    1
MX  20164   Santa Anita 4a Sección  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.7561 -102.305    1

The file has several thousand lines.

I want to convert it into a data frame with 12 variables, where a string such as "Villas del Cobano" is a single entry, like this:

V1  V2      V3                  V4              V5  V6
MX  20158   Villas del Cobano   Aguascalientes  AGU Aguascalientes  
V7  V8              V9  V10     V11         V12
001 Aguascalientes  01  21.8495 -102.3052   1
V1  V2      V3                  V4              V5  V6
MX  20158   Hacienda el Cobano  Aguascalientes  AGU Aguascalientes
V7  V8              V9  V10     V11         V12 
001 Aguascalientes  01 21.8495  -102.3052   1

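I have already tried answers previously posted here, for example: Converting text file into data frame in R, converting multiple lines of text into a data frame

Since English is not my first language, if my question is not clear enough, I would rather answer questions in the comments section than receive downvotes.

Thanks in advance!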

Answer 1

The columns are separated by tabs, so use

data <- read.table(file="MX.txt", sep="\t", quote="", comment.char="")

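There is a catch with the geonames data: place names sometimes contain #. By default, read.table treats # as the start of a comment and discards the rest of the line, so you need to set comment.char="".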
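To see the effect of comment.char in isolation, here is a minimal sketch on two made-up tab-separated records (the values and the four-column layout are invented for illustration and are not the real geonames layout). It reads the same text twice, once with the default and once with comment.char="".

```r
# Two made-up tab-separated records; the first place name contains "#"
txt <- paste(
  "MX\t20158\tAguascalientes\tVillas #2",
  "MX\t20159\tAguascalientes\tAlianza",
  sep = "\n"
)

# Default comment.char = "#": everything from "#" onward is dropped
truncated <- read.table(text = txt, sep = "\t", quote = "",
                        stringsAsFactors = FALSE)

# comment.char = "": "#" is read as ordinary data
intact <- read.table(text = txt, sep = "\t", quote = "", comment.char = "",
                     stringsAsFactors = FALSE)

print(truncated$V4)
print(intact$V4)
```

Setting quote="" serves a similar purpose: it turns off quote handling, so apostrophes and double quotes inside place names are read literally instead of opening a quoted field.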

Answer 2

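I came up with a lengthy solution that may get you what you want. In short, I use the known distance from the beginning and end of each nested list to isolate the "multipart name", concatenate it, and insert it as a column among the other data.

The splitAt function comes from R split numeric vector at position.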

#Support functions
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))
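extractplace <- function(x) {
  len <- length(x)
  #drop the 2 leading and 9 trailing fields; what remains is the place name
  place0 <- x[-c(1:2,(len-8):len)]
  paste(place0, collapse=" ")
}
extractother <- function(x) {
  len <- length(x)
  #keep only the 2 leading and 9 trailing fields
  x[c(1:2,(len-8):len)]
}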

#initial data processing
elems <- scan(file="MX.txt", what=character()) #creates a vector of all whitespace-separated elements in your txt file
inds <- grep(pattern="^MX$", elems) #finds indices of elements equal to "MX", which starts every nested list
lists <- splitAt(elems, inds) #creates a list of nested lists

#create the matrix you want
placevector <- sapply(lists, function(x) extractplace(x)) #vector of multipart names
othermatrix <- t(sapply(lists, function(x) extractother(x))) #matrix of remaining data
fullmatrix <- cbind(othermatrix[,1:2],placevector,othermatrix[,3:11]) #inserts multipart names in matrix
colnames(fullmatrix) <- paste("V",1:12, sep="")

fullmatrix
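The index arithmetic is easier to follow on a small input. The sketch below applies the same splitAt helper to the tokens of two sample records typed in by hand (standing in for what scan() would return) and extracts the place names by dropping the 2 leading and 9 trailing tokens. The names tokens, records and place_names are introduced here for illustration only.

```r
# Same helper as in the answer: start a new group at each position in pos
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))

# Whitespace-separated tokens of two records, as scan() would return them
tokens <- c("MX", "20158", "Villas", "del", "Cobano", "Aguascalientes", "AGU",
            "Aguascalientes", "001", "Aguascalientes", "01", "21.8495",
            "-102.3052", "1",
            "MX", "20159", "Alianza", "Ferrocarrilera", "Aguascalientes", "AGU",
            "Aguascalientes", "001", "Aguascalientes", "01", "21.8495",
            "-102.3052", "1")

# One list element per record, split where a token equals "MX"
records <- splitAt(tokens, which(tokens == "MX"))

# The place name is whatever is left after removing the fixed-size ends
place_names <- vapply(records, function(x) {
  len <- length(x)
  paste(x[-c(1:2, (len - 8):len)], collapse = " ")
}, character(1))

print(lengths(records))
print(place_names)
```

Note that this only works when each of the nine trailing fields is a single token. A multi-word state or municipality name would shift the count, and the extra words would end up in the place name.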

Answer 3

This assumes the rest of the data is similar to this sample. I had to do a lot of cleaning (i.e. gsubbing):

Code:

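# pair each odd line with the even line that follows it, giving one record per element
vect <- unlist(Map(function(x, y) paste(x, y), dat[c(TRUE, FALSE)], dat[c(FALSE, TRUE)]), 
    use.names = FALSE)
# normalize the spacing around codes and numbers, then turn runs of 2+ spaces into commas
read.table(text=gsub("\\s{2,}", ", ", gsub("(\\s)(\\d{2,})", "  \\2", 
    gsub("(\\d{2,}|[A-Z]+)\\s+", "\\1  ", vect))), sep=",")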

Data in an easy-to-read form:

dat <- readLines(n=14)
MX  20158   Villas del Cobano   Aguascalientes  AGU Aguascalientes  
001      Aguascalientes 01  21.8495 -102.3052   1
MX  20158   Hacienda el Cobano  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01 21.8495  -102.3052   1
MX  20159   Alianza Ferrocarrilera  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.8495 -102.3052   1
MX  20159   Bosques del Prado Oriente   Aguascalientes  AGU Aguascalientes
001 Aguascalientes  01  21.8495 -102.3052   1
MX  20160   Francisco Guel Jimenez  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.7561 -102.305    1
MX  20160   Las Viñas INFONAVIT Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.7561 -102.305    1
MX  20164   Santa Anita 4a Sección  Aguascalientes  AGU Aguascalientes  
001 Aguascalientes  01  21.7561 -102.305    1
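The two building blocks of this approach can be tried on their own: pairing alternate lines with recycled logical indices, and turning runs of two or more whitespace characters into a delimiter. The following is a simplified sketch on four made-up, shortened lines. It leaves out the extra gsub passes from the answer, and the names first_half, second_half, records, csv_like and df are illustrative.

```r
# Four made-up lines: each record is split across two consecutive lines
dat <- c("MX  20158   Villas del Cobano", "001  21.8495",
         "MX  20159   Alianza Ferrocarrilera", "001  21.7561")

# c(TRUE, FALSE) is recycled along dat, so it selects lines 1, 3, ...
first_half  <- dat[c(TRUE, FALSE)]
second_half <- dat[c(FALSE, TRUE)]

# Join with two spaces so the line boundary also counts as a separator
records <- paste(first_half, second_half, sep = "  ")

# Runs of 2+ whitespace become ", "; single spaces inside names survive
csv_like <- gsub("\\s{2,}", ", ", records)
print(csv_like)

# colClasses = "character" keeps leading zeros such as "001"
df <- read.table(text = csv_like, sep = ",", strip.white = TRUE,
                 colClasses = "character")
print(df)
```

This depends on fields being separated by at least two whitespace characters while words inside a name are separated by exactly one, which is why the answer above needs additional cleaning passes on the real data.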
