How to read text files and create a data frame in R

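I need to read the txt file at https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt

and convert it into an R data frame with the column names LastName, FirstName, streetno, streetname, city, state, and zip...

I tried to separate the fields with the sep argument, but that failed...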

Try this.

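# Note: scan() splits on whitespace by default, so a field that contains
# spaces (for example "Thomas M.") is read as more than one token.
x <- scan("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt",
  what = list(LastName = "", FirstName = "", streetno = "", streetname = "", city = "", state = "", zip = ""))

data <- as.data.frame(x)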

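Your problem here is not how to read this data with R. It is that the input is not structured well enough: the variable-length fields are not separated by a regular delimiter. In addition, the zip code field contains some letter "O" characters that should be the digit "0".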
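To make both issues concrete, here is a minimal sketch on two made-up lines that imitate the layout of addr.txt. The exact spacing in the real file is an assumption, and the variable names are only illustrative.

```r
# Two made-up sample lines that mimic the layout of addr.txt
# (the exact spacing in the real file is an assumption).
sample_lines <- c(
  "Bania         Thomas M.        725 Commonwealth Ave.         Boston    MA  O2215",
  "Barnaby       David            373 W. Geneva St.             Wms. Bay  WI  53191"
)

# Splitting on any run of whitespace gives a different token count per line,
# so a plain whitespace separator cannot map tokens onto seven columns.
tokens <- strsplit(sample_lines, "\\s+")
token_counts <- lengths(tokens)
print(token_counts)

# Flag lines whose last token starts with the letter "O" instead of a digit.
last_token <- vapply(tokens, function(tk) tk[length(tk)], character(1))
has_letter_o <- grepl("^O", last_token)
print(has_letter_o)
```

The token counts differ between the two lines because names, streets, and cities can each contain spaces of their own.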

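So here is an approach that uses regular-expression substitution to add delimiters and then parses the delimited text with read.csv(). Note that, depending on the anomalies in the full text, you may need to adjust the regular expressions. I go through them one step at a time to make clear what each one does, so that you can adjust them when you find anomalies in the input text. (For example, some city names, such as "Wms. Bay", are two words.)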

addr.txt <- readLines("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt")
# Backslashes must be doubled inside R string literals.
addr.txt <- gsub("\\s+O(\\d{4})", " 0\\1", addr.txt)       # replace O with 0 in zip
addr.txt <- gsub("(\\s+)([A-Z]{2})", ", \\2", addr.txt)    # state
addr.txt <- gsub("\\s+(\\d{5}(\\-\\d{4}){0,1})\\s*", ", \\1", addr.txt) # zip
addr.txt <- gsub("\\s+(\\d{1,4})\\s", ", \\1, ", addr.txt) # streetno
addr.txt <- gsub("(^\\w*)(\\s+)", "\\1, ", addr.txt)       # LastName (FirstName)
addr.txt <- gsub("\\s{2,}", ", ", addr.txt)                # city, by elimination

addr <- read.csv(textConnection(addr.txt), header = FALSE,
                 col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip"),
                 stringsAsFactors = FALSE)
head(addr)
##     LastName   FirstName streetno         streetname      city state    zip
## 1      Bania   Thomas M.      725  Commonwealth Ave.    Boston    MA  02215
## 2    Barnaby       David      373      W. Geneva St.  Wms. Bay    WI  53191
## 3     Bausch        Judy      373      W. Geneva St.  Wms. Bay    WI  53191
## 4    Bolatto     Alberto      725  Commonwealth Ave.    Boston    MA  02215
## 5  Carlstrom        John      933        E. 56th St.   Chicago    IL  60637
## 6 Chamberlin  Richard A.      111         Nowelo St.      Hilo    HI  96720
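
When adjusting the expressions, it can help to try them on a single line without touching the network. The sketch below wraps substitutions that mirror the ones in this answer in a helper and applies it to one made-up line. The name `add_delimiters` and the spacing of the sample line are assumptions, not part of the original answer.

```r
# Hypothetical helper: applies the substitutions in the same order as the answer.
add_delimiters <- function(x) {
  x <- gsub("\\s+O(\\d{4})", " 0\\1", x)                   # letter O -> digit 0 in zip
  x <- gsub("(\\s+)([A-Z]{2})", ", \\2", x)                # state
  x <- gsub("\\s+(\\d{5}(-\\d{4}){0,1})\\s*", ", \\1", x)  # zip, with optional +4 part
  x <- gsub("\\s+(\\d{1,4})\\s", ", \\1, ", x)             # streetno
  x <- gsub("(^\\w*)(\\s+)", "\\1, ", x)                   # LastName
  x <- gsub("\\s{2,}", ", ", x)                            # remaining wide gap before city
  x
}

# Made-up line imitating the file layout (the spacing is an assumption).
sample_line <- "Bania         Thomas M.        725 Commonwealth Ave.         Boston  MA O2215"
fixed <- add_delimiters(sample_line)
print(fixed)

# Each line should end up with exactly seven comma-separated fields.
fields <- strsplit(fixed, ", ", fixed = TRUE)[[1]]
stopifnot(length(fields) == 7)
```

If a line from your own data fails the final check, printing `fixed` after each individual step usually shows which expression needs adjusting.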

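Expanding on my comment, here is another approach. If your full data set has a wider range of patterns to account for, you may need to adjust some of the code.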

library(stringr) # For str_trim 

# Read string data and split into data frame
dat = readLines("addr.txt")
dat = as.data.frame(do.call(rbind, strsplit(dat, split=" {2,10}")), stringsAsFactors=FALSE)
names(dat) = c("LastName", "FirstName", "address", "city", "state", "zip")

# Separate address into number and street (if streetno isn't always numeric,
# or if you don't want it to be numeric, then just remove the as.numeric wrapper).
dat$streetno = as.numeric(gsub("([0-9]{1,4}).*", "\\1", dat$address))
dat$streetname = gsub("[0-9]{1,4} (.*)", "\\1", dat$address)

# Clean up zip
dat$zip = gsub("O","0", dat$zip)
dat$zip = str_trim(dat$zip)

dat = dat[,c(1:2,7:8,4:6)]

dat
      LastName  FirstName streetno           streetname       city state        zip
1        Bania  Thomas M.      725    Commonwealth Ave.     Boston    MA      02215
2      Barnaby      David      373        W. Geneva St.   Wms. Bay    WI      53191
3       Bausch       Judy      373        W. Geneva St.   Wms. Bay    WI      53191
...
41      Wright       Greg      791  Holmdel-Keyport Rd.    Holmdel    NY 07733-1988
42     Zingale    Michael     5640        S. Ellis Ave.    Chicago    IL      60637
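
Because rbind() recycles short rows with only a warning, it may be worth checking that every line splits into the same number of fields before building the data frame. The following is a minimal sketch on made-up lines. The third line is deliberately malformed, and all names and spacing here are assumptions.

```r
# Made-up lines imitating the file; the third has only one space between
# city and state, so it splits into too few fields.
sample_lines <- c(
  "Bania         Thomas M.        725 Commonwealth Ave.         Boston    MA  O2215 ",
  "Barnaby       David            373 W. Geneva St.             Wms. Bay  WI  53191 ",
  "Doe           Jane             1 Main St.                    Springfield IL  62701 "
)

# Trim trailing blanks first, then split on runs of two or more spaces.
fields <- strsplit(trimws(sample_lines), split = " {2,}")
field_counts <- lengths(fields)

# Indices of lines that do not have the expected six fields.
bad <- which(field_counts != 6)
print(bad)

# Build the data frame from the well-formed lines only.
good_fields <- fields[field_counts == 6]
dat_ok <- as.data.frame(do.call(rbind, good_fields), stringsAsFactors = FALSE)
names(dat_ok) <- c("LastName", "FirstName", "address", "city", "state", "zip")
print(dat_ok)
```

Lines reported in `bad` can then be inspected and fixed by hand or with an extra substitution.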

I found it easiest to turn the file into a CSV by adding commas where they belong, and then read it.

## get the page as text
txt <- RCurl::getURL(
    "https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt"
)
## fix the EOL (end-of-line) markers
g1 <- gsub(" \n", "\n", txt, fixed = TRUE)
## read it
df <- read.csv(
    ## add most comma-separators, then the last for the house number
    text = gsub("(\\d+) (\\D+)", "\\1,\\2", gsub("\\s{2,}", ",", g1)),
    header = FALSE,
    ## set the column names
    col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip")
)
## result
head(df)
#     LastName  FirstName streetno        streetname     city state   zip
# 1      Bania  Thomas M.      725 Commonwealth Ave.   Boston    MA O2215
# 2    Barnaby      David      373     W. Geneva St. Wms. Bay    WI 53191
# 3     Bausch       Judy      373     W. Geneva St. Wms. Bay    WI 53191
# 4    Bolatto    Alberto      725 Commonwealth Ave.   Boston    MA O2215
# 5  Carlstrom       John      933       E. 56th St.  Chicago    IL 60637
# 6 Chamberlin Richard A.      111        Nowelo St.     Hilo    HI 96720
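
The zip column in this result still contains the letter "O" (for example O2215). One possible follow-up, sketched here on made-up CSV text rather than the real download, is to read every column as character so that leading zeros survive, and then replace a leading "O". The name `df_chr` and the sample rows are assumptions.

```r
# Made-up CSV text in the shape produced by the comma-fixing step.
csv_text <- paste(
  "Bania,Thomas M.,725,Commonwealth Ave.,Boston,MA,O2215",
  "Barnaby,David,373,W. Geneva St.,Wms. Bay,WI,53191",
  sep = "\n"
)

# Read all columns as character so zip codes keep any leading zeros.
df_chr <- read.csv(
  text = csv_text,
  header = FALSE,
  col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip"),
  colClasses = "character"
)

# Replace a leading letter "O" in the zip code with the digit "0".
df_chr$zip <- sub("^O", "0", df_chr$zip)

# Convert the house number back to an integer once the text is clean.
df_chr$streetno <- as.integer(df_chr$streetno)
print(df_chr)
```

Only a leading "O" is replaced here; if the letter can appear elsewhere in a zip code, gsub("O", "0", ...) on that column would be the broader alternative.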