Importing CSV files with lines broken by CRLF in R

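I'm an urban planner moving into spatial data analysis. I'm no stranger to R and programming in general, but since I have no formal training, my skills are sometimes limited.

I'm currently trying to analyze about 50 CSV files containing financial data on public auctions. The files range from 60,000 to 300,000 rows and have 39 fields. They are exported from the Romanian national public auction system, which is a form-based platform.

The problem is that some rows are broken by a CRLF line ending in the middle of the address field. I suspect this happens when people fill in the address by copying and pasting it from another multi-line document.

Find-and-replace cannot fix this, because it would also replace the correct CRLF at the end of each row.

For example, the data looks like this, with a CRLF after each line (they use ^ as the delimiter):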

Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1
Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; 
Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^

To process the data properly I need to read the CSV like this, removing only the CRLFs that break a record, which Find & Replace cannot do:

Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1 Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^

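I found a workable solution in another post, but it needed some adjustments to fit my needs. The end result is that the code below hangs and never reaches the end of the process, even on a small sample file.

My changes to the accepted solution from that post: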

dat <- readLines("filename.csv") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], "^", fixed = TRUE)) # extract variable names
nvar <- length(varnames)

k <- 1 # setting up a counter
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))

while(k <= length(dat)){
  k <- k + 1
  if(dat[k] == "") {k <- k + 1
  print(paste("data line", k, "is an empty string"))
  if(k > length(dat)) {break}
  }
  temp <- dat[k]
  # checks if there are enough commas or if the line was broken
  while(length(gregexpr("^", temp)[[1]]) < nvar-1){
    k <- k + 1
    temp <- paste0(temp, dat[k])
  }
  temp <- unlist(strsplit(temp, "^"))
  message(k)
  dat1 <- rbind(dat1, temp)
}

dat1 = dat1[-1,] # delete the empty initial row    

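Counting the fields between delimiters seems like a good approach, but I haven't been able to find a good way to do it, and my R programming skills are clearly not up to it.

So is there a way to fix this kind of broken CSV file in R?

A sample of the working files is available here: http://data.gv.ro/dataset/4a4903c4-b1e3-46d1-82a5-238287f9496c/resource/c6abc0ef-3efb-4aef-bc0a-411f8cab2a28/download/contracte-2007.csv

Thanks for your help!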

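We can identify the last line of each record by checking whether it ends with a numeric field. Then, using cumsum, we can label the lines belonging to the same record with 1, 2, 3, ... Finally, we paste the lines of each record together.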

# test data
Lines <- "Name^FiscCode^Country^Adress^SomeData^
SomeCompany^235356^Romania^Adress1
Adress2^ 565863
SomeCompany^235356^Romania^Adress1^ 565863"

# for real problem use readLines("myfile")[-1]
L <- readLines(textConnection(Lines))[-1]

g <- rev(cumsum(rev(grepl("\\^ *\\d+$", L)))) ##
g <- max(g) - g + 1
L2 <- tapply(L, g, paste, collapse = " ")
read.table(text = L2, sep = "^")

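The above works for the data shown in the question, but if the real data differs from what you showed, it may need some modification to account for those differences.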
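To make the rev/cumsum labelling easier to follow, here is a small sketch of each intermediate step for the three test lines. The names ends, step1, step2 and step3 are introduced only for this illustration; ends stands for the result of the grepl call on the line marked ##.

```r
# TRUE marks a line that ends a record (here: lines 2 and 3 of the test data)
ends <- c(FALSE, TRUE, TRUE)

step1 <- rev(ends)            # TRUE TRUE FALSE : look at the lines bottom-up
step2 <- cumsum(step1)        # 1 2 2           : count record ends seen so far
step3 <- rev(step2)           # 2 2 1           : back to top-down order
g <- max(step3) - step3 + 1   # 1 1 2           : flip so labels ascend

print(g)
```

Counting the end markers from the bottom gives every line of a record the same number, because a continuation line has not yet "seen" a new end marker. The last step only reverses the numbering so the first record is labelled 1.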

Note: if there are always four ^ characters in each record, try replacing the line marked ## with:

cnt <- count.fields(textConnection(L), sep = "^") - 1
g <- rev(cumsum(rev(cumsum(cnt) %% 4 == 0)))
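As a sketch of how this replacement fits into the whole pipeline, here is the same test data run end to end with the count-based grouping. It assumes every complete record holds exactly four ^ delimiters; the name n_sep is introduced here for illustration only.

```r
# Test data from above: the first record is broken across two lines
Lines <- "Name^FiscCode^Country^Adress^SomeData^
SomeCompany^235356^Romania^Adress1
Adress2^ 565863
SomeCompany^235356^Romania^Adress1^ 565863"

L <- readLines(textConnection(Lines))[-1]   # drop the header line

n_sep <- 4                                  # assumed delimiters per complete record
# number of delimiters on each physical line: 3 1 4
cnt <- count.fields(textConnection(L), sep = "^") - 1

# a record ends where the running delimiter total is a multiple of n_sep
g <- rev(cumsum(rev(cumsum(cnt) %% n_sep == 0)))
g <- max(g) - g + 1                         # 1 1 2

L2 <- tapply(L, g, paste, collapse = " ")   # rejoin the pieces of each record
DF <- read.table(text = L2, sep = "^")
print(dim(DF))
```

Unlike the regular expression version, this variant does not depend on what the last field looks like, only on the delimiter count being constant.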

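Update: The question has been changed to provide new sample data. Note that the answer as posted works for it, though of course you need to replace 4 with 38, since the new data has 38 delimiters per record whereas the old data had 4. Also, the -1 that dropped the header line in the old example has been removed, so here the header line is read as the first row of the result. This is a self-contained example that can be copied and pasted into R.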

Lines <- "Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1
Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; 
Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^"

L <- readLines(textConnection(Lines))

cnt <- count.fields(textConnection(L), sep = "^") - 1   # 38 4 34 4 34
g <- rev(cumsum(rev(cumsum(cnt) %% 38 == 0)))
g <- max(g) - g + 1   # 1 2 2 3 3
L2 <- tapply(L, g, paste, collapse = " ")
DF <- read.table(text = L2, sep = "^")
dim(DF)
## [1]  3 39
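Since the question involves about 50 files, the steps above could be wrapped in a function. The following is only a sketch under stated assumptions: read_broken_csv is a hypothetical helper name, the first line of each file is assumed to be an intact header, the delimiter is assumed never to occur inside a field, and the number of delimiters per record is taken from the header. It counts delimiters with gregexpr(fixed = TRUE) instead of count.fields, and labels records with a lagged cumsum, which is an alternative formulation of the rev/cumsum grouping used above.

```r
# Hypothetical helper (not from the original answer): rejoin broken records,
# then parse them with the header line as column names.
read_broken_csv <- function(path, sep = "^") {
  L <- readLines(path)
  # delimiters on each physical line (0 when a line has none)
  cnt <- lengths(regmatches(L, gregexpr(sep, L, fixed = TRUE)))
  n_sep <- cnt[1]                        # assumption: the header line is intact
  # a record ends where the running total is a multiple of n_sep
  ends <- cumsum(cnt) %% n_sep == 0
  # each line belongs to record number (completed records before it) + 1
  g <- cumsum(c(TRUE, head(ends, -1)))
  recs <- tapply(L, g, paste, collapse = " ")
  # quote = "" and comment.char = "" treat ', " and # as ordinary data
  read.table(text = recs, sep = sep, header = TRUE, quote = "",
             comment.char = "", stringsAsFactors = FALSE)
}

# Small made-up file to try it on
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name^Code^Country^Address^Value",
             "SomeCompany^235356^Romania^Adress1",
             "Adress2^565863",
             "OtherCompany^111^Romania^Adress3^42"), tmp)
DF <- read_broken_csv(tmp)
print(DF)
```

With real files, something like lapply(list.files(pattern = "\\.csv$"), read_broken_csv) would apply it to every file in the working directory, assuming they all share this layout.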

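The sample data contains no comment characters (#) and no single or double quotes, but if your real data does contain these as part of the data, add comment.char = "", quote = "" to the count.fields and read.table calls.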
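A minimal sketch of those two calls with the extra arguments, using two made-up lines in which an apostrophe and a # are part of the data:

```r
# Made-up lines: the first contains both an apostrophe and a '#'
L <- c("O'Brien & Co #1^123^Romania",
       "Plain Company^456^Romania")

# With quoting and comments disabled, every character is treated as data
cnt <- count.fields(textConnection(L), sep = "^",
                    quote = "", comment.char = "")
DF <- read.table(text = L, sep = "^", quote = "", comment.char = "",
                 stringsAsFactors = FALSE)

print(cnt)
print(DF)
```

Without these arguments, R would treat the apostrophe as the start of a quoted string and the # as the start of a comment, so the field counts for such lines could no longer be trusted.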

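The problem seems to be that ^ is a special character in regular expressions. If you step through your code, you will see that you end up with 627 variables instead of 39, because every character becomes a variable. Try this: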

dat <- readLines("filename.csv") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], "\\^"))  # extract variable names ("\\^" is a literal ^)
nvar <- length(varnames)

k <- 1 # setting up a counter
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))

while(k <= length(dat)){
  k <- k + 1
  #if(dat[k] == "") {k <- k + 1
  #print(paste("data line", k, "is an empty string"))
  if(k > length(dat)) {break}
  #}
  temp <- dat[k]
  # checks if there are enough delimiters or if the line was broken
  while(length(gregexpr("\\^", temp)[[1]]) < nvar-1){
    k <- k + 1
    temp <- paste0(temp, dat[k])
  }
  temp <- unlist(strsplit(temp, "\\^"))
  message(k)
  dat1 <- rbind(dat1, temp)
}

dat1 = dat1[-1,] # delete the empty initial row

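Sorry, I missed the difference between your code and mine. You don't want fixed = TRUE together with the escaped pattern, since that would search for a literal backslash followed by ^. Changing it to the above gives the following: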

> varnames
 [1] "Castigator"                       "CastigatorCUI"                    "CastigatorTara"                  
 [4] "CastigatorLocalitate"             "CastigatorAdresa"                 "Tip"                             
 [7] "TipContract"                      "TipProcedura"                     "AutoritateContractanta"          
[10] "AutoritateContractantaCUI"        "TipAC"                            "TipActivitateAC"                 
[13] "NumarAnuntAtribuire"              "DataAnuntAtribuire"               "TipIncheiereContract"            
[16] "TipCriteriiAtribuire"             "CuLicitatieElectronica"           "NumarOfertePrimite"              
[19] "Subcontractat"                    "NumarContract"                    "DataContract"                    
[22] "TitluContract"                    "Valoare"                          "Moneda"                          
[25] "ValoareRON"                       "ValoareEUR"                       "CPVCodeID"                       
[28] "CPVCode"                          "NumarAnuntParticipare"            "DataAnuntParticipare"            
[31] "ValoareEstimataParticipare"       "MonedaValoareEstimataParticipare" "FonduriComunitare"               
[34] "TipFinantare"                     "TipLegislatieID"                  "FondEuropean"                    
[37] "ContractPeriodic"                 "DepoziteGarantii"                 "ModalitatiFinantare"
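To illustrate why the unescaped pattern fails, here is a small sketch on a made-up string. As a regular expression, ^ is the start-of-string anchor and matches an empty position, so strsplit cuts the string into single characters and gregexpr finds only one match, which is why the inner while loop in the question never sees enough delimiters.

```r
x <- "a^b^c"

# Unescaped: "^" is a regex anchor, so the string is cut character by character
as_regex <- strsplit(x, "^")[[1]]
print(length(as_regex))

# Either treat the pattern as a literal string ...
as_fixed <- strsplit(x, "^", fixed = TRUE)[[1]]
# ... or escape it inside the regular expression
as_escaped <- strsplit(x, "\\^")[[1]]
print(as_fixed)
print(as_escaped)

# Counting delimiters: the anchor matches once, the literal matches twice
n_anchor <- length(gregexpr("^", x)[[1]])
n_literal <- lengths(regmatches(x, gregexpr("^", x, fixed = TRUE)))
print(c(n_anchor, n_literal))
```

Either fix works on its own. Combining fixed = TRUE with the escaped pattern does not, because the pattern is then taken literally, backslash included.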