Reading txt files with no delimiter and irregular width

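I'm trying to read a large number of text files into a data frame. The files contain no delimiter, and their widths are irregular because some rows hold data that applies to the set of rows that follow them.

Here are two examples: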

    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0326806  WASHOUT BASIN                                1.000    L SUM
 1216      Copenhaver Construction, Inc.                                1,000.0000        1,000.00        1,000.00
 1320      D. Construction, Inc.                                        1,500.0000        1,500.00        1,500.00
 3069      K-Five Construction Corporation                              1,000.0000        1,000.00        1,000.00
 3702      Martam Construction Incorporated                             1,500.0000        1,500.00        1,500.00
 4741      Phoenix Corporation of the Quad Cities                       5,000.0000        5,000.00        5,000.00
 4786      Pir Tano Construction Company, Inc.                          1,200.0000        1,200.00        1,200.00
 1560      R. W. Dunteman Company                                         450.0000          450.00          450.00
 5378      Schroeder Asphalt Services, Inc.                             5,100.0000        5,100.00        5,100.00

    X0327036  BIKE PATH REM                              120.000    SQ YD
 1216      Copenhaver Construction, Inc.                                   16.0000        1,920.00        1,920.00
 1320      D. Construction, Inc.                                           20.0000        2,400.00        2,400.00
 3069      K-Five Construction Corporation                                  5.0000          600.00          600.00
 3702      Martam Construction Incorporated                                10.0000        1,200.00        1,200.00
 4741      Phoenix Corporation of the Quad Cities                          14.0000        1,680.00        1,680.00
 4786      Pir Tano Construction Company, Inc.                             32.0000        3,840.00        3,840.00
 1560      R. W. Dunteman Company                                          12.8400        1,540.80        1,540.80
 5378      Schroeder Asphalt Services, Inc.                                18.0000        2,160.00        2,160.00


                           

Here is another file:

    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00

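I'm open to using either R or Python, and I've tried several approaches with base R, readr and data.table, as well as pandas and looping over the lines with open(), with little success. I'm using delimiters incorrectly: my results either parse every space into its own column or give me a single column containing everything on each line.

Is there a clean way to do this? Thanks.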

Try this

import pandas as pd

# Read the whole file and split it on whitespace: one token per row
with open("filename.txt", "r") as f:
    data = f.read()
df = pd.DataFrame({"data": data.split()})
print(df)

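Here is the workflow:

  1. Split the text on runs of blank lines (and process every item in the resulting list except the first, which contains only the headers)
  2. Use pandas read_fwf to read the first line of each chunk (the item data) as a dataframe by specifying the positions of the fixed-width fields
  3. Do the same for the rest of the chunk (the bidder data)
  4. Concatenate the two dataframes and append the result to a list
  5. Concatenate all the dataframes in the list into one df

Code: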

import io
import re
import pandas as pd

data = '''
    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00'''

#with open('filename.txt') as f:
#    data = f.read()

# Split on blank lines; the first chunk holds only the headers, so skip it
tables = [i for i in re.split(r'\n\n+', data)[1:] if i]

dfs = []
for i in tables:
    # The first line of each chunk is the item record
    item_df = pd.read_fwf(io.StringIO(i.splitlines()[0]), names=['ITEM NBR','ITEM DESCRIPTION','QUANTITY','UNIT OF MEASURE'], colspecs=[(0,12),(12,45),(45,64),(64,73)])

    # The remaining lines are the bidder records
    headings = ['BIDR NBR','BIDDER NAME','UNIT PRICE','BIDDER EXTENSION','CALCULATED EXTENSION']
    colspecs = [(1, 11), (11, 64), (64, 82), (82,98), (98, 114)]
    buyers_df = pd.read_fwf(io.StringIO(i), names=headings, colspecs=colspecs, skiprows=1, thousands=',')
    # Put the item fields next to the bidder rows and fill them down
    dfs.append(pd.concat([item_df, buyers_df], axis=1).ffill())

df = pd.concat(dfs)

Output:

   ITEM NBR  ITEM DESCRIPTION       QUANTITY  UNIT OF MEASURE  BIDR NBR  BIDDER NAME                   UNIT PRICE  BIDDER EXTENSION  CALCULATED EXTENSION
0  X0320050  CONSTRUCTN LAYOUT SPL  1         L SUM            2341      Builders Paving, LLC          5000        5000              5000
1  X0320050  CONSTRUCTN LAYOUT SPL  1         L SUM            3020      J. A. Johnson Paving Company  5000        5000              5000
2  X0320050  CONSTRUCTN LAYOUT SPL  1         L SUM            280       Peter Baker & Son Co.         1500        1500              1500
0  X0327611  REM & REIN BRIC PAVER  55        SQ FT            2341      Builders Paving, LLC          20          1100              1100
1  X0327611  REM & REIN BRIC PAVER  55        SQ FT            3020      J. A. Johnson Paving Company  40          2200              2200
2  X0327611  REM & REIN BRIC PAVER  55        SQ FT            280       Peter Baker & Son Co.         20          1100              1100
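
Since the question is about many files, the same steps could be wrapped in a function and applied to each file in turn. The following is only a sketch under some assumptions: parse_bid_text and read_bid_files are hypothetical helper names, the SOURCE FILE column is an addition for illustration, and the column positions are assumed to be identical in every file. It also reads BIDR NBR as text, because the output above shows bidder 0280 turning into 280 when the column is parsed as a number, and it strips trailing whitespace first so that whitespace-only lines are treated as blank lines.

```python
import io
import re
from pathlib import Path

import pandas as pd

# Column names and positions copied from the code above
ITEM_NAMES = ['ITEM NBR', 'ITEM DESCRIPTION', 'QUANTITY', 'UNIT OF MEASURE']
ITEM_COLSPECS = [(0, 12), (12, 45), (45, 64), (64, 73)]
BID_NAMES = ['BIDR NBR', 'BIDDER NAME', 'UNIT PRICE',
             'BIDDER EXTENSION', 'CALCULATED EXTENSION']
BID_COLSPECS = [(1, 11), (11, 64), (64, 82), (82, 98), (98, 114)]


def parse_bid_text(text):
    """Turn the text of one file into a single dataframe (hypothetical helper)."""
    # Strip trailing whitespace so whitespace-only lines count as blank lines
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE).strip('\n')
    # The first block holds the two header lines; the rest are item blocks
    blocks = [b for b in re.split(r'\n\n+', text)[1:] if b]
    dfs = []
    for block in blocks:
        item_df = pd.read_fwf(io.StringIO(block.splitlines()[0]),
                              names=ITEM_NAMES, colspecs=ITEM_COLSPECS)
        bids_df = pd.read_fwf(io.StringIO(block), names=BID_NAMES,
                              colspecs=BID_COLSPECS, skiprows=1, thousands=',',
                              dtype={'BIDR NBR': str})  # keep leading zeros
        dfs.append(pd.concat([item_df, bids_df], axis=1).ffill())
    return pd.concat(dfs, ignore_index=True)


def read_bid_files(paths):
    """Parse every file in `paths` and stack the results (hypothetical helper)."""
    frames = []
    for path in paths:
        df = parse_bid_text(Path(path).read_text())
        df.insert(0, 'SOURCE FILE', Path(path).name)  # assumed extra column
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

It might be called as read_bid_files(Path('bids').glob('*.txt')), where bids is a made-up directory name. If a file deviates from the assumed column positions, the colspecs would need adjusting for that file.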

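Define a function Read that reads one group of lines. It removes empty lines, pastes the first remaining line of the group onto each of the other lines, and parses the result into a data frame. We use the letters vector to define the column names, but you can replace it with whatever you like.

Now read the input file, skipping the first 4 lines, trim the whitespace from the ends of each line, split the lines into groups, apply Read to each group, and then combine the resulting data frames into one overall data frame.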

library(magrittr)  # use pipes
library(readr) # read_lines, read_delim

Read <- function(x) x %>%
  Filter(nzchar, .) %>%
  { paste(.[-1], "", .[1]) } %>%
  gsub("  +", ";", .) %>%
  I %>%
  read_delim(delim = ";", col_names = letters)

DF <- "data.txt" %>%
   read_lines(skip = 4) %>%
   trimws %>%
   by(cumsum(!nzchar(.)), Read) %>%
   do.call("rbind", .)

Note

Create an input file for testing.

Lines <- "    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00"
writeLines(Lines, "data.txt")

Update

Simplified the code.
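
For readers working in Python, the same idea (drop empty lines, glue the item line onto each bidder line, then split on runs of two or more spaces) could be sketched roughly as follows. This is not a translation of the R code above, only an illustration of the technique: read_group and read_groups are hypothetical names, the columns are left as unnamed strings, and it assumes that fields are always separated by at least two spaces and never contain two consecutive spaces themselves.

```python
import re

import pandas as pd


def read_group(lines):
    """Parse one block: the first line is the item, the rest are bidders."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    item, bidders = lines[0], lines[1:]
    # Glue the item line onto each bidder line, then split on 2+ spaces
    rows = [re.split(r' {2,}', f'{bidder}  {item}') for bidder in bidders]
    return pd.DataFrame(rows)


def read_groups(text, skip=4):
    """Split the text into blank-line-separated groups and parse each one."""
    groups, current = [], []
    for line in text.splitlines()[skip:]:  # skip the header lines
        if line.strip():
            current.append(line)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return pd.concat([read_group(g) for g in groups], ignore_index=True)
```

Every value comes back as text, so the numeric columns would still need converting, for example by removing the thousands separators before calling pd.to_numeric.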