Reading txt files with no delimiter and irregular width
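I am trying to read a large number of text files and tidy them into a data frame. The files contain no delimiters and have irregular widths, because some rows hold data that applies to the set of rows that follows them.
Here are 2 examples: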
ITEM NBR ITEM DESCRIPTION UNIT OF UNIT BIDDER CALCULATED BIDR CALC
BIDR NBR BIDDER NAME QUANTITY MEASURE PRICE EXTENSION EXTENSION EXTENSION DIFF
X0326806 WASHOUT BASIN 1.000 L SUM
1216 Copenhaver Construction, Inc. 1,000.0000 1,000.00 1,000.00
1320 D. Construction, Inc. 1,500.0000 1,500.00 1,500.00
3069 K-Five Construction Corporation 1,000.0000 1,000.00 1,000.00
3702 Martam Construction Incorporated 1,500.0000 1,500.00 1,500.00
4741 Phoenix Corporation of the Quad Cities 5,000.0000 5,000.00 5,000.00
4786 Pir Tano Construction Company, Inc. 1,200.0000 1,200.00 1,200.00
1560 R. W. Dunteman Company 450.0000 450.00 450.00
5378 Schroeder Asphalt Services, Inc. 5,100.0000 5,100.00 5,100.00
X0327036 BIKE PATH REM 120.000 SQ YD
1216 Copenhaver Construction, Inc. 16.0000 1,920.00 1,920.00
1320 D. Construction, Inc. 20.0000 2,400.00 2,400.00
3069 K-Five Construction Corporation 5.0000 600.00 600.00
3702 Martam Construction Incorporated 10.0000 1,200.00 1,200.00
4741 Phoenix Corporation of the Quad Cities 14.0000 1,680.00 1,680.00
4786 Pir Tano Construction Company, Inc. 32.0000 3,840.00 3,840.00
1560 R. W. Dunteman Company 12.8400 1,540.80 1,540.80
5378 Schroeder Asphalt Services, Inc. 18.0000 2,160.00 2,160.00
And here is another file:
ITEM NBR ITEM DESCRIPTION UNIT OF UNIT BIDDER CALCULATED BIDR CALC
BIDR NBR BIDDER NAME QUANTITY MEASURE PRICE EXTENSION EXTENSION EXTENSION DIFF
X0320050 CONSTRUCTN LAYOUT SPL 1.000 L SUM
2341 Builders Paving, LLC 5,000.0000 5,000.00 5,000.00
3020 J. A. Johnson Paving Company 5,000.0000 5,000.00 5,000.00
0280 Peter Baker & Son Co. 1,500.0000 1,500.00 1,500.00
X0327611 REM & REIN BRIC PAVER 55.000 SQ FT
2341 Builders Paving, LLC 20.0000 1,100.00 1,100.00
3020 J. A. Johnson Paving Company 40.0000 2,200.00 2,200.00
0280 Peter Baker & Son Co. 20.0000 1,100.00 1,100.00
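I am open to using either R or Python, and I have tried several approaches with base R, readr and data.table, as well as pandas and looping over the lines with open(), with little success. I am using the delimiter incorrectly, because my results either parse every space into its own column or give me a single column containing everything on each line.
Is there a clean way to do this? Thanks.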
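Try this:
import pandas as pd
# Read the whole file into one string.
data=open("filename.txt","r").read()
# split() breaks on any whitespace, so each token becomes one row of a single column.
df = pd.DataFrame({"data": data.split()})
print(df)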
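Here is the workflow:
- Split the text on multiple newlines (and process every item in the resulting list except the first, which contains only the headers)
- Use pandas read_fwf to read the first line (the item data) as a dataframe by specifying the fixed-width fields of the columns
- Do the same for the rest of the text (the bidder data)
- Concatenate the two dataframes and append the result to a list
- Concatenate all dataframes in the list into one df
Code:
import io
import re
import pandas as pd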
data = '''
ITEM NBR ITEM DESCRIPTION UNIT OF UNIT BIDDER CALCULATED BIDR CALC
BIDR NBR BIDDER NAME QUANTITY MEASURE PRICE EXTENSION EXTENSION EXTENSION DIFF
X0320050 CONSTRUCTN LAYOUT SPL 1.000 L SUM
2341 Builders Paving, LLC 5,000.0000 5,000.00 5,000.00
3020 J. A. Johnson Paving Company 5,000.0000 5,000.00 5,000.00
0280 Peter Baker & Son Co. 1,500.0000 1,500.00 1,500.00
X0327611 REM & REIN BRIC PAVER 55.000 SQ FT
2341 Builders Paving, LLC 20.0000 1,100.00 1,100.00
3020 J. A. Johnson Paving Company 40.0000 2,200.00 2,200.00
0280 Peter Baker & Son Co. 20.0000 1,100.00 1,100.00'''
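# To read from a file instead of the inline sample:
#with open('filename.txt') as f:
#    data = f.read()
# The colspecs below assume the text keeps its original fixed-width spacing
# and the blank lines between item groups.
# Split on blank lines; the first chunk holds only the headers, so skip it.
tables = [i for i in re.split(r'\n\n+', data)[1:] if i]
dfs = []
for i in tables:
    # The first line of each chunk is the item record.
    item_df = pd.read_fwf(io.StringIO(i.splitlines()[0]), names=['ITEM NBR','ITEM DESCRIPTION','QUANTITY','UNIT OF MEASURE'], colspecs=[(0,12),(12,45),(45,64),(64,73)])
    headings = ['BIDR NBR','BIDDER NAME','UNIT PRICE','BIDDER EXTENSION','CALCULATED EXTENSION']
    colspecs = [(1, 11), (11, 64), (64, 82), (82,98), (98, 114)]
    # The remaining lines are the bidders; thousands=',' parses values such as 5,000.00.
    buyers_df = pd.read_fwf(io.StringIO(i), names=headings, colspecs=colspecs, skiprows=1, thousands=',')
    # Put the item columns beside the bidder columns and fill the item values down.
    dfs.append(pd.concat([item_df, buyers_df], axis=1).ffill())
df = pd.concat(dfs)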
Output:
| | ITEM NBR | ITEM DESCRIPTION | QUANTITY | UNIT OF MEASURE | BIDR NBR | BIDDER NAME | UNIT PRICE | BIDDER EXTENSION | CALCULATED EXTENSION |
|---|---|---|---|---|---|---|---|---|---|
| 0 | X0320050 | CONSTRUCTN LAYOUT SPL | 1 | L SUM | 2341 | Builders Paving, LLC | 5000 | 5000 | 5000 |
| 1 | X0320050 | CONSTRUCTN LAYOUT SPL | 1 | L SUM | 3020 | J. A. Johnson Paving Company | 5000 | 5000 | 5000 |
| 2 | X0320050 | CONSTRUCTN LAYOUT SPL | 1 | L SUM | 280 | Peter Baker & Son Co. | 1500 | 1500 | 1500 |
| 0 | X0327611 | REM & REIN BRIC PAVER | 55 | SQ FT | 2341 | Builders Paving, LLC | 20 | 1100 | 1100 |
| 1 | X0327611 | REM & REIN BRIC PAVER | 55 | SQ FT | 3020 | J. A. Johnson Paving Company | 40 | 2200 | 2200 |
| 2 | X0327611 | REM & REIN BRIC PAVER | 55 | SQ FT | 280 | Peter Baker & Son Co. | 20 | 1100 | 1100 |
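Two details in the output above may matter downstream: bidder number 0280 comes back as 280 because the column is parsed as an integer, and the row index restarts at 0 for every item. The following is a minimal sketch of one way to handle both, assuming a toy three-column layout; the widths in COLSPECS and the helper fixed_width are made up for this illustration and are not the widths of the real files.

```python
import io
import pandas as pd

NAMES = ["BIDR NBR", "BIDDER NAME", "UNIT PRICE"]
COLSPECS = [(0, 6), (6, 32), (32, 42)]  # assumed widths for this toy sample only


def fixed_width(rows):
    """Build fixed-width text matching COLSPECS (hypothetical helper for the demo)."""
    return "\n".join(f"{nbr:<6}{name:<26}{price:>10}" for nbr, name, price in rows)


group_a = fixed_width([("2341", "Builders Paving, LLC", "5,000.0000"),
                       ("0280", "Peter Baker & Son Co.", "1,500.0000")])
group_b = fixed_width([("0280", "Peter Baker & Son Co.", "20.0000")])

frames = [
    pd.read_fwf(io.StringIO(text), colspecs=COLSPECS, names=NAMES,
                dtype={"BIDR NBR": str},  # keep the leading zero in "0280"
                thousands=",")            # parse "5,000.0000" as a number
    for text in (group_a, group_b)
]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 2 per group.
df = pd.concat(frames, ignore_index=True)
print(df)
```

Whether the bidder number should stay a string depends on how it is used later; keeping it as text is the safer default if it serves as an identifier.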
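Define a function, Read, that reads one group of lines. It removes empty lines, pastes the first remaining line of the group (the item line) onto each of the other lines, and parses the result into a data frame. We use the letters vector to define the column names, but you can replace it with whatever you like.
Now read in the input file, skipping the first 4 lines, trim the whitespace off the ends of each line, split the lines into groups, apply Read to each group, and then combine the individual data frames into one overall data frame.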
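library(magrittr) # use pipes
library(readr) # read_lines, read_delim
Read <- function(x) x %>%
  Filter(nzchar, .) %>% # drop empty lines
  { paste(.[-1], "", .[1]) } %>% # append the item line to each bidder line, two spaces apart
  gsub("  +", ";", .) %>% # runs of two or more spaces become the delimiter
  I %>% # mark the strings as literal data rather than a file path
  read_delim(delim = ";", col_names = letters)
DF <- "data.txt" %>%
  read_lines(skip = 4) %>%
  trimws %>%
  by(cumsum(!nzchar(.)), Read) %>% # a new group starts at each empty line
  do.call("rbind", .)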
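For readers who prefer Python, the same idea as the R code above can be sketched as follows: group the lines on blank lines, treat the first line of each group as the item record, and split every line on runs of two or more spaces. This is an illustration under assumptions, not a drop-in parser: the function names parse_bid_text and split_fields, the column names, and the spacing of the sample text are all chosen for this example.

```python
import re
import pandas as pd

COLUMNS = ["ITEM NBR", "ITEM DESCRIPTION", "QUANTITY", "UNIT OF MEASURE",
           "BIDR NBR", "BIDDER NAME", "UNIT PRICE",
           "BIDDER EXTENSION", "CALCULATED EXTENSION"]
NUMERIC = ["QUANTITY", "UNIT PRICE", "BIDDER EXTENSION", "CALCULATED EXTENSION"]


def split_fields(line):
    """Split one line on runs of two or more whitespace characters."""
    return re.split(r"\s{2,}", line.strip())


def parse_bid_text(text):
    """Return one row per bidder, with the item fields repeated on each row."""
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    rows = []
    for block in blocks[1:]:  # blocks[0] is assumed to hold only the headers
        lines = [ln for ln in block.splitlines() if ln.strip()]
        item = split_fields(lines[0])  # first line of a group is the item record
        rows.extend(item + split_fields(ln) for ln in lines[1:])
    df = pd.DataFrame(rows, columns=COLUMNS)
    for col in NUMERIC:
        # Drop thousands separators before converting to numbers.
        df[col] = pd.to_numeric(df[col].str.replace(",", "", regex=False))
    return df


# Sample with assumed spacing: fields are separated by at least two spaces.
sample = """
ITEM NBR  ITEM DESCRIPTION  QUANTITY  UNIT OF MEASURE
BIDR NBR  BIDDER NAME  UNIT PRICE  BIDDER EXTENSION  CALCULATED EXTENSION

X0320050  CONSTRUCTN LAYOUT SPL  1.000  L SUM
    2341  Builders Paving, LLC  5,000.0000  5,000.00  5,000.00
    0280  Peter Baker & Son Co.  1,500.0000  1,500.00  1,500.00

X0327611  REM & REIN BRIC PAVER  55.000  SQ FT
    2341  Builders Paving, LLC  20.0000  1,100.00  1,100.00
"""

df = parse_bid_text(sample)
print(df)
```

Splitting on two or more spaces only works while every field is present and no field contains a double space; a line with a blank field yields a different number of columns, and building the data frame then raises an error. For files like that, fixed column positions are the more robust choice.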
Note
Create the input file used for testing.
Lines <- " ITEM NBR ITEM DESCRIPTION UNIT OF UNIT BIDDER CALCULATED BIDR CALC
BIDR NBR BIDDER NAME QUANTITY MEASURE PRICE EXTENSION EXTENSION EXTENSION DIFF
X0320050 CONSTRUCTN LAYOUT SPL 1.000 L SUM
2341 Builders Paving, LLC 5,000.0000 5,000.00 5,000.00
3020 J. A. Johnson Paving Company 5,000.0000 5,000.00 5,000.00
0280 Peter Baker & Son Co. 1,500.0000 1,500.00 1,500.00
X0327611 REM & REIN BRIC PAVER 55.000 SQ FT
2341 Builders Paving, LLC 20.0000 1,100.00 1,100.00
3020 J. A. Johnson Paving Company 40.0000 2,200.00 2,200.00
0280 Peter Baker & Son Co. 20.0000 1,100.00 1,100.00"
writeLines(Lines, "data.txt")
Update
Simplified the code.