Combine strings in a list that don't start with a number until the next string that starts with a number?
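I am using tika to parse a PDF file and then writing the output to a txt file. This gives me a large txt file with a lot of blank space between lines. After reading the text file, I get a list like this: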
['1 ARBOR HILLS WATER WORKS, CORP BEQ-5264-16-PW ,700.00 LEW',
'2 COFFEE LAB ROASTERS, INC BEQ-5456-16-AP 0.00 TTN',
'3 HAZEN & SAWYER, P.C BEQ-5277-17-PW ,500.00 HAR',
'4',
'JOHN PUFF',
'CONSTRUCTION',
'COMPANY, INC',
'OEHRC-611-16-PBS ,500.00 ELM',
'5 HORIZON OWNERS, CORP OEHRC-601-17-PBS ,450.00 YON',
'6',
'ONE FRANKLIN',
'OWNERS, CORP. c/o',
'BENCHMARK LM',
'MANAGEMENT, SVCS',
'OEHRC-1204-17-PBS 0.00 WHP',
'7 TACO PROJECTS, INC/ THE TACO PROJECT PHP-6717-16-FSE ,000.00 TTN',
'8 NUTTIN TO IT CATERING, LLC PHP-6801-16-FSE ,000.00 YTN',
'9',
'JET ENTERPRISES',
'M13, LLC/ MOE’S',
'SOUTHWEST GRILL',
'PHP-6811-16-FSE ,250.00 YTN',
'10',
'TWO MEN & A LADY,',
'INC/ GORDON’S DELI',
'CAFÉ',
'PHP-6816-16-FSE ,300.00 CRO',
'11 EJG LEGACY, CORP/ LICENSE 2 GRILL PHP-6827-17-FSE ,450.00 MTP']
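However, I would like to get a list in which every string starts with a number and contains the rest of that record's information, as shown below, but I don't know how to do it.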
['1 ARBOR HILLS WATER WORKS, CORP BEQ-5264-16-PW ,700.00 LEW',
'2 COFFEE LAB ROASTERS, INC BEQ-5456-16-AP 0.00 TTN',
'3 HAZEN & SAWYER, P.C BEQ-5277-17-PW ,500.00 HAR',
'4 JOHN PUFF CONSTRUCTION COMPANY, INC OEHRC-611-16-PBS ,500.00 ELM',
'5 HORIZON OWNERS, CORP OEHRC-601-17-PBS ,450.00 YON',
'6 ONE FRANKLIN OWNERS, CORP. c/o BENCHMARK LM MANAGEMENT, SVCS OEHRC-1204-17-PBS 0.00 WHP',
'7 TACO PROJECTS, INC/ THE TACO PROJECT PHP-6717-16-FSE ,000.00 TTN',
'8 NUTTIN TO IT CATERING, LLC PHP-6801-16-FSE ,000.00 YTN',
'9 JET ENTERPRISES M13, LLC/ MOE’S SOUTHWEST GRILL PHP-6811-16-FSE ,250.00 YTN',
'10 TWO MEN & A LADY INC/ GORDON’S DELI CAFÉ PHP-6816-16-FSE ,300.00 CRO',
'11 EJG LEGACY, CORP/ LICENSE 2 GRILL PHP-6827-17-FSE ,450.00 MTP']
The end goal is to convert this into a dataframe.
id  name                             code              amount   area
1   'ARBOR HILLS WATER WORKS, CORP'  'BEQ-5264-16-PW'  ,700.00  'LEW'
2   'COFFEE LAB ROASTERS, INC'       'BEQ-5456-16-AP'  0.00     'TTN'
3   'HAZEN & SAWYER, P.C'            'BEQ-5277-17-PW'  ,500.00  'HAR'
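I don't think getting it into dataframe format will be too hard, but I don't know how to get the list into the format I need. Thanks.
Well, the basic idea is to create a new list and copy the items across; if an item does not start with a digit, append it to the previous item.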
newlist = []
for row in oldlist:
    # A row that starts with a digit begins a new record.
    if row[0].isdigit():
        newlist.append(row)
    else:
        # Anything else continues the previous record.
        newlist[-1] += ' ' + row
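Building on the loop above, here is a rough sketch of how the merged rows could then be split into the five fields from the question. The function names `merge_rows` and `split_record` are made up for illustration. It assumes that every record ends with exactly three whitespace-separated tokens (code, amount, area), that blank lines can be skipped, and that any text before the first numbered line can be dropped; none of that is guaranteed by the sample, so check it against the real file.

```python
def merge_rows(lines):
    """Join continuation lines onto the preceding line that starts with a digit."""
    merged = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line[0].isdigit():
            merged.append(line)  # start of a new record
        elif merged:
            merged[-1] += ' ' + line  # continuation of the previous record
        # assumption: text before the first numbered line is dropped
    return merged


def split_record(row):
    """Split one merged row into id, name, code, amount and area."""
    row_id, rest = row.split(None, 1)
    # assumption: the last three tokens are always code, amount, area
    name, code, amount, area = rest.rsplit(None, 3)
    return {'id': int(row_id), 'name': name, 'code': code,
            'amount': amount, 'area': area}


sample = [
    '3 HAZEN & SAWYER, P.C BEQ-5277-17-PW ,500.00 HAR',
    '4',
    'JOHN PUFF',
    'CONSTRUCTION',
    'COMPANY, INC',
    'OEHRC-611-16-PBS ,500.00 ELM',
]
records = [split_record(row) for row in merge_rows(sample)]
```

The amount is left as a string here because the values in the sample (such as ',700.00') look truncated and would not convert to a number cleanly. If pandas is installed, a list of dicts like `records` can be passed straight to `pandas.DataFrame(records)`.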