Combine strings in a list that don't start with a number until there is a string with a number?

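I am using tika to parse a PDF file and then converting the output to a txt file. This gives me a large txt file with a lot of blank space between the lines. After reading the text file, I get a list like this: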

 ['1 ARBOR HILLS WATER WORKS, CORP BEQ-5264-16-PW ,700.00 LEW',
 '2 COFFEE LAB ROASTERS, INC BEQ-5456-16-AP 0.00 TTN',
 '3 HAZEN & SAWYER, P.C BEQ-5277-17-PW ,500.00 HAR',
 '4',
 'JOHN PUFF',
 'CONSTRUCTION',
 'COMPANY, INC',
 'OEHRC-611-16-PBS ,500.00 ELM',
 '5 HORIZON OWNERS, CORP OEHRC-601-17-PBS ,450.00 YON',
 '6',
 'ONE FRANKLIN',
 'OWNERS, CORP. c/o',
 'BENCHMARK LM',
 'MANAGEMENT, SVCS',
 'OEHRC-1204-17-PBS 0.00 WHP',
 '7 TACO PROJECTS, INC/ THE TACO PROJECT PHP-6717-16-FSE ,000.00 TTN',
 '8 NUTTIN TO IT CATERING, LLC PHP-6801-16-FSE ,000.00 YTN',
 '9',
 'JET ENTERPRISES',
 'M13, LLC/ MOE’S',
 'SOUTHWEST GRILL',
 'PHP-6811-16-FSE ,250.00 YTN',
 '10',
 'TWO MEN & A LADY,',
 'INC/ GORDON’S DELI',
 'CAFÉ',
 'PHP-6816-16-FSE ,300.00 CRO',
 '11 EJG LEGACY, CORP/ LICENSE 2 GRILL PHP-6827-17-FSE ,450.00 MTP']

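However, I want a list in which every string starts with a number and contains the rest of that record's information, as shown below, and I don't know how to do that.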

['1 ARBOR HILLS WATER WORKS, CORP BEQ-5264-16-PW ,700.00 LEW',
 '2 COFFEE LAB ROASTERS, INC BEQ-5456-16-AP 0.00 TTN',
 '3 HAZEN & SAWYER, P.C BEQ-5277-17-PW ,500.00 HAR',
 '4 JOHN PUFF CONSTRUCTION COMPANY, INC OEHRC-611-16-PBS ,500.00 ELM',
 '5 HORIZON OWNERS, CORP OEHRC-601-17-PBS ,450.00 YON',
 '6 ONE FRANKLIN OWNERS, CORP. c/o BENCHMARK LM MANAGEMENT, SVCS OEHRC-1204-17-PBS 0.00 WHP',
 '7 TACO PROJECTS, INC/ THE TACO PROJECT PHP-6717-16-FSE ,000.00 TTN',
 '8 NUTTIN TO IT CATERING, LLC PHP-6801-16-FSE ,000.00 YTN',
 '9 JET ENTERPRISES M13, LLC/ MOE’S SOUTHWEST GRILL PHP-6811-16-FSE ,250.00 YTN',
 '10 TWO MEN & A LADY, INC/ GORDON’S DELI CAFÉ PHP-6816-16-FSE ,300.00 CRO',
 '11 EJG LEGACY, CORP/ LICENSE 2 GRILL PHP-6827-17-FSE ,450.00 MTP']

The end goal is to convert this into a data frame.

id         name                         code             amount       area
1   'ARBOR HILLS WATER WORKS, CORP'  'BEQ-5264-16-PW'   ,700.00     'LEW'
2   'COFFEE LAB ROASTERS, INC'       'BEQ-5456-16-AP'   0.00       'TTN'
3   'HAZEN & SAWYER, P.C'            'BEQ-5277-17-PW'   ,500.00     'HAR'

I think getting it into data frame format won't be too hard, but I don't know how to get the list into the format I need. Thanks.

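Well, the basic idea is to build a new list by copying items over: if an item starts with a digit, it begins a new entry; otherwise, it is appended to the previous entry.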

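newlist = []
for row in oldlist:
    row = row.strip()
    if not row:
        continue  # skip blank lines left over from the text extraction
    if row[0].isdigit() or not newlist:
        newlist.append(row)  # a leading digit starts a new record
    else:
        newlist[-1] += ' ' + row  # continuation: join onto the previous record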
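
For the stated end goal of a data frame, here is a minimal sketch of splitting each merged row into the five columns. It assumes every row starts with the id and ends with exactly three space-free tokens (code, amount, area), which holds for the sample shown but may not hold for the whole file. The helper name `split_row` is hypothetical, and the amount is kept as a string because the sample values look truncated.

```python
def split_row(row):
    """Split one merged row into id, name, code, amount and area."""
    # The id is the first token; everything after it is the rest of the record.
    id_part, rest = row.split(' ', 1)
    # Assumption: the last three tokens are code, amount and area,
    # so whatever precedes them is the (possibly multi-word) name.
    name, code, amount, area = rest.rsplit(' ', 3)
    return {'id': int(id_part), 'name': name, 'code': code,
            'amount': amount, 'area': area}


merged = [
    '3 HAZEN & SAWYER, P.C BEQ-5277-17-PW ,500.00 HAR',
    '4 JOHN PUFF CONSTRUCTION COMPANY, INC OEHRC-611-16-PBS ,500.00 ELM',
]
records = [split_row(row) for row in merged]
```

A list of dicts like `records` can then be passed to `pandas.DataFrame(records)` to get one column per key.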