Remove extra commas, spaces and row offsets in a .csv file using Python

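I have a 5-page PDF file, and each page contains a table that I need to extract. I need to extract all the tables from every page and save them as a dataframe using Python, so I used tabula to convert the file to a csv file: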

import tabula

tabula.convert_into('input.pdf', 'output.csv', output_format='csv', pages='all')

The main problem with the file output.csv is that it contains several extra commas.

Example:

Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
182,Julia,,,,18,8.390
183,Marian,,,,21,9.170
184,Julien,,0,175,60,9.095
Id,Name,Age,,Score,Rang,Bonus
215,Asma,26,,35,19,3.807
216,Juan,,,,20,7.982
217,Rami,,,,10,1.832
Id,Name,Age,,Score,Rang,Bonus
415,Jessica,,4 920,8 873,538,7.994
416,Karen,,890,6,12,9.993
417,Andrea,,0,69,283,7.200
Id,Name,Age,,Score,Rang,Bonus
419,Rym,10,,18,,10,7.196
420,Noor,10,,70,,910,8.291
421,Nathalie,0,,5,,0,0.900
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
457,,Loula,,0,18,11,9.990
458,,Maria,,0,15,172,6.425
459,,Carl,,15,17,11,3.349
Id,Name,Age,,Score,Rang,Bonus
566,Diego,,,,0,3.680
567,Carla,0,,26,1,19.361

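When I load the csv file into rows and columns, some of the rows are shifted.

See the image below to understand the problem: as you can see in the image, some rows are offset (the table from each page of the file has its own row offset). How can I fix this?

Note: the dataframe should have 6 columns, some of them with empty fields. I guess the extra commas come from spaces in the PDF file. How can I remove the extra commas from the csv file, or remove the extra spaces in the PDF file?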
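To make the offsets visible without opening the file in a spreadsheet, a small diagnostic along these lines may help. It is only a sketch: `sample` is a hypothetical stand-in holding a few rows copied from the example above, and in practice one would open output.csv instead.

```python
import csv
from collections import Counter
from io import StringIO

# A few rows copied from the example above (hypothetical stand-in for output.csv).
sample = StringIO(
    "Id,Name,Age,,Score,Rang,Bonus\n"
    "181,ALEX,,,,20,987\n"
    "419,Rym,10,,18,,10,7.196\n"
    "456,,Joe,,10,13,0,74.917\n"
)

rows = list(csv.reader(sample))

# How many rows have how many fields: anything above 6 means extra commas.
field_counts = Counter(len(row) for row in rows)
print(field_counts)

# Index of every empty field per row, to see where each row is shifted.
empty_positions = [[i for i, v in enumerate(row) if v == ""] for row in rows]
print(empty_positions)
```

Rows with 7 or 8 fields, and empty fields at different positions from one page to the next, are what show up as shifted rows once the file is loaded.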

The expected output is shown in the image below:

Thank you very much for your help.

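After loading the CSV content into a dataframe, drop the third column and you will get the data in the required format.

Note: I have not added any column names here. You can add them later, after dropping the column.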

import pandas as pd

l = ['181,ALEX,,,,20,987', '182,Julia,,18,79,98,8.390', '183,Marian,,21,89,70,9.170', '184,Julien,,,,60,9.095']

df = pd.DataFrame([sub.split(",") for sub in l])

df.drop(2, inplace=True, axis=1)
print(df)

Output:

     0       1   3   4   5      6
0  181    ALEX          20    987
1  182   Julia  18  79  98  8.390
2  183  Marian  21  89  70  9.170
3  184  Julien          60  9.095
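As a follow-up to the note about column names, here is a minimal sketch of assigning them after the drop. The names are an assumption taken from the header line of the CSV in the question, and `rows` is a hypothetical two-row sample.

```python
import pandas as pd

# Two sample rows in the same shape as the list used above (hypothetical data).
rows = ['181,ALEX,,,,20,987', '182,Julia,,18,79,98,8.390']

df = pd.DataFrame([r.split(",") for r in rows])
df = df.drop(columns=2)  # drop the third column, as in the answer above

# Assumed names: the non-empty entries of the header line in the question's CSV.
df.columns = ["Id", "Name", "Age", "Score", "Rang", "Bonus"]
print(df)
```

Assigning `df.columns` requires exactly one name per remaining column, so this only works once the frame has been reduced to 6 columns.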

The following approach may work, but it is not ideal:

import pandas as pd
import csv

data = []

with open('output.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = [v for v in next(csv_input) if v]      # Remove empty column names
    
    for row in csv_input:
        empty = row.index('')
        row = [v.replace(' ', '') for v in row if v]
        
        if row[0] != 'Id':
            row = row[:empty] + ['' for _ in range(6 - len(row))] + row[empty:]
            data.append(row)
        
df = pd.DataFrame(data, columns=header)
print(df)

This gives you:

     Id      Name   Age Score Rang   Bonus
0   181      ALEX               20     987
1   182     Julia               18   8.390
2   183    Marian               21   9.170
3   184    Julien     0   175   60   9.095
4   215      Asma    26    35   19   3.807
5   216      Juan               20   7.982
6   217      Rami               10   1.832
7   415   Jessica  4920  8873  538   7.994
8   416     Karen   890     6   12   9.993
9   417    Andrea     0    69  283   7.200
10  419       Rym    10    18   10   7.196
11  420      Noor    10    70  910   8.291
12  421  Nathalie     0     5    0   0.900
13  456       Joe    10    13    0  74.917
14  457     Loula     0    18   11   9.990
15  458     Maria     0    15  172   6.425
16  459      Carl    15    17   11   3.349
17  566     Diego                0   3.680
18  567     Carla     0    26    1  19.361

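It works by removing all blank entries and then padding the row back to 6 values, with the padding inserted at the position of the first blank entry. Because the Age column appears to be optional, it may not be 100% reliable.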
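To make the padding step concrete, here is the same logic isolated in a small function and applied to single rows. `pad_row` is a hypothetical helper name, and the third row is made up to show one way the approach can go wrong.

```python
def pad_row(row, width=6):
    """Drop blank fields, then re-insert blanks where the first blank was."""
    first_blank = row.index('')        # raises ValueError if the row has no blank
    values = [v for v in row if v]     # keep only the non-empty fields
    padding = [''] * (width - len(values))
    return values[:first_blank] + padding + values[first_blank:]


# Age and Score missing: two blanks are put back after Name.
row_a = pad_row(['181', 'ALEX', '', '', '', '20', '987'])
print(row_a)

# Nothing missing: the extra comma is simply dropped.
row_b = pad_row(['215', 'Asma', '26', '', '35', '19', '3.807'])
print(row_b)

# Made-up row: a stray comma before Name AND a missing Age.
# The padding lands in the Name column, so the row is still shifted.
row_c = pad_row(['456', '', 'Joe', '', '', '13', '0', '74.917'])
print(row_c)
```

The last case is the kind of input behind the reliability caveat: the first blank is not always where the missing value belongs.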

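I find this approach easier to understand.

It is a generator that yields rows with the same length as the cleaned-up first row. It removes the first empty string from a row, repeatedly, until the row has the correct length.

Like Martin's answer, it produces the dataframe you expect from your sample data.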

import pandas as pd
from io import StringIO
import csv

f = StringIO("""Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
182,Julia,,,,18,8.390
183,Marian,,,,21,9.170
184,Julien,,0,175,60,9.095
Id,Name,Age,,Score,Rang,Bonus
215,Asma,26,,35,19,3.807
216,Juan,,,,20,7.982
217,Rami,,,,10,1.832
Id,Name,Age,,Score,Rang,Bonus
415,Jessica,,4 920,8 873,538,7.994
416,Karen,,890,6,12,9.993
417,Andrea,,0,69,283,7.200
Id,Name,Age,,Score,Rang,Bonus
419,Rym,10,,18,,10,7.196
420,Noor,10,,70,,910,8.291
421,Nathalie,0,,5,,0,0.900
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
457,,Loula,,0,18,11,9.990
458,,Maria,,0,15,172,6.425
459,,Carl,,15,17,11,3.349
Id,Name,Age,,Score,Rang,Bonus
566,Diego,,,,0,3.680
567,Carla,0,,26,1,19.361""")


def clean_up(csv_file):
    header = None
    for line in csv_file:
        if not header:
            header = [v for v in line if v]
            length = len(header)
            continue
        while len(line) > length:
            line.remove('')
        if line != header:
            yield(dict(zip(header,line)))

df = pd.DataFrame(clean_up(csv.reader(f)))
print(df)

This gives you:

     Id      Name    Age  Score Rang   Bonus
0   181      ALEX                 20     987
1   182     Julia                 18   8.390
2   183    Marian                 21   9.170
3   184    Julien      0    175   60   9.095
4   215      Asma     26     35   19   3.807
5   216      Juan                 20   7.982
6   217      Rami                 10   1.832
7   415   Jessica  4 920  8 873  538   7.994
8   416     Karen    890      6   12   9.993
9   417    Andrea      0     69  283   7.200
10  419       Rym     10     18   10   7.196
11  420      Noor     10     70  910   8.291
12  421  Nathalie      0      5    0   0.900
13  456       Joe     10     13    0  74.917
14  457     Loula      0     18   11   9.990
15  458     Maria      0     15  172   6.425
16  459      Carl     15     17   11   3.349
17  566     Diego                  0   3.680
18  567     Carla      0     26    1  19.361
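The output above still contains values such as `4 920` with an embedded space, and every field is a string. If numeric values are wanted, one possible follow-up is sketched below in plain Python. `to_number` is a hypothetical helper, and treating every column except Name as numeric is an assumption based on the sample data.

```python
def to_number(text):
    """Turn '4 920' into 4920 and '7.994' into 7.994; a blank field becomes None."""
    cleaned = text.replace(' ', '')
    if not cleaned:
        return None
    return float(cleaned) if '.' in cleaned else int(cleaned)


# One record as yielded by the generator above (row 7 of the output).
record = {'Id': '415', 'Name': 'Jessica', 'Age': '4 920',
          'Score': '8 873', 'Rang': '538', 'Bonus': '7.994'}

# Assumption: every column except Name holds numbers.
numeric = {k: (v if k == 'Name' else to_number(v)) for k, v in record.items()}
print(numeric)
```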

My strategy is based on a short regular expression that captures the first two columns and the numbers at the end.

(\d+,[^,]+,) → numbers + comma + anything but comma + comma
,*           → zero or more commas
(\d.+)       → the rest of the line starting from the first number

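Then I join the two groups, inserting enough commas in between to bring the total to 5 (= 6 columns).

This seems like a very simple approach to me. It works for any variant of the input with randomly inserted spaces and commas, as long as the numeric data is right-aligned.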
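As a worked example of the comma arithmetic, the sketch below applies the same pattern to single lines that have already had their spaces and doubled commas removed. `realign` is a hypothetical name used only for this illustration.

```python
import re

pattern = re.compile(r'(\d+,[^,]+,),*(\d.+)')


def realign(line):
    """Return the line padded to 6 columns, or None if it is not a data row."""
    match = pattern.match(line)
    if match is None:
        return None
    head, tail = match.groups()
    # 5 commas in total: 2 are already in head, the rest are in tail or missing.
    missing = 5 - 2 - tail.count(',')
    return head + ',' * missing + tail


# All four numbers present: tail has 3 commas, nothing to add.
full = realign('215,Asma,26,35,19,3.807')
print(full)

# Only two numbers present: tail has 1 comma, so 2 commas are inserted.
short = realign('181,ALEX,,20,987')
print(short)

# Header lines do not start with digits, so they do not match.
header = realign('Id,Name,Age,Score,Rang,Bonus')
print(header)
```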

import io
import re

import pandas as pd

def fix_line(line):
    # remove spaces and collapse each pair of commas into one
    line = re.sub(',,', ',', line.replace(' ', ''))
    # groups: first two columns / middle commas (not captured) / numbers
    match = re.match(r'(\d+,[^,]+,),*(\d.+)', line)
    if not match:  # header lines do not match and are dropped
        return ''
    head, tail = match.groups()
    # right-align the numbers: 6 columns = 5 commas, 2 of them already in head
    return head + ',' * (5 - 2 - tail.count(',')) + tail


data_corr = [fix_line(line) for line in lines]

df = pd.read_csv(io.StringIO('\n'.join(data_corr)),
                 # column names from the first header line, empty names dropped
                 names=re.sub(',,+', ',', lines[0]).split(','))

This assumes the following input in the variable lines:

['Id,Name,Age,,Score,Rang,Bonus',
 '181,ALEX,,,,20,987',
 '182,Julia,,,,18,8.390',
 '183,Marian,,,,21,9.170',
 '184,Julien,,0,175,60,9.095',
 'Id,Name,Age,,Score,Rang,Bonus',
 '215,Asma,26,,35,19,3.807',
 '216,Juan,,,,20,7.982',
 '217,Rami,,,,10,1.832',
 'Id,Name,Age,,Score,Rang,Bonus',
 '415,Jessica,,4 920,8 873,538,7.994',
 '416,Karen,,890,6,12,9.993',
 '417,Andrea,,0,69,283,7.200',
 'Id,Name,Age,,Score,Rang,Bonus',
 '419,Rym,10,,18,,10,7.196',
 '420,Noor,10,,70,,910,8.291',
 '421,Nathalie,0,,5,,0,0.900',
 '"",Id,Name,Age,,Score,Rang,Bonus',
 '456,,Joe,,10,13,0,74.917',
 '457,,Loula,,0,18,11,9.990',
 '458,,Maria,,0,15,172,6.425',
 '459,,Carl,,15,17,11,3.349',
 'Id,Name,Age,,Score,Rang,Bonus',
 '566,Diego,,,,0,3.680',
 '567,Carla,0,,26,1,19.361']

Output:

     Id      Name     Age   Score  Rang    Bonus
0   181      ALEX     NaN     NaN    20  987.000
1   182     Julia     NaN     NaN    18    8.390
2   183    Marian     NaN     NaN    21    9.170
3   184    Julien     0.0   175.0    60    9.095
4   215      Asma    26.0    35.0    19    3.807
5   216      Juan     NaN     NaN    20    7.982
6   217      Rami     NaN     NaN    10    1.832
7   415   Jessica  4920.0  8873.0   538    7.994
8   416     Karen   890.0     6.0    12    9.993
9   417    Andrea     0.0    69.0   283    7.200
10  419       Rym    10.0    18.0    10    7.196
11  420      Noor    10.0    70.0   910    8.291
12  421  Nathalie     0.0     5.0     0    0.900
13  456       Joe    10.0    13.0     0   74.917
14  457     Loula     0.0    18.0    11    9.990
15  458     Maria     0.0    15.0   172    6.425
16  459      Carl    15.0    17.0    11    3.349
17  566     Diego     NaN     NaN     0    3.680
18  567     Carla     0.0    26.0     1   19.361

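Note: if the input is a file, read the lines first. Use splitlines() rather than readlines() so that the trailing newline does not end up in the last column name:

with open('/path/to/file', 'r') as f:
    lines = f.read().splitlines()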