Remove extra commas, spaces and row offsets in a .csv file using Python
I have a 5-page PDF file, and each page contains a table that I need to extract. I need to extract all the tables from every page and save them as a dataframe using Python, so I used tabula to convert the file to a CSV file:
tabula.convert_into('input.pdf', "output.csv", output_format="csv", pages='all')
The main problem with output.csv is that it contains several extra commas.
Example:
Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
182,Julia,,,,18,8.390
183,Marian,,,,21,9.170
184,Julien,,0,175,60,9.095
Id,Name,Age,,Score,Rang,Bonus
215,Asma,26,,35,19,3.807
216,Juan,,,,20,7.982
217,Rami,,,,10,1.832
Id,Name,Age,,Score,Rang,Bonus
415,Jessica,,4 920,8 873,538,7.994
416,Karen,,890,6,12,9.993
417,Andrea,,0,69,283,7.200
Id,Name,Age,,Score,Rang,Bonus
419,Rym,10,,18,,10,7.196
420,Noor,10,,70,,910,8.291
421,Nathalie,0,,5,,0,0.900
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
457,,Loula,,0,18,11,9.990
458,,Maria,,0,15,172,6.425
459,,Carl,,15,17,11,3.349
Id,Name,Age,,Score,Rang,Bonus
566,Diego,,,,0,3.680
567,Carla,0,,26,1,19.361
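When I load the CSV file into rows and columns, some rows are shifted:
[image: the CSV shown as rows and columns, with shifted rows]
As the image shows, the table from each page of the file has its own row offset. How can I fix this?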
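One way to make the offset visible without the image is to count the fields in each row. This is only a minimal sketch: a few of the sample lines are inlined as a string instead of reading output.csv, and the names `sample` and `field_counts` are illustrative.

```python
import csv
from io import StringIO

# A few lines copied from the sample above (inlined instead of reading output.csv)
sample = StringIO('''Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
419,Rym,10,,18,,10,7.196
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
''')

# Number of fields per row; a clean 6-column table would give 6 everywhere
field_counts = [len(row) for row in csv.reader(sample)]
print(field_counts)  # [7, 7, 8, 8, 8]
```

Rows with 7 or 8 fields are the ones that end up shifted when the file is read as a 6-column table.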
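Note:
The dataframe should have 6 columns, with empty fields where a value is missing.
I guess the extra commas come from spaces in the PDF file. How can I remove the extra commas from the CSV file, or the extra spaces in the PDF file?
Expected output:
[image: expected output table]
Thanks a lot for your help.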
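After loading the CSV contents into a dataframe, drop the third column and you will get the data in the desired format.
Note: I have not added any column names here. You can add them after dropping the column.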
import pandas as pd
l = ['181,ALEX,,,,20,987', '182,Julia,,18,79,98,8.390', '183,Marian,,21,89,70,9.170', '184,Julien,,,,60,9.095']
df = pd.DataFrame([sub.split(",") for sub in l])
df.drop(2, inplace=True, axis=1)
print(df)
Output:
     0       1   3   4   5      6
0  181    ALEX          20    987
1  182   Julia  18  79  98  8.390
2  183  Marian  21  89  70  9.170
3  184  Julien          60  9.095
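Dropping a fixed position works for this short list because column 2 is blank in every row. A hedged variant, not part of the answer above, is to look for columns that are blank in every row and drop those instead. The names `rows` and `empty_cols` are illustrative.

```python
import pandas as pd

rows = ['181,ALEX,,,,20,987', '182,Julia,,18,79,98,8.390',
        '183,Marian,,21,89,70,9.170', '184,Julien,,,,60,9.095']
df = pd.DataFrame([r.split(',') for r in rows])

# Columns whose value is an empty string in every row
empty_cols = [c for c in df.columns if (df[c] == '').all()]
df = df.drop(columns=empty_cols)

print(empty_cols)        # [2]
print(list(df.columns))  # [0, 1, 3, 4, 5, 6]
```

This still assumes every row has the same number of fields, which does not hold for the full output.csv in the question.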
The following approach might work, but it is not ideal:
import pandas as pd
import csv
data = []
with open('output.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = [v for v in next(csv_input) if v]  # Remove empty column names
    for row in csv_input:
        empty = row.index('')  # Position of the first blank field
        row = [v.replace(' ', '') for v in row if v]  # Drop blanks and embedded spaces
        if row[0] != 'Id':  # Skip the repeated header lines
            row = row[:empty] + ['' for _ in range(6 - len(row))] + row[empty:]
            data.append(row)
df = pd.DataFrame(data, columns=header)
print(df)
This gives you:
     Id      Name   Age  Score  Rang   Bonus
 0  181      ALEX                 20     987
 1  182     Julia                 18   8.390
 2  183    Marian                 21   9.170
 3  184    Julien     0    175    60   9.095
 4  215      Asma    26     35    19   3.807
 5  216      Juan                 20   7.982
 6  217      Rami                 10   1.832
 7  415   Jessica  4920   8873   538   7.994
 8  416     Karen   890      6    12   9.993
 9  417    Andrea     0     69   283   7.200
10  419       Rym    10     18    10   7.196
11  420      Noor    10     70   910   8.291
12  421  Nathalie     0      5     0   0.900
13  456       Joe    10     13     0  74.917
14  457     Loula     0     18    11   9.990
15  458     Maria     0     15   172   6.425
16  459      Carl    15     17    11   3.349
17  566     Diego                  0   3.680
18  567     Carla     0     26     1  19.361
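It works by removing all blank entries and then padding the remaining entries, starting at the position of the first blank entry, back up to 6 values. Since the Age column appears to be optional, it might not be 100% reliable.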
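The padding step can be traced on individual rows. The sketch below isolates it in a helper; the name `pad_row` and its `width` parameter are illustrative and do not appear in the answer's code.

```python
def pad_row(row, width=6):
    """Drop blank fields, then re-insert blanks where the first blank was."""
    empty = row.index('')                   # position of the first blank field
    values = [v for v in row if v]          # all blanks removed
    padding = [''] * (width - len(values))  # blanks needed to get back to `width` fields
    return values[:empty] + padding + values[empty:]

print(pad_row(['181', 'ALEX', '', '', '', '20', '987']))
# ['181', 'ALEX', '', '', '20', '987']
print(pad_row(['419', 'Rym', '10', '', '18', '', '10', '7.196']))
# ['419', 'Rym', '10', '18', '10', '7.196']
```

The padding always lands at the position of the first blank, which is the assumption that makes the approach less than fully reliable.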
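I find this approach easier to understand.
It is a generator that yields rows with the same length as the cleaned-up first row. It removes the first empty string from a row, repeatedly, until the row has the correct length.
Like Martin's answer, it produces your expected dataframe from your sample data.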
import pandas as pd
from io import StringIO
import csv
f = StringIO("""Id,Name,Age,,Score,Rang,Bonus
181,ALEX,,,,20,987
182,Julia,,,,18,8.390
183,Marian,,,,21,9.170
184,Julien,,0,175,60,9.095
Id,Name,Age,,Score,Rang,Bonus
215,Asma,26,,35,19,3.807
216,Juan,,,,20,7.982
217,Rami,,,,10,1.832
Id,Name,Age,,Score,Rang,Bonus
415,Jessica,,4 920,8 873,538,7.994
416,Karen,,890,6,12,9.993
417,Andrea,,0,69,283,7.200
Id,Name,Age,,Score,Rang,Bonus
419,Rym,10,,18,,10,7.196
420,Noor,10,,70,,910,8.291
421,Nathalie,0,,5,,0,0.900
"",Id,Name,Age,,Score,Rang,Bonus
456,,Joe,,10,13,0,74.917
457,,Loula,,0,18,11,9.990
458,,Maria,,0,15,172,6.425
459,,Carl,,15,17,11,3.349
Id,Name,Age,,Score,Rang,Bonus
566,Diego,,,,0,3.680
567,Carla,0,,26,1,19.361""")
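def clean_up(csv_file):
    header = None
    for line in csv_file:
        if not header:
            # The first line, with empty names removed, defines the columns
            header = [v for v in line if v]
            length = len(header)
            continue
        # Remove leading blanks one at a time until the row fits the header
        while len(line) > length:
            line.remove('')
        if line != header:  # Skip the repeated header lines
            yield dict(zip(header, line))
df = pd.DataFrame(clean_up(csv.reader(f)))
print(df)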
This gives you:
     Id      Name    Age  Score  Rang   Bonus
 0  181      ALEX                  20     987
 1  182     Julia                  18   8.390
 2  183    Marian                  21   9.170
 3  184    Julien      0    175    60   9.095
 4  215      Asma     26     35    19   3.807
 5  216      Juan                  20   7.982
 6  217      Rami                  10   1.832
 7  415   Jessica  4 920  8 873   538   7.994
 8  416     Karen    890      6    12   9.993
 9  417    Andrea      0     69   283   7.200
10  419       Rym     10     18    10   7.196
11  420      Noor     10     70   910   8.291
12  421  Nathalie      0      5     0   0.900
13  456       Joe     10     13     0  74.917
14  457     Loula      0     18    11   9.990
15  458     Maria      0     15   172   6.425
16  459      Carl     15     17    11   3.349
17  566     Diego                   0   3.680
18  567     Carla      0     26     1  19.361
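In this output the values are still strings, and entries such as "4 920" keep their embedded space. A possible follow-up step, sketched here on a small hand-built frame rather than the full df, strips the spaces and converts the numeric columns. The list `numeric_cols` is an assumption about which columns should be numeric.

```python
import pandas as pd

# Two rows in the same shape as the generator's output (all values are strings)
df = pd.DataFrame({
    'Id': ['181', '415'],
    'Name': ['ALEX', 'Jessica'],
    'Age': ['', '4 920'],
    'Score': ['', '8 873'],
    'Rang': ['20', '538'],
    'Bonus': ['987', '7.994'],
})

numeric_cols = ['Id', 'Age', 'Score', 'Rang', 'Bonus']
for col in numeric_cols:
    # Drop embedded spaces, then convert; empty strings become NaN
    cleaned = df[col].str.replace(' ', '', regex=False)
    df[col] = pd.to_numeric(cleaned, errors='coerce')

print(df)
```

Columns with missing values come out as floats, because NaN cannot be stored in a plain integer column.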
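My strategy is based on a short regular expression that captures the first two columns and the numbers at the end of the line.
(\d+,[^,]+,) → digits + comma + anything but a comma + comma
,* → zero or more commas
(\d.+) → the rest of the line, starting from the first digit
I then join the two groups, inserting enough commas in between to bring the total to 5 (= 6 columns).
This seems like a very simple approach to me. It works for any input variant with randomly inserted spaces and commas, as long as the numeric data is right-aligned.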
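To see what the two groups capture, the pattern can be tried on a single line. This is a small illustration only; the names `pattern`, `head`, `tail` and `missing` are introduced here for readability.

```python
import re

pattern = re.compile(r'(\d+,[^,]+,),*(\d.+)')

# A data line after spaces are removed and doubled commas are collapsed
m = pattern.match('181,ALEX,,20,987')
head, tail = m.groups()
print(head, tail)   # 181,ALEX, 20,987

# Commas needed so that the whole line has 5 of them (6 columns)
missing = 5 - head.count(',') - tail.count(',')
fixed = head + ',' * missing + tail
print(fixed)        # 181,ALEX,,,20,987

# Header lines do not start with digits, so they do not match
print(pattern.match('Id,Name,Age,Score,Rang,Bonus'))  # None
```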
import io
import re
import pandas as pd
def fix_line(line):
    # remove spaces and collapse doubled commas
    line = re.sub(',,', ',', line.replace(' ', ''))
    # groups: first two columns / middle commas (not captured) / numbers
    match = re.match(r'(\d+,[^,]+,),*(\d.+)', line)
    if not match:  # drops the header lines
        return ''
    # align numbers to the right: 6 columns = 5 commas
    return match.groups()[0] + (',' * (5 - 2 - match.groups()[1].count(','))) + match.groups()[1]
data_corr = [fix_line(line) for line in lines]
# the blank strings returned for header lines are skipped by read_csv
df = pd.read_csv(io.StringIO('\n'.join(data_corr)),
                 names=re.sub(',,+', ',', lines[0]).split(','))  # assign column names
This assumes the following input in the variable lines:
['Id,Name,Age,,Score,Rang,Bonus',
'181,ALEX,,,,20,987',
'182,Julia,,,,18,8.390',
'183,Marian,,,,21,9.170',
'184,Julien,,0,175,60,9.095',
'Id,Name,Age,,Score,Rang,Bonus',
'215,Asma,26,,35,19,3.807',
'216,Juan,,,,20,7.982',
'217,Rami,,,,10,1.832',
'Id,Name,Age,,Score,Rang,Bonus',
'415,Jessica,,4 920,8 873,538,7.994',
'416,Karen,,890,6,12,9.993',
'417,Andrea,,0,69,283,7.200',
'Id,Name,Age,,Score,Rang,Bonus',
'419,Rym,10,,18,,10,7.196',
'420,Noor,10,,70,,910,8.291',
'421,Nathalie,0,,5,,0,0.900',
'"",Id,Name,Age,,Score,Rang,Bonus',
'456,,Joe,,10,13,0,74.917',
'457,,Loula,,0,18,11,9.990',
'458,,Maria,,0,15,172,6.425',
'459,,Carl,,15,17,11,3.349',
'Id,Name,Age,,Score,Rang,Bonus',
'566,Diego,,,,0,3.680',
'567,Carla,0,,26,1,19.361']
Output:
     Id      Name     Age   Score  Rang    Bonus
 0  181      ALEX     NaN     NaN    20  987.000
 1  182     Julia     NaN     NaN    18    8.390
 2  183    Marian     NaN     NaN    21    9.170
 3  184    Julien     0.0   175.0    60    9.095
 4  215      Asma    26.0    35.0    19    3.807
 5  216      Juan     NaN     NaN    20    7.982
 6  217      Rami     NaN     NaN    10    1.832
 7  415   Jessica  4920.0  8873.0   538    7.994
 8  416     Karen   890.0     6.0    12    9.993
 9  417    Andrea     0.0    69.0   283    7.200
10  419       Rym    10.0    18.0    10    7.196
11  420      Noor    10.0    70.0   910    8.291
12  421  Nathalie     0.0     5.0     0    0.900
13  456       Joe    10.0    13.0     0   74.917
14  457     Loula     0.0    18.0    11    9.990
15  458     Maria     0.0    15.0   172    6.425
16  459      Carl    15.0    17.0    11    3.349
17  566     Diego     NaN     NaN     0    3.680
18  567     Carla     0.0    26.0     1   19.361
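Note: if the input is a file, first read the lines with:
with open('/path/to/file', 'r') as f:
    # splitlines() drops the trailing '\n' that readlines() would keep,
    # so the last column name taken from lines[0] stays clean
    lines = f.read().splitlines()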