Python: convert PDF to CSV (multi-line columns)
My CSV is:
,Élément,État général,Observations
0,ENTRÉE,Etat d'usage,
1,PORTES,Etat d'usage,Chaînette cassé
Serrure du bas en mauvais état le système est
cassé au niveau de la chaînette
2,ENTRÉE / PORTESENTRÉE / PORTES,,
3,Type de porte,,Porte blindée
4,Poignée,,Bon état
5,Couleur,,Bois
But I want this:
,Élément,État général,Observations
0,ENTRÉE,Etat d'usage,
1,PORTES,Etat d'usage,Chaînette cassé; Serrure du bas en mauvais état le système ...
2,ENTRÉE / PORTESENTRÉE / PORTES,,
3,Type de porte,,Porte blindée
4,Poignée,,Bon état
5,Couleur,,Bois
My code simply converts one or more PDFs to CSV, one file per extracted table, like this:
import os
import io
import shutil
import tabula
import time

start_time = time.time()
path = './'
i = 0
j = 0
for directory, subdirectories, file in os.walk(path):
    for f in file:
        if f.endswith('.pdf'):
            # read_pdf returns a list of DataFrames, one per table found
            df = tabula.read_pdf(str(directory) + "/" + str(f), pages='all')
            i = 0
            j += 1
            for curr_df in df:
                i += 1
                curr_df.to_csv('./' + str(directory) + '-' + str(i) + '.csv')
print("--- convert %d .PDF to %d .CSV in %s seconds ---" % (j, i, time.time() - start_time))
Part of my problem is that I cannot handle this case by case. I need to be able to process all the CSV files in the same way.
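You can open the CSV, read its lines, and append every line that does not start with a comma (the header) or with a digit to the previous line. Then write the lines to a new CSV file: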
with open('filename.csv', encoding='utf-8') as f:
    text = [line.rstrip() for line in f]  # remove newline characters with rstrip()

lines = []
for i in text:
    if not i:
        continue  # skip empty lines, for which i[0] would raise IndexError
    if i[0] == ',' or i[0].isnumeric():
        lines.append(i)  # header or the start of a numbered row
    elif lines:
        lines[-1] = lines[-1] + "; " + i  # continuation of the previous row

with open('new_file.csv', mode='wt', encoding='utf-8') as newfile:
    newfile.write('\n'.join(lines))  # reinsert newline characters with '\n'.join()
To process every file in a directory, we can wrap this in a function and feed it all the files in that directory:
import glob
import os

def process_csv(filepath):
    with open(filepath, encoding='utf-8') as f:
        text = [line.rstrip() for line in f]  # remove newline characters with rstrip()
    lines = []
    for i in text:
        if not i:
            continue  # skip empty lines, for which i[0] would raise IndexError
        if i[0] == ',' or i[0].isnumeric():
            lines.append(i)  # header or the start of a numbered row
        elif lines:
            lines[-1] = lines[-1] + "; " + i  # continuation of the previous row
    # the output is written to the current directory, e.g. PDF-1.csv_fixed.csv
    outname = os.path.basename(filepath) + '_fixed.csv'
    with open(outname, mode='wt', encoding='utf-8') as newfile:
        newfile.write('\n'.join(lines))  # reinsert newline characters with '\n'.join()
    print('fixed: ' + outname)

files = glob.glob('./*.csv')  # use glob to list the paths of the CSV files in a directory
for file in files:  # loop through the list and feed each file to process_csv
    process_csv(file)
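One way to check that the merge restored the table is to count the fields in each row. The sketch below is not part of the answer above: `find_bad_rows` and its `expected_fields` parameter are hypothetical names, and the value 4 assumes the layout shown in the question (index, Élément, État général, Observations).

```python
import csv
import io


def find_bad_rows(csv_text, expected_fields=4):
    """Return (row_number, row) pairs whose field count differs from expected_fields."""
    reader = csv.reader(io.StringIO(csv_text))
    return [
        (number, row)
        for number, row in enumerate(reader, start=1)
        if len(row) != expected_fields
    ]


# Shortened samples based on the data in the question.
broken = (
    ",Élément,État général,Observations\n"
    "1,PORTES,Etat d'usage,Chaînette cassé\n"
    "Serrure du bas en mauvais état le système est\n"
)
fixed = (
    ",Élément,État général,Observations\n"
    "1,PORTES,Etat d'usage,Chaînette cassé; Serrure du bas en mauvais état le système est\n"
)

print(find_bad_rows(broken))  # the stray continuation line is reported
print(find_bad_rows(fixed))   # an empty list means every row has four fields
```

Running this on each `_fixed.csv` file would flag the rows that still need manual attention.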
For @SergeBallesta, this is what I have:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 8 non-null object
1 Élément 6 non-null object
2 État général 2 non-null object
3 Observations 4 non-null object
dtypes: object(4)
memory usage: 384.0+ bytes
None
Unnamed: 0 Élément État général Observations
0 0 ENTRÉE Etat d'usage NaN
1 1 PORTES Etat d'usage Chaînette cassé
2 Serrure du bas en mauvais état le système est NaN NaN NaN
3 cassé au niveau de la chaînette NaN NaN NaN
4 2 ENTRÉE / PORTESENTRÉE / PORTES NaN NaN
5 3 Type de porte NaN Porte blindée
6 4 Poignée NaN Bon état
7 5 Couleur NaN Bois
and the code that produced it:
import pandas as pd
df = pd.read_csv('../CSV/Entire/PDF-1.csv')
print(df.info())
print(df)
For @RJAdriaansen, this is the error I get:
fixed: PDF-8.csv_fixed.csv
fixed: PDF-5.csv_fixed.csv
fixed: PDF-7.csv_fixed.csv
fixed: PDF-6.csv_fixed.csv
fixed: PDF-2.csv_fixed.csv
fixed: PDF-10.csv_fixed.csv
fixed: PDF-3.csv_fixed.csv
fixed: PDF-4.csv_fixed.csv
Traceback (most recent call last):
File "corrCSV_v2.py", line 24, in <module>
process_csv(file)
File "corrCSV_v2.py", line 12, in process_csv
if i[0] ==',' or i[0].isnumeric():
IndexError: string index out of range
The error comes from this .csv:
,Élément,État général,Observations
0,CUISINE,Etat d'usage,
1,CUISINECUISINE 15CUISINE 18
CUISINE 19,,
I think it is caused by an empty line.
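If an empty line is indeed the cause, then `i[0]` fails because an empty string has no first character. A minimal sketch of a guard follows. The names `is_record_start` and `merge_continuation_lines` are made up for illustration, and the position of the empty line in the sample is an assumption, since the post does not show it.

```python
def is_record_start(line):
    # startswith() and slicing never raise on an empty string, unlike line[0]
    return line.startswith(",") or line[:1].isnumeric()


def merge_continuation_lines(lines, sep="; "):
    """Append every line that does not start a record to the previous record."""
    merged = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines instead of indexing into them
        if is_record_start(line) or not merged:
            merged.append(line)
        else:
            merged[-1] += sep + line
    return merged


# Sample from the failing file; the empty line's position is assumed.
sample = [
    ",Élément,État général,Observations",
    "0,CUISINE,Etat d'usage,",
    "1,CUISINECUISINE 15CUISINE 18",
    "",
    "CUISINE 19,,",
]

for row in merge_continuation_lines(sample):
    print(row)
```

A limitation of this heuristic is that a continuation line that happens to start with a digit is treated as a new row.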