tabula extract table from pdf remove line break

I have a table in a PDF file that contains text wrapped across several lines.

I use tabula to extract the table from the PDF file:

import tabula

file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1, pages=1, lattice=True)
table[0]

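However, the end result looks like this:

Is there a way to have a line break, or wrapped text, in the PDF table kept within its own row, without the additional rows?

The end result should be a table like this: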

You need to add a parameter. Replace

file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1, pages=1)
table[0]

with

file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1, pages=1, lattice=True)
table[0]

All of this is described in the tabula-py documentation.

Here is an example:

Take the document at "https://effectivehealthcare.ahrq.gov/sites/default/files/pdf/methods-guidance-tests-bias_methods.pdf"

import tabula
import io
import pandas as pd

file1 = r"C:\Users\s-degossondevarennes\.......\Desktop\methods-guidance-tests-bias_methods.pdf"
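# lattice=True makes tabula rely on the ruling lines drawn between cells
table = tabula.read_pdf(file1, pages=3, lattice=True)

df = table[0]
# drop the columns that are not needed for this illustration
df = df.drop(['Unnamed: 1', 'Unnamed: 2', 'Description', 'Unnamed: 3'], axis=1)
df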

returns:

     Unnamed: 0  \
0                                    NaN   
1                        Spectrum effect   
2                           Context bias   
3                         Selection bias   
4                                    NaN   
5            Variation in test execution   
6           Variation in test technology   
7                      Treatment paradox   
8               Disease progression bias   
9                                    NaN   
10     Inappropriate reference\rstandard   
11        Differential verification bias   
12             Partial verification bias   
13                                   NaN   
14                           Review bias   
15                  Clinical review bias   
16                    Incorporation bias   
17                  Observer variability   
18                                   NaN   
19    Handling of indeterminate\rresults   
20  Arbitrary choice of threshold\rvalue   

                            Source of Systematic Bias  
0                                          Population  
1   Tests may perform differently in various sampl...  
2   Prevalence of the target condition varies acco...  
3   The selection process determines the compositi...  
4                Test Protocol: Materials and Methods  
5   A sufficient description of the execution of i...  
6   When the characteristics of a medical test cha...  
7   Occurs when treatment is started on the basis ...  
8   Occurs when the index test is performed an unu...  
9       Reference Standard and Verification Procedure  
10  Errors of imperfect reference standard bias th...  
11  Part of the index test results is verified by ...  
12  Only a selected sample of patients who underwe...  
13                                     Interpretation  
14  Interpretation of the index test or reference ...  
15  Availability of clinical data such as age, sex...  
16  The result of the index test is used to establ...  
17  The reproducibility of test results is one det...  
18                                           Analysis  
19  A medical test can produce an uninterpretable ...  
20  The selection of the threshold value for the i...  

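The three dots in the Source of Systematic Bias column are only pandas truncating the display: everything in each cell, line breaks included, is treated as a single cell (item), not as several cells. Further proof: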

df.iloc[2,1]

returns the cell content:

'Prevalence of the target condition varies according to setting and may affect\restimates of test performance. Interpreters may consider test results to be\rpositive more frequently in settings with higher disease prevalence, which may\ralso affect estimates of test performance.'
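
If the \r characters left inside each cell are unwanted, they can be stripped after extraction. The following is a minimal sketch using only the standard library; clean_cell is a hypothetical helper name (not part of tabula), and it assumes that a single space is an acceptable replacement for each line break.

```python
import re

# Hypothetical helper, not part of tabula: matches a run of line breaks
# together with any whitespace directly around it.
_LINE_BREAKS = re.compile(r"\s*[\r\n]+\s*")


def clean_cell(value):
    """Return value with embedded line breaks collapsed into one space.

    Non-string values (for example NaN in empty cells) are returned unchanged.
    """
    if isinstance(value, str):
        return _LINE_BREAKS.sub(" ", value).strip()
    return value


# A cell value taken from the output shown earlier
cell = "Inappropriate reference\rstandard"
print(clean_cell(cell))
```

On a whole DataFrame, something like df.replace(r'[\r\n]+', ' ', regex=True) should have a similar effect on the cell values; column names containing line breaks would need to be handled separately.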

There must be something particular about your PDF. If it is available online, please share the link and I will take a look.