How to read data from table-structured PDF using itextsharp?

Question

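I have a problem reading some data from a PDF file.
My file is structured; it contains tables and plain text. The standard parser reads data from different columns that sit on the same line. For example: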

Some Table Header  
Data Col1a     Data Col2a      Data Col3a
Data Col1b     Data Col2b      Data Col3b
               Data Col2c

With this code

        PdfReader reader = new PdfReader(pdfName);

        List<String> text = new List<String>();
        String page;
        List<String> pageStrings;
        string[] separators = { "\n", "\r\n" };

        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            page = PdfTextExtractor.GetTextFromPage(reader, i);
            pageStrings = new List<string>(page.Split(separators, StringSplitOptions.RemoveEmptyEntries));
            text.AddRange(pageStrings);

        }

        reader.Close();

        return text;

it will be concatenated into these strings:

Some Table Header
Data Col1a Data Col2a Data Col3a  
Data Col1b Data Col2b Data Col3b  
Data Col2c

I would like to obtain concatenated strings that reflect the data blocks. For the example above, I want strings like this:

Some Table Header
Data Col1a Data Col1b   
Data Col2a Data Col2b Data Col2c  
Data Col3a Data Col3b

Does anyone know how to tune itextsharp to get this behavior from the PDF parser? Perhaps someone has a suitable code sample?
The sample PDF file is here

Answer 1

The OP's sample file contains multiple sections like this:

and the OP mentioned in a comment:

another one tool parse my PDF exactly like I want. [...]

PS: this tool is pdfbox

Using PDFBox (v1.8.10, the current release version) in this method:

String extract(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(document);
}

returns for the section shown above

Driver Book for 8/5/2015
Company IS MEDICAL; AND Date of Service IS BETWEEN 08/05/2015 AND 08/05/2015; AND Status IS Assigned; AND Vehicles IS  MEDICAL: 
CATY
 MEDICAL
Trip #: 314-A
Comments: ----LIVERY---
Destination:Pick-up:
Call Type: Livery
<Doctor Office>
REGO PARK,  (631) 
000-0000
(718) 896-5953
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154
11:00:00 PAT, MIKHAIL
Trip #: 314-B
Comments:  ----LIVERY---
Destination:Pick-up:
Call Type: Livery
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154
<Doctor Office>
63-6 REGO PARK, NY 
11374 (631) 000-0000
11:01:00 PAT, MIKHAIL

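This is not really a column-wise extraction, but some blocks of information (such as the address blocks) stay together.

Getting the same output with iText(Sharp) is actually very simple: just explicitly use the SimpleTextExtractionStrategy instead of the LocationTextExtractionStrategy used by default, i.e. replace this line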

page = PdfTextExtractor.GetTextFromPage(reader, i);

by

page = PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy());

Apart from one space character per data set (iText(Sharp) extracts Destination: Pick-up: instead of Destination:Pick-up:), the results are identical.

Concerning your conclusion from the PDFBox text extraction:

So I think that PDF is really table structured.

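Actually this extraction order merely means that the operations drawing the string segments in the PDF page content stream happen to occur in this order. Since the order of these operations is arbitrary according to the PDF specification, any update of the software generating these PDFs may result in files from which the PDFBox PDFTextStripper and the iText SimpleTextExtractionStrategy extract only an unintelligible jumble of characters.

PS: If you set the PDFBox PDFTextStripper property SortByPosition to true like this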

    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);
    return stripper.getText(document);

then PDFBox extracts the text just like iText(Sharp) does with the (default) LocationTextExtractionStrategy.
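
To make the difference between the two orderings concrete, here is a minimal sketch that does not use iText or PDFBox at all. The Chunk class, its coordinates and the sample data are hypothetical stand-ins for string segments drawn column by column; the real strategies are considerably more elaborate (they estimate spacing, tolerate slightly different baselines, and so on).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of one drawn string segment; not an iText or PDFBox type.
final class Chunk {
    final String text;
    final float x; // left edge
    final float y; // baseline in PDF coordinates (larger y is higher on the page)

    Chunk(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ExtractionOrderDemo {
    // Content-stream order: emit the chunks exactly in the order they were drawn.
    static String inDrawingOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) sb.append('\n');
            sb.append(c.text);
        }
        return sb.toString();
    }

    // Position order: sort top to bottom, then left to right.
    // Simplification: chunks share a line only if their baselines are exactly equal.
    static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : sorted) {
            if (previous != null) sb.append(previous.y == c.y ? ' ' : '\n');
            sb.append(c.text);
            previous = c;
        }
        return sb.toString();
    }

    // Sample data drawn column by column, mirroring the table in the question.
    static List<Chunk> sample() {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Data Col1a", 0f, 100f));
        chunks.add(new Chunk("Data Col1b", 0f, 90f));
        chunks.add(new Chunk("Data Col2a", 50f, 100f));
        chunks.add(new Chunk("Data Col2b", 50f, 90f));
        chunks.add(new Chunk("Data Col2c", 50f, 80f));
        chunks.add(new Chunk("Data Col3a", 100f, 100f));
        chunks.add(new Chunk("Data Col3b", 100f, 90f));
        return chunks;
    }
}
```

With this sample, sortedByPosition yields the row-wise lines from the question, while inDrawingOrder keeps the chunks of each column next to each other. The latter only works because the hypothetical producer happened to draw column by column, which is exactly the dependency the answer warns about.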

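The OP indicated interest in the block structure inherent in the content stream. The most obvious such structure in a generic PDF is the text object (in which multiple strings can be drawn).

In the case at hand, the SimpleTextExtractionStrategy was used. It can easily be extended to also include markers in its output that correspond to the start and end of text objects. In Java, this can be done with an anonymous class like this: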

return PdfTextExtractor.getTextFromPage(reader, pageNo, new SimpleTextExtractionStrategy()
{
    boolean empty = true;

    @Override
    public void beginTextBlock()
    {
        if (!empty)
            appendTextChunk("<BLOCK>");
        super.beginTextBlock();
    }

    @Override
    public void endTextBlock()
    {
        if (!empty)
            appendTextChunk("</BLOCK>\n");
        super.endTextBlock();
    }

    @Override
    public String getResultantText()
    {
        if (empty)
            return super.getResultantText();
        else
            return "<BLOCK>" + super.getResultantText();
    }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        empty = false;
        super.renderText(renderInfo);
    }
});

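(TextExtraction.java method extractSimple)

(This Java code should be easy to translate to C#. Toggling the empty boolean may look odd; it is necessary, though, because the base class assumes that certain additional properties are set as soon as some chunk has been appended to the extracted content.)

Using this extended strategy one gets for the section shown above: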

<BLOCK>Driver Book for 8/5/2015
Company IS MEDICAL; AND Date of Service IS BETWEEN 08/05/2015 AND 08/05/2015; AND Status IS Assigned; AND Vehicles IS  MEDICAL: 
CATY</BLOCK>
<BLOCK>
 MEDICAL</BLOCK>
<BLOCK>
Trip #: 314-A</BLOCK>
<BLOCK>
Comments: ----LIVERY---</BLOCK>
<BLOCK>
Destination: Pick-up:</BLOCK>
<BLOCK>
Call Type: Livery
<Doctor Office>
REGO PARK,  (631) 
000-0000
(718) 896-5953</BLOCK>
<BLOCK>
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154</BLOCK>
<BLOCK>
11:00:00</BLOCK>
<BLOCK> PAT, MIKHAIL</BLOCK>
<BLOCK>
Trip #: 314-B</BLOCK>
<BLOCK>
Comments:  ----LIVERY---</BLOCK>
<BLOCK>
Destination: Pick-up:</BLOCK>
<BLOCK>
Call Type: Livery
74- AVE 204E  HEIGHTS, NY 
11372 (718) 639-4154</BLOCK>
<BLOCK>
<Doctor Office>
63-6 REGO PARK, NY 
11374 (631) 000-0000</BLOCK>
<BLOCK>
11:01:00</BLOCK>
<BLOCK> PAT, MIKHAIL</BLOCK>

As this keeps the addresses together in one block, it might help with the extraction.
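
For post-processing, the marked-up text can be split back into individual blocks. The following is only a sketch in plain Java: the class name BlockSplitter and its split method are made up for illustration, and it assumes that the extracted text itself never contains the literal markers <BLOCK> or </BLOCK>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText: splits marker-delimited text into blocks.
final class BlockSplitter {
    // DOTALL lets a block span several lines; the reluctant quantifier
    // stops at the first closing marker so blocks are not merged.
    private static final Pattern BLOCK =
            Pattern.compile("<BLOCK>(.*?)</BLOCK>", Pattern.DOTALL);

    // Returns the trimmed content of every non-empty block, in order of appearance.
    static List<String> split(String marked) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(marked);
        while (m.find()) {
            String content = m.group(1).trim();
            if (!content.isEmpty()) {
                blocks.add(content);
            }
        }
        return blocks;
    }
}
```

Calling BlockSplitter.split on the output above would give one list entry per text object, so an address block such as the one starting with Call Type: Livery arrives as a single multi-line string. Angle-bracket text like <Doctor Office> is left alone because only the exact marker strings are matched.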

c#

itextsharp