String split for detecting a page change in text extracted from a PDF

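I'm trying to analyze a PDF document with the iTextSharp library. The end goal is to read all of the text and split it into individual lines.

To do this, I call Split on the extracted text, which I hold in a single String variable.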

 Dim RigheTesto As String()
 RigheTesto = testoEstrapolato.Split({vbCrLf, vbCr, vbLf}, StringSplitOptions.RemoveEmptyEntries)

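The split itself works fine: I get an array of strings such as "data type: value", with one element for each line of the original file.

However, where the original PDF moves to a new page, the split does not recognize the next line as a separate line, and that line gets merged with the last line of the previous page.

Do you know how I can solve this?

Thanks for your time!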

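The following shows how to extract text from a PDF file using the NuGet package iTextSharp (tested with v5.5.13.2). The text is read and split one page at a time, so a line at the end of one page cannot run into the first line of the next.

Download/install the NuGet package iTextSharp

Create a class (name: PdfPageInfo.vb)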

Public Class PdfPageInfo
    Public Property PageNumber As Integer
    Public Property Lines As List(Of String) = New List(Of String)
End Class

Create a module (name: HelperiTextSharp.vb)

Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module HelperiTextSharp
    Public Function ExtractText(filename As String) As List(Of PdfPageInfo)
        Dim pageInfoList As List(Of PdfPageInfo) = New List(Of PdfPageInfo)

        Using reader As PdfReader = New PdfReader(filename)
            For i As Integer = 1 To reader.NumberOfPages Step 1

                'create new instance
                Dim pageInfo As PdfPageInfo = New PdfPageInfo()

                'set value
                pageInfo.PageNumber = i

                'get text from PDF page
                Dim pageText As String = PdfTextExtractor.GetTextFromPage(reader, i)

                'split on newline and set value
                pageInfo.Lines = pageText.Split(New String() {vbCrLf, vbCr, vbLf}, StringSplitOptions.RemoveEmptyEntries).ToList()

                'add this page to the result list
                pageInfoList.Add(pageInfo)
            Next
        End Using

        Return pageInfoList
    End Function
End Module
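
Why splitting per page helps can be shown without any PDF at all. The sketch below is only an illustration, and it rests on an assumption the question does not confirm: that the original text was built by appending each page's text directly to one string, and that a page's text does not end with a newline. The module and function names (PageBoundarySketch, SplitPagesIntoLines, SplitAfterConcatenation) are hypothetical and not part of iTextSharp.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports Microsoft.VisualBasic

Module PageBoundarySketch
    Private ReadOnly Separators As String() = {vbCrLf, vbCr, vbLf}

    ' Hypothetical helper: split every page on its own, then join the results.
    ' A page boundary always ends a line, whether or not a newline is present.
    Public Function SplitPagesIntoLines(pageTexts As IEnumerable(Of String)) As String()
        Return pageTexts.
            SelectMany(Function(t) t.Split(Separators, StringSplitOptions.RemoveEmptyEntries)).
            ToArray()
    End Function

    ' Assumed failure mode: glue the pages together first, then split.
    ' With no newline between pages, two lines end up in one array element.
    Public Function SplitAfterConcatenation(pageTexts As IEnumerable(Of String)) As String()
        Return String.Concat(pageTexts).Split(Separators, StringSplitOptions.RemoveEmptyEntries)
    End Function
End Module
```

For the two page strings "Name: A" & vbLf & "Code: 1" and "Date: X" & vbLf & "Total: 2", SplitPagesIntoLines returns four entries, while SplitAfterConcatenation returns three, with "Code: 1Date: X" merged into a single entry.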

Usage:

Dim ofd As OpenFileDialog = New OpenFileDialog()
ofd.Filter = "PDF files(*.pdf)|*.pdf"

If ofd.ShowDialog = DialogResult.OK Then
    Dim pdfPageInfoList As List(Of PdfPageInfo) = HelperiTextSharp.ExtractText(ofd.FileName)

    For Each pInfo As PdfPageInfo In pdfPageInfoList
        Debug.WriteLine("Page Number: " & pInfo.PageNumber.ToString())

        For i As Integer = 0 To pInfo.Lines.Count - 1 Step 1
            Debug.WriteLine("[" & i & "]: " & pInfo.Lines(i))
        Next

        Debug.WriteLine("---------------------------------" & vbCrLf)
    Next
End If
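
If a single array of lines is still wanted, as with RigheTesto in the question, the per-page lists can be flattened afterwards. This is a minimal sketch, not part of the tested code above: FlattenSketch and FlattenLines are hypothetical names, and the PdfPageInfo class is repeated only so that the block compiles on its own (leave it out when adding this to a project that already contains PdfPageInfo.vb).

```vbnet
Imports System.Collections.Generic
Imports System.Linq

' Same shape as the PdfPageInfo class above; repeated only for self-containment.
Public Class PdfPageInfo
    Public Property PageNumber As Integer
    Public Property Lines As List(Of String) = New List(Of String)
End Class

Module FlattenSketch
    ' Hypothetical helper: returns every line of every page, in page order.
    Public Function FlattenLines(pages As IEnumerable(Of PdfPageInfo)) As String()
        Return pages.
            OrderBy(Function(p) p.PageNumber).
            SelectMany(Function(p) p.Lines).
            ToArray()
    End Function
End Module
```

It could then be called as Dim RigheTesto As String() = FlattenSketch.FlattenLines(HelperiTextSharp.ExtractText(ofd.FileName)). Because each page was already split separately, the last line of one page and the first line of the next remain separate elements.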

Resources: