String split for detection of a text page change from PDF
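I'm trying to analyze a PDF document using the iTextSharp library. The end goal is to read all the text and split it into individual lines.
To do this, I call Split on the extracted text, which I hold in a single string variable.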
Dim RigheTesto As String()
RigheTesto = testoEstrapolato.Split({vbCrLf, vbCr, vbLf}, StringSplitOptions.RemoveEmptyEntries)
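The split works well: I get an array of strings such as "data type: value", with one element for each line of the original file...
...but where the original PDF changes page, the split does not recognize that a new line starts, so the first line of the new page is merged with the last line of the previous page.
Do you know how to solve this?
Thanks for your time!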
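The following shows how to extract text from a PDF file using the NuGet package iTextSharp (tested with v5.5.13.2). Text is extracted one page at a time, so a page change can never merge two lines.
Download/install the NuGet package iTextSharp
Create a class (name: PdfPageInfo.vb)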
Public Class PdfPageInfo
    Public Property PageNumber As Integer
    Public Property Lines As List(Of String) = New List(Of String)
End Class
Create a module (name: HelperiTextSharp.vb)
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module HelperiTextSharp
    Public Function ExtractText(filename As String) As List(Of PdfPageInfo)
        Dim pageInfoList As List(Of PdfPageInfo) = New List(Of PdfPageInfo)

        Using reader As PdfReader = New PdfReader(filename)
            For i As Integer = 1 To reader.NumberOfPages Step 1
                'create new instance
                Dim pageInfo As PdfPageInfo = New PdfPageInfo()

                'set value
                pageInfo.PageNumber = i

                'get text from PDF page
                Dim pageText As String = PdfTextExtractor.GetTextFromPage(reader, i)

                'split on newline and set value
                pageInfo.Lines = pageText.Split(New String() {vbCrLf, vbCr, vbLf}, StringSplitOptions.RemoveEmptyEntries).ToList()

                'add
                pageInfoList.Add(pageInfo)
            Next
        End Using

        Return pageInfoList
    End Function
End Module
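The question does not show how the full text was assembled, so the following is only a guess at the cause: if the text of each page is appended to one string with no newline in between, the last line of one page and the first line of the next become a single line. A minimal sketch with made-up page text (the module name and sample strings are hypothetical, and no PDF library is involved):

```vb
Imports System
Imports Microsoft.VisualBasic

Module PageBoundaryDemo
    Sub Main()
        ' Hypothetical text of two pages; note that neither ends with a newline.
        Dim pages As String() = {
            "Name: Alice" & vbLf & "Age: 30",
            "City: Rome" & vbLf & "Zip: 00100"
        }
        Dim separators As String() = {vbCrLf, vbCr, vbLf}

        ' Plain concatenation: "Age: 30" and "City: Rome" land on the same line.
        Dim merged As String() = String.Concat(pages).Split(separators, StringSplitOptions.RemoveEmptyEntries)
        Console.WriteLine(merged.Length)    ' prints 3

        ' Joining with a newline keeps the page boundary as a line boundary.
        Dim separated As String() = String.Join(vbLf, pages).Split(separators, StringSplitOptions.RemoveEmptyEntries)
        Console.WriteLine(separated.Length) ' prints 4
    End Sub
End Module
```

Splitting each page on its own, as ExtractText does, avoids the issue without needing any separator.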
Usage:
Dim ofd As OpenFileDialog = New OpenFileDialog()
ofd.Filter = "PDF files(*.pdf)|*.pdf"

If ofd.ShowDialog = DialogResult.OK Then
    Dim pdfPageInfoList As List(Of PdfPageInfo) = HelperiTextSharp.ExtractText(ofd.FileName)

    For Each pInfo As PdfPageInfo In pdfPageInfoList
        Debug.WriteLine("Page Number: " & pInfo.PageNumber.ToString())

        For i As Integer = 0 To pInfo.Lines.Count - 1 Step 1
            Debug.WriteLine("[" & i & "]: " & pInfo.Lines(i))
        Next

        Debug.WriteLine("---------------------------------" & vbCrLf)
    Next
End If
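If a single array of lines for the whole document is still needed, as in the question, the per-page lists can be flattened in page order. This is a sketch only: FlattenLines is a hypothetical helper, not part of iTextSharp, and it relies on the PdfPageInfo class defined above.

```vb
Imports System.Collections.Generic
Imports System.Linq

Module FlattenLinesSketch
    ' Hypothetical helper: concatenates the lines of every page, in page order,
    ' into one array. Lines from different pages are never merged because
    ' each page was already split separately.
    Public Function FlattenLines(pages As IEnumerable(Of PdfPageInfo)) As String()
        Return pages.SelectMany(Function(p) p.Lines).ToArray()
    End Function
End Module
```

For example, `Dim RigheTesto As String() = FlattenLines(HelperiTextSharp.ExtractText(ofd.FileName))` would give the array the question was after.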