Split multi-page PDFs based on the barcode on a page, up to the next unique barcode
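So far I have VB.NET code that works for one file: it splits the file based on the unique barcode on each page that identifies it.
Each barcode is one of the following:
COVERSPLIT
COMPLAINTSPLIT
EXHIBITSPLIT
MILSPLIT
SUMSPLIT
The problem is this: say the first page has the barcode COVERSPLIT because it is a cover sheet, but the next page is also a cover sheet and has no barcode on it. When I run my code, it only extracts the sheets that carry one of the recognized barcodes and ignores the ones that do not.
I tried doing this: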
Imports Bytescout.PDFExtractor
Imports System.Collections
Imports System.Collections.Generic
Imports System.IO.Path
Class Program
Friend Shared Sub Main(args As String())
Dim Dir As String = "G:\Word\Department Folders\Pre-Suit\Drafts-IL-IL_AttyReview18-09\Reviewed\"
Dim inputFile As String = Dir & "ZTEST01.SMITH.pdf"
Dim Unmerged As String = Dir & "unmerged\"
Dim Path As String = IO.Path.GetFileNameWithoutExtension(inputFile)
Dim Extracted As String = Path.Substring(0, 7)
' Create Bytescout.PDFExtractor.TextExtractor instance
Dim extractor As New TextExtractor()
' Load sample PDF document
extractor.LoadDocumentFromFile(inputFile)
Dim pageCount As Integer = extractor.GetPageCount()
' Search each page for a keyword
For i As Integer = 0 To pageCount - 1
If extractor.Find(i, "COVERSPLIT", False) Then
' Extract page
Using splitter As New DocumentSplitter()
splitter.OptimizeSplittedDocuments = True
Dim pageNumber As Integer = i + 1
' (!) page number in ExtractPage() is 1-based
Dim outputfile As String = Unmerged & Extracted & " Cover Sheet " & pageNumber.ToString() & ".pdf"
splitter.ExtractPage(inputFile, outputfile, pageNumber)
Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")
End Using
End If
Next
For i As Integer = 0 To pageCount - 1
If extractor.Find(i, "COVERSPLIT", False) Then
' Extract page
Using splitter As New DocumentSplitter()
splitter.OptimizeSplittedDocuments = True
Dim pageNumber As Integer = i + 2
' (!) page number in ExtractPage() is 1-based
Dim outputfile As String = Unmerged & Extracted & " Cover Sheet " & pageNumber.ToString() & ".pdf"
splitter.ExtractPage(inputFile, outputfile, pageNumber)
Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")
End Using
End If
Next
For i As Integer = 0 To pageCount - 1
If extractor.Find(i, "COMPLAINTSPLIT", False) Then
' Extract page
Using splitter As New DocumentSplitter()
splitter.OptimizeSplittedDocuments = True
Dim pageNumber As Integer = i + 1
' (!) page number in ExtractPage() is 1-based
Dim outputfile As String = Unmerged & Extracted & " Complaint " & pageNumber.ToString() & ".pdf"
splitter.ExtractPage(inputFile, outputfile, pageNumber)
Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")
End Using
End If
Next
For i As Integer = 0 To pageCount - 1
If extractor.Find(i, "COMPLAINTSPLIT", False) Then
' Extract page
Using splitter As New DocumentSplitter()
splitter.OptimizeSplittedDocuments = True
Dim pageNumber As Integer = i + 2
' (!) page number in ExtractPage() is 1-based
Dim outputfile As String = Unmerged & Extracted & " Complaint " & pageNumber.ToString() & ".pdf"
splitter.ExtractPage(inputFile, outputfile, pageNumber)
Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")
End Using
End If
Next
For i As Integer = 0 To pageCount - 1
If extractor.Find(i, "EXHIBITSPLIT", False) Then
' Extract page
Using splitter As New DocumentSplitter()
splitter.OptimizeSplittedDocuments = True
Dim pageNumber As Integer = i + 1
' (!) page number in ExtractPage() is 1-based
Dim outputfile As String = Unmerged & Extracted & " Exhibit " & pageNumber.ToString() & ".pdf"
splitter.ExtractPage(inputFile, outputfile, pageNumber)
Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")
End Using
End If
Next
For i As Integer = 0 To pageCount - 1
If extractor.Find(i, "EXHIBITSPLIT", False) Then
' Extract page
Using splitter As New DocumentSplitter()
splitter.OptimizeSplittedDocuments = True
Dim pageNumber As Integer = i + 2
' (!) page number in ExtractPage() is 1-based
Dim outputfile As String = Unmerged & Extracted & " Exhibit " & pageNumber.ToString() & ".pdf"
splitter.ExtractPage(inputFile, outputfile, pageNumber)
Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")
End Using
End If
Next
For i As Integer = 0 To pageCount - 1
If extractor.Find(i, "MILSPLIT", False) Then
' Extract page
Using splitter As New DocumentSplitter()
splitter.OptimizeSplittedDocuments = True
Dim pageNumber As Integer = i + 1
' (!) page number in ExtractPage() is 1-based
Dim outputfile As String = Unmerged & Extracted & " Military " & pageNumber.ToString() & ".pdf"
splitter.ExtractPage(inputFile, outputfile, pageNumber)
Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")
End Using
End If
Next
For i As Integer = 0 To pageCount - 1
If extractor.Find(i, "SUMSPLIT", False) Then
' Extract page
Using splitter As New DocumentSplitter()
splitter.OptimizeSplittedDocuments = True
Dim pageNumber As Integer = i + 1
' (!) page number in ExtractPage() is 1-based
Dim outputfile As String = Unmerged & Extracted & " Summons " & pageNumber.ToString() & ".pdf"
splitter.ExtractPage(inputFile, outputfile, pageNumber)
Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")
End Using
End If
Next
For i As Integer = 0 To pageCount - 1
If extractor.Find(i, "SUMSPLIT", False) Then
' Extract page
Using splitter As New DocumentSplitter()
splitter.OptimizeSplittedDocuments = True
Dim pageNumber As Integer = i + 2
' (!) page number in ExtractPage() is 1-based
Dim outputfile As String = Unmerged & Extracted & " Summons " & pageNumber.ToString() & ".pdf"
splitter.ExtractPage(inputFile, outputfile, pageNumber)
Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")
End Using
End If
Next
' Cleanup
extractor.Dispose()
Console.WriteLine()
Console.WriteLine("Press any key...")
Console.ReadKey()
End Sub
End Class
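As you can see, I just copied and pasted the same For i... loop and changed Dim pageNumber As Integer = i + 1 to i + 2 to include the page that follows.
The problem is that the number of pages following a page with a unique barcode is sometimes indeterminate.
So how would I write it so that it extracts, for example:
the COVERSPLIT page plus all following pages without a barcode, until it reaches the next page that has a barcode (e.g. COMPLAINTSPLIT)?
Also, how can I extract the page with the barcode COVERSPLIT and the pages after it (until the next barcode is reached), but save all of those pages in a single PDF?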
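You have already noticed that you have a lot of repeated code. In a case like this, you can take the small part that varies between otherwise identical pieces of code and put it into a variable.
So, if we have a list of the barcodes that identify the page types, we can iterate over it to find the type of the current page. If the page has no barcode, we assume the page type is unchanged from the previous page.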
Option Infer On
Option Strict On
Imports System.IO
Imports Bytescout.PDFExtractor
Module Module1
    ' Pairs a barcode identifier with the name used in the output filename.
    Class PageType
        Property Identifier As String
        Property TypeName As String
    End Class
    Sub Main()
        Dim dir = "G:\Word\Department Folders\Pre-Suit\Drafts-IL-IL_AttyReview18-09\Reviewed\"
        Dim inputFile = Path.Combine(dir, "ZTEST01.SMITH.pdf")
        Dim unmerged = Path.Combine(dir, "unmerged")
        ' Set up a list of the identifiers to be searched for and the corresponding names to be used in the filename.
        Dim pageTypes As New List(Of PageType)
        Dim ids = {"COVERSPLIT", "COMPLAINTSPLIT", "EXHIBITSPLIT", "MILSPLIT", "SUMSPLIT"}
        Dim nams = {" Cover Sheet ", " Complaint ", " Exhibit ", " Military ", " Summons "}
        For i = 0 To ids.Length - 1
            pageTypes.Add(New PageType With {.Identifier = ids(i), .TypeName = nams(i)})
        Next
        Dim extracted = Path.GetFileNameWithoutExtension(inputFile).Substring(0, 7)
        Dim extractor As New TextExtractor()
        ' Load sample PDF document
        extractor.LoadDocumentFromFile(inputFile)
        Dim pageCount = extractor.GetPageCount()
        ' Used for any pages that come before the first barcode.
        Dim currentPageTypeName = "UNKNOWN"
        ' Search each page for a keyword
        For i = 0 To pageCount - 1
            ' Find the type of the current page.
            ' If it is not present on the page, then the last one found will be used.
            For Each pt In pageTypes
                If extractor.Find(i, pt.Identifier, False) Then
                    currentPageTypeName = pt.TypeName
                End If
            Next
            ' Extract page
            Using splitter As New DocumentSplitter() With {.OptimizeSplittedDocuments = True}
                Dim pageNumber = i + 1 ' (!) page number in ExtractPage() is 1-based
                Dim outputfile = Path.Combine(unmerged, extracted & currentPageTypeName & pageNumber & ".pdf")
                splitter.ExtractPage(inputFile, outputfile, pageNumber)
                Console.WriteLine("Extracted page " & pageNumber & " to file """ & outputfile & """")
            End Using
        Next
        ' Cleanup
        extractor.Dispose()
        Console.WriteLine()
        Console.WriteLine("Press any key...")
        Console.ReadKey()
    End Sub
End Module
I suspect that Using splitter As New DocumentSplitter() With {.OptimizeSplittedDocuments = True} should be outside the For loop, so that it is not created and disposed of for every page.
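I renamed your Path variable, because it got in the way of using IO.Path concisely. It is also better to use the Path.Combine method to join the parts of a path, because it takes care of the path separators for you.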
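To illustrate the point about separators, here is a minimal sketch that does not touch the PDF library at all. The folder name "Reviewed" and the module name are made up for the illustration; the only API used is System.IO.Path.

```vbnet
Option Infer On
Option Strict On
Imports System
Imports System.IO

Module PathCombineSketch
    Sub Main()
        ' Hypothetical folder name, used only for this illustration.
        Dim baseDir = "Reviewed"
        ' Once with a trailing separator on the first part, once without.
        Dim withSeparator = Path.Combine(baseDir & Path.DirectorySeparatorChar, "unmerged")
        Dim withoutSeparator = Path.Combine(baseDir, "unmerged")
        ' Path.Combine inserts a separator only when one is missing,
        ' so both calls produce the same path.
        Console.WriteLine(withSeparator = withoutSeparator)
        Console.WriteLine(withoutSeparator)
    End Sub
End Module
```

With plain & concatenation, the second case would have produced "Reviewedunmerged", which is why the original code needed the trailing backslash on every folder string.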
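To accumulate all the pages of one type into a single file, you will have to detect when the type changes and then use the ExtractPageRange method. I don't have Bytescout.PDFExtractor or a sample PDF, so I cannot try it.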
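The range detection itself can be sketched without any Bytescout calls, so it can be checked on its own. This is only a sketch under stated assumptions, not tested against the library: PageRange and GroupPages are hypothetical names, and the input is assumed to be a list that holds, for each page, the type name found on it, or Nothing when the page has no barcode (in the real program that list would come from the extractor.Find loop). It starts a new range at every page that carries a barcode, which matches "this page plus all following pages without a barcode".

```vbnet
Option Infer On
Option Strict On
Imports System
Imports System.Collections.Generic

Module RangeSketch
    ' Hypothetical helper type: one run of consecutive pages that belong together.
    Class PageRange
        Property TypeName As String
        Property FirstPage As Integer ' 1-based
        Property LastPage As Integer  ' 1-based, inclusive
    End Class

    ' detected(i) is the type name found on 0-based page i,
    ' or Nothing if that page has no barcode.
    Function GroupPages(detected As IList(Of String)) As List(Of PageRange)
        Dim ranges As New List(Of PageRange)
        Dim current As PageRange = Nothing
        For i = 0 To detected.Count - 1
            Dim pageNumber = i + 1
            If detected(i) IsNot Nothing OrElse current Is Nothing Then
                ' A barcode page (or a barcode-less first page) opens a new range.
                current = New PageRange With {
                    .TypeName = If(detected(i), "UNKNOWN"),
                    .FirstPage = pageNumber,
                    .LastPage = pageNumber}
                ranges.Add(current)
            Else
                ' No barcode: the page extends the range that is already open.
                current.LastPage = pageNumber
            End If
        Next
        Return ranges
    End Function

    Sub Main()
        ' Made-up input: six pages, barcodes on pages 1, 3 and 6.
        Dim detected = New String() {"Cover Sheet", Nothing, "Complaint", Nothing, Nothing, "Summons"}
        For Each r In GroupPages(detected)
            ' In the real program, this is where the splitter would write
            ' pages r.FirstPage through r.LastPage into one PDF.
            Console.WriteLine(r.TypeName & ": pages " & r.FirstPage & "-" & r.LastPage)
        Next
    End Sub
End Module
```

Each resulting range would then go to ExtractPageRange with one output file per range; the exact parameters of that method should be checked in the Bytescout documentation. Including FirstPage in the file name keeps the names distinct if the same type occurs more than once in a document.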