How to extract pages from a PDF using iText 7?

I am trying to use the iText 7 library to extract pages from a PDF file in order to create a new file.

static void Splitter()
{
    string file = @"C:\Users\Standard\Downloads\Merged\CK 2002989 ,514.42 02.12.20.pdf";
    string range = "1, 4, 8";
    var pdfDocumentInvoiceNumber = new PdfDocument(new PdfReader(file));
    var split = new PdfSplitter(pdfDocumentInvoiceNumber);
    var result = split.ExtractPageRange(new PageRange(range));
    var numberOfPagesPdfDocumentInvoiceNumber = result.GetNumberOfPages();
    string toFile = @"C:\Users\Standard\Downloads\Result\Extracted.pdf";
    var pdfWriter = new PdfWriter(toFile);
    var pdfDocumentInvoiceMergeResult = new PdfDocument(pdfWriter);
    for (var i = 1; i <= numberOfPagesPdfDocumentInvoiceNumber; i++)
    {
        var pdfPage = result.GetPage(i).CopyTo(pdfDocumentInvoiceMergeResult);
        pdfDocumentInvoiceMergeResult.AddPage(pdfPage);
    }
}

But when I try to use CopyTo, I get this error:

iText.Kernel.PdfException: 'Cannot copy indirect object from the document that is being written.'

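The problem here is that the documents returned by the PdfSplitter methods, in particular by ExtractPageRange, are iText 7 documents that are being written, i.e. these PdfDocument instances have been instantiated with a PdfWriter.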

Such documents are subject to certain restrictions; in particular, you cannot copy pages from them.

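For these result documents (and hence the whole PdfSplitter class) to be of any value, you need a way to define where these PdfWriter objects write to. There is a way, albeit not a really intuitive one: you have to override the PdfSplitter method GetNextPdfWriter, which originally looks like this: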

/// <summary>This method is called when another split document is to be created.</summary>
/// <remarks>
/// This method is called when another split document is to be created.
/// You can override this method and return your own
/// <see cref="iText.Kernel.Pdf.PdfWriter"/>
/// depending on your needs.
/// </remarks>
/// <param name="documentPageRange">the page range of the original document to be included in the document being created now.
///     </param>
/// <returns>the PdfWriter instance for the document which is being created.</returns>
protected internal virtual PdfWriter GetNextPdfWriter(PageRange documentPageRange) {
    return new PdfWriter(new ByteArrayOutputStream());
}

In a use case like yours, where you expect only a single returned document that should end up in a file, you can do this as follows:

class MySplitter : PdfSplitter
{
    public MySplitter(PdfDocument pdfDocument) : base(pdfDocument)
    {
    }

    protected override PdfWriter GetNextPdfWriter(PageRange documentPageRange)
    {
        String toFile = @"C:\Users\Standard\Downloads\Result\Extracted.pdf";
        return new PdfWriter(toFile);
    }
}

With the PdfWriter instantiation moved to that custom splitter, your main code is reduced to

string file = @"C:\Users\Standard\Downloads\Merged\CK 2002989 ,514.42 02.12.20.pdf";
string range = "1, 4, 8";
var pdfDocumentInvoiceNumber = new PdfDocument(new PdfReader(file));
var split = new MySplitter(pdfDocumentInvoiceNumber);
var result = split.ExtractPageRange(new PageRange(range));
result.Close();

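In a use case like yours it does indeed look weird to have to derive a custom class from PdfSplitter just to extract a few pages from a source PDF into a result PDF. Wouldn't an additional PdfWriter parameter of ExtractPageRange make that easier?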

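Be aware, though, that the main objective of the PdfSplitter class is to split a document into many parts using ExtractPageRanges and its other splitting methods, and in that case you would need to provide a larger, possibly not exactly known number of PdfWriters... not simpler at all!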
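To illustrate the many-parts case, here is a minimal sketch of a splitter that hands out one PdfWriter per extracted range. The class name NumberedFileSplitter, the file name pattern, and the path C:\Temp are hypothetical and only serve the illustration; the sketch assumes ExtractPageRanges calls GetNextPdfWriter once per range.

```csharp
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Utils;

// Hypothetical splitter: every extracted part goes to its own numbered file.
class NumberedFileSplitter : PdfSplitter
{
    private readonly string filePattern; // format string with one {0} placeholder
    private int partNumber = 0;

    public NumberedFileSplitter(PdfDocument pdfDocument, string filePattern) : base(pdfDocument)
    {
        this.filePattern = filePattern;
    }

    protected override PdfWriter GetNextPdfWriter(PageRange documentPageRange)
    {
        // Each call produces the writer for the next part.
        partNumber++;
        return new PdfWriter(string.Format(filePattern, partNumber));
    }
}

static class NumberedFileSplitterDemo
{
    public static void SplitIntoParts(string sourceFile)
    {
        var source = new PdfDocument(new PdfReader(sourceFile));
        // Assumed output location, adjust as needed.
        var splitter = new NumberedFileSplitter(source, @"C:\Temp\Part-{0}.pdf");
        var ranges = new List<PageRange> { new PageRange("1-3"), new PageRange("4, 8") };

        // Each returned document is still being written; closing it finishes its file.
        foreach (PdfDocument part in splitter.ExtractPageRanges(ranges))
        {
            part.Close();
        }
        source.Close();
    }
}
```

The counter keeps the file names distinct without knowing the number of parts in advance, which is the situation the splitting methods put you in.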

Of course, a better solution might be to inject a lambda expression or some other callback mechanism. For example:

class ImprovedSplitter : PdfSplitter
{
    private Func<PageRange, PdfWriter> nextWriter;
    public ImprovedSplitter(PdfDocument pdfDocument, Func<PageRange, PdfWriter> nextWriter) : base(pdfDocument)
    {
        this.nextWriter = nextWriter;
    }

    protected override PdfWriter GetNextPdfWriter(PageRange documentPageRange)
    {
        return nextWriter.Invoke(documentPageRange);
    }
}

which you can use like this

string file = @"C:\Users\Standard\Downloads\Merged\CK 2002989 ,514.42 02.12.20.pdf";
string range = "1, 4, 8";
var pdfDocumentInvoiceNumber = new PdfDocument(new PdfReader(file));
var split = new ImprovedSplitter(pdfDocumentInvoiceNumber, pageRange => new PdfWriter(@"C:\Users\Standard\Downloads\Result\Extracted.pdf"));
var result = split.ExtractPageRange(new PageRange(range));
result.Close();
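If you do not need PdfSplitter at all, the restriction described above can also be sidestepped by copying straight from the source document, which is opened for reading only and may therefore serve as a copy source. This is a different technique from the PdfSplitter approach shown so far, sketched here with PdfDocument.CopyPagesTo; the class and method names PageExtractor and ExtractPages are hypothetical.

```csharp
using System.Collections.Generic;
using iText.Kernel.Pdf;

static class PageExtractor
{
    // Copies the given 1-based page numbers from sourcePath into a new PDF at targetPath.
    public static void ExtractPages(string sourcePath, string targetPath, IList<int> pageNumbers)
    {
        var source = new PdfDocument(new PdfReader(sourcePath)); // reading mode only
        var target = new PdfDocument(new PdfWriter(targetPath)); // document being written

        // Copying from a read-only document into a written one is allowed;
        // the copied pages are appended to the target.
        source.CopyPagesTo(pageNumbers, target);

        // Close the target first so its content is written while the source is still open.
        target.Close();
        source.Close();
    }
}
```

For the pages from the question, a call could look like PageExtractor.ExtractPages(file, toFile, new List<int> { 1, 4, 8 }), with file and toFile being the paths used above.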