Create single page PDF from multi page PDF WITHOUT external libraries

Question

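I have seen the following question on SO: Create Multi-Page PDF from other PDFs

However, it did not answer what I need. Suppose I have a PDF with 20 pages. So far so good.

From the same source I can get a PDF with only one page. This one will be used as my template PDF. What I want to do is replace the content stream (the FlateDecode stream) in the template PDF, along with its length, and generate a new single-page PDF.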

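I do get a PDF file; however, a small logo is not displayed, and Adobe Reader says there is a problem displaying the PDF correctly (Google Chrome and Edge simply do not display the logo, with no error message).

Finally, I tried fiddling with the xref table (adjusting the values manually), but got the same result.

Can anyone who knows PDF internals give me some input?

I am uploading the template PDF (template_pdf) and the other file from which I want to extract data to create a third PDF (using the template PDF, but with content from the other PDF). I am also uploading the PDF I built by hand, which displays with an error (it shows the data but not the JPEG logo).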

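Everything is here: https://drive.google.com/drive/folders/1tsGIbtbfwuATPQ6a_VPjnxLT4ozzNt0s?usp=sharing

I have been using HxD for everything (viewing the hex content and copying/pasting data).

Thanks in advance.

Edit: I am adding the code I currently use to generate the PDF. It produces an invalid PDF even though the xref table is fine (it has the correct positions). The code is very ugly, but for now I just want to get it working (rather than write pretty code).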

    static void Main(string[] args)
    {
        Console.WriteLine("Hello World!");

        var jpegLogo = File.ReadAllBytes(@"C:\test\Ginfes-Reboot\jpegLogo.raw");
        var pdfStream = File.ReadAllBytes(@"C:\test\Ginfes-Reboot\pdfStream.raw");
        using (BinaryWriter b = new BinaryWriter(
            File.Open(@"C:\test\Ginfes-Reboot\newPdf_newmethod.pdf", FileMode.Create)))
        {
            WritePDFAgain(b, jpegLogo, pdfStream);
        }
    }
    private static void WritePDFAgain(BinaryWriter b, byte[] jpegLogo,byte[] pdfStream)
    {
        List<long> offSets = new List<long>();
        string str = "%PDF-1.4" + "\n";
        var byteArr = Encoding.ASCII.GetBytes(str);
        b.Write(byteArr);
        byteArr = StringToByteArray("25E2E3CFD30A");
        b.Write(byteArr);
        offSets.Add(b.BaseStream.Position);//0
        str = "3 0 obj" + "\n" + "<</Type/XObject/ColorSpace/DeviceRGB/Subtype/Image/BitsPerComponent 8/Width 60/Length 3857/Height 60/Filter/DCTDecode>>stream" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(jpegLogo);
        b.Write(Encoding.ASCII.GetBytes("\n"));
        b.Write(Encoding.ASCII.GetBytes("endstream" +"\n" + "endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//1
        str = "4 0 obj" + "\n" + "<</Length " + pdfStream.Length + "/Filter/FlateDecode>>stream" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(pdfStream);
        b.Write(Encoding.ASCII.GetBytes("\n"));
        b.Write(Encoding.ASCII.GetBytes("endstream" + "\n" + "endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//2
        str = "1 0 obj" + "\n" + "<</Group<</Type/Group/CS/DeviceRGB/S/Transparency>>/Parent 5 0 R/Contents 4 0 R/Type/Page/Resources<</XObject<</img0 3 0 R>>/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]/ColorSpace<</CS/DeviceRGB>>/Font<</F1 2 0 R>>>>/MediaBox[0 0 595 936]>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//3
        str = "6 0 obj" + "\n" + "[1 0 R/XYZ 0 814 0]" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//4
        str = "2 0 obj" + "\n" + "<</BaseFont/Helvetica/Type/Font/Encoding/WinAnsiEncoding/Subtype/Type1>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//5
        str = "5 0 obj" + "\n" + "<</ITXT(2.1.7)/Type/Pages/Count 1/Kids[1 0 R]>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//6
        str = "7 0 obj" + "\n" + "<</Names[(JR_PAGE_ANCHOR_0_1) 6 0 R]>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//7
        str = "8 0 obj" + "\n" + "<</Dests 7 0 R>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//8
        str = "9 0 obj" + "\n" + "<</Names 8 0 R/Type/Catalog/ViewerPreferences<</PrintScaling/AppDefault>>/Pages 5 0 R>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        offSets.Add(b.BaseStream.Position);//9
        str = "10 0 obj" + "\n" + @"<</Creator(JasperReports \(nfs_novo\))/Producer(iText 2.1.7 by 1T3XT)/ModDate(D:20191211152903-03'00')/CreationDate(D:20191211152903-03'00')>>" + "\n";
        b.Write(Encoding.ASCII.GetBytes(str));
        b.Write(Encoding.ASCII.GetBytes("endobj" + "\n"));
        b.Write(Encoding.ASCII.GetBytes("xref" + "\n" + "0 11" + "\n"));
        b.Write(Encoding.ASCII.GetBytes("0000000000 65535 f " + "\n"));            
        b.Write(Encoding.ASCII.GetBytes("000000" + offSets.ElementAt(2) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("000000" + offSets.ElementAt(4) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("00000000"+ offSets.ElementAt(0) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("000000" + offSets.ElementAt(1) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("000000" + offSets.ElementAt(5) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("000000" + offSets.ElementAt(3) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("00000" + offSets.ElementAt(6) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("00000" + offSets.ElementAt(7) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("00000" + offSets.ElementAt(8) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("00000" + offSets.ElementAt(9) + " 00000 f " + "\n"));
        b.Write(Encoding.ASCII.GetBytes("trailer" + "\n" + "<</Root 9 0 R/ID [<10a2f7fd162aa44a268ebb6f31cc98c4><c36ebb9dc93cd9a72f229f618092eeb0>]/Info 10 0 R/Size 11>>" + "\n"));
        b.Write(Encoding.ASCII.GetBytes("startxref" + "\n" + (b.BaseStream.Position + 6) + "%%EOF" + "\n"));
    }

Files used: https://drive.google.com/drive/folders/1i3J-yioFvcoiakyc_Wi8ddn9g6Pxy7zd?usp=sharing

Answer 1

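You're most of the way there; the only problem with the PDF your sample generates is that the image resource referenced in pdfStream is named img10, whereas the name you assign when creating the resource dictionary is img0.

Here is some code that identifies the resource name that is actually referenced (using a regex on the page content), which you can then use when building the dictionary.

You need these additional using directives: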

using System.IO.Compression;
using System.Text.RegularExpressions;

This method decompresses the page content stream and matches the image resource name:

private static string GetImageResourceName(byte[] pdfStream) {
    using (MemoryStream ms = new MemoryStream(pdfStream)) {
        // DeflateStream expects raw deflate data, so skip the 2-byte zlib header
        ms.Seek(2, SeekOrigin.Begin);

        using (DeflateStream ds = new DeflateStream(ms, CompressionMode.Decompress)) {
            using (StreamReader sr = new StreamReader(ds)) {
                string contents = sr.ReadToEnd();

                // The PDF content stream operator that paints the image looks like: /img123 Do
                Match match = Regex.Match(contents, @"\b(img\d+)\s+Do\b");
                if (!match.Success) {
                    throw new InvalidDataException("No image reference (/imgN Do) found in the page content.");
                }
                return match.Groups[1].Value;
            }
        }
    }
}
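
As an aside, skipping the two header bytes by hand is only needed because DeflateStream reads raw deflate data. A minimal sketch of an alternative, assuming .NET 6 or later (where System.IO.Compression.ZLibStream is available): the class name ContentStreamInflater and both method names are made up for illustration, and Deflate exists only to build sample data.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class ContentStreamInflater
{
    // Inflates zlib-wrapped (FlateDecode) bytes into text.
    // ZLibStream consumes the 2-byte zlib header itself, so no Seek(2) is needed.
    public static string Inflate(byte[] flateData)
    {
        using (var input = new MemoryStream(flateData))
        using (var zlib = new ZLibStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(zlib, Encoding.Latin1))
        {
            return reader.ReadToEnd();
        }
    }

    // Compresses text into zlib-wrapped bytes; used here only to create test input.
    public static byte[] Deflate(string text)
    {
        using (var output = new MemoryStream())
        {
            using (var zlib = new ZLibStream(output, CompressionMode.Compress))
            {
                byte[] raw = Encoding.Latin1.GetBytes(text);
                zlib.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the compressor has closed the stream.
            return output.ToArray();
        }
    }
}
```

With this in place, the regex from the method above could run over ContentStreamInflater.Inflate(pdfStream) instead of the manually positioned DeflateStream. On older frameworks the Seek(2) approach remains the simpler option.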

Finally, you only need to change this one statement in the WritePDFAgain method:

str = String.Format(
    "1 0 obj\n<</Group<</Type/Group/CS/DeviceRGB/S/Transparency>>" 
    + "/Parent 5 0 R/Contents 4 0 R/Type/Page/Resources<</XObject" 
    + "<</{0} 3 0 R>>/ProcSet [/PDF /Text /ImageB /ImageC " 
    + "/ImageI]/ColorSpace<</CS/DeviceRGB>>/Font<</F1 2 0 R>>>>" 
    + "/MediaBox[0 0 595 936]>>\n", 
    GetImageResourceName(pdfStream)
);

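As per my disclaimer in the comments, this code only works for this very specific case and input data. It is by no means a general solution, but I gather that is acceptable to you.

I'll reiterate my point: if you intend not to use any external library for this, you will probably end up writing your own (albeit a very basic one).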
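
One piece such a basic library would likely need is the cross-reference section. In the PDF format each entry is exactly 20 bytes: a 10-digit zero-padded byte offset, a 5-digit generation number, the keyword n for objects in use (f marks a free entry), and a 2-byte line ending; startxref is followed by the byte offset of the xref keyword, with %%EOF on its own line. The posted WritePDFAgain marks every entry f, hard-codes the zero padding, and computes startxref after the trailer. Viewers that rebuild a damaged table may tolerate this, but stricter parsers might not. The following is only a sketch: XrefWriter and its method names are hypothetical, and it assumes the offsets are already ordered by object number.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class XrefWriter
{
    // Formats one in-use entry: 10-digit offset, generation 00000, 'n', space + LF (20 bytes).
    public static string FormatInUseEntry(long offset)
    {
        return offset.ToString("D10", CultureInfo.InvariantCulture) + " 00000 n \n";
    }

    // Builds the xref section, trailer and startxref footer.
    // offsetsByObjectNumber[i] must be the byte offset of object number i + 1.
    // xrefPosition must be the byte offset at which the "xref" keyword will be written.
    public static string BuildXrefAndTrailer(
        IReadOnlyList<long> offsetsByObjectNumber, long xrefPosition, string trailerDictionary)
    {
        var sb = new StringBuilder();
        sb.Append("xref\n");
        // The subsection header counts object 0 (the free-list head) as well.
        int entryCount = offsetsByObjectNumber.Count + 1;
        sb.Append("0 ").Append(entryCount.ToString(CultureInfo.InvariantCulture)).Append('\n');
        sb.Append("0000000000 65535 f \n");
        foreach (long offset in offsetsByObjectNumber)
        {
            sb.Append(FormatInUseEntry(offset));
        }
        sb.Append("trailer\n").Append(trailerDictionary).Append('\n');
        sb.Append("startxref\n");
        sb.Append(xrefPosition.ToString(CultureInfo.InvariantCulture));
        sb.Append("\n%%EOF\n");
        return sb.ToString();
    }
}
```

In WritePDFAgain this would mean capturing b.BaseStream.Position just before writing the xref keyword, reordering the collected offsets so that index 0 holds object 1, and writing Encoding.ASCII.GetBytes of the returned string in place of the hand-built lines.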

c#

pdf

binaryfiles

pdf-scraping