How to open and replace data from a PDF stream with the Apache PDFBox library in Java?

I am using Apache PDFBox 2.0.0 in my Java code (Java 1.6). How can I read, replace, and save back the data that my PDF stores in a section like this?

<stream> data here... <endstream>

My PDF file looks like this:

596 0 obj
<<
/Filter /FlateDecode
/Length 3739
>>
stream
xњ­[ЫnЬF}џoШ8эІАђhЮ/‰`@С%Hvќd-н“іXPJГ ...
endstream
endobj

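I found a way to decode this stream: I used the "WriteDecodedDoc" command from the pdfbox-app-1.8.10.jar API. So now I have two variants of the file, but I don't know how to work with this stream. The stream contains the footer and the header, where an image and text are placed.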
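For context, /FlateDecode is zlib/deflate compression, so decoding a stream means inflating its bytes and saving it means deflating them again. The following is a minimal sketch using only java.util.zip; the class name FlateRoundTrip and the sample content are made up for illustration and are not part of PDFBox. It also shows that the compressed length changes after an edit, which is one reason the /Length entry has to be rewritten by a PDF library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateRoundTrip {
    private static final Charset LATIN1 = Charset.forName("ISO-8859-1");

    // Compress raw bytes with zlib/deflate, the algorithm behind /FlateDecode.
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress zlib/deflate data; rejects truncated or corrupt input.
    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported deflate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up content stream standing in for the real footer stream.
        byte[] stored = deflate("BT (www.old-link.example) Tj ET".getBytes(LATIN1));
        String decoded = new String(inflate(stored), LATIN1);
        String edited = decoded.replace("www.old-link.example", "test.example");
        byte[] restored = deflate(edited.getBytes(LATIN1));
        System.out.println(decoded);
        System.out.println(edited);
        System.out.println("stored length: " + stored.length + ", new length: " + restored.length);
    }
}
```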

I checked my file with the PDFTextStripper class. It can see the necessary data in the stream, but I can't use this class to replace the data and save it back to the PDF file.

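I also tried replacing the text by simply opening the file as text, searching for the text, replacing it only inside the stream, and saving. But then I get a "Cannot extract the embedded font..." error. The main reason is that I lose the encoding. I tried changing the encoding, but it didn't help.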
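One plausible explanation for the lost encoding: a PDF contains binary streams (compressed content, embedded fonts), and reading them as text in a multi-byte charset such as UTF-8 silently replaces invalid byte sequences. The sketch below is only an illustration with a hypothetical class name; the first two sample bytes are the usual zlib header (0x78 0x9C), the rest are arbitrary. Even with a byte-preserving charset, editing the file as text would still leave /Length and the cross-reference offsets stale, so this is not a recommended fix.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class TextRoundTripDemo {
    // Returns true if the bytes are unchanged after decoding to a String and encoding back.
    public static boolean survivesTextRoundTrip(byte[] data, Charset charset) {
        byte[] back = new String(data, charset).getBytes(charset);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        // Sample binary data: zlib header 0x78 0x9C followed by two arbitrary bytes.
        byte[] binary = {0x78, (byte) 0x9C, (byte) 0xAD, 0x5B};
        System.out.println("UTF-8 keeps bytes: "
                + survivesTextRoundTrip(binary, Charset.forName("UTF-8")));
        System.out.println("ISO-8859-1 keeps bytes: "
                + survivesTextRoundTrip(binary, Charset.forName("ISO-8859-1")));
    }
}
```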

By the way, I can't use iText. I have to use a free library here.

Thanks for any solution.

EDIT:

After decoding I have a stream like this:

stream
/CS0 CS 0.412 0.416 0.423  SCN
0.25 w 
/GS0 gs
q 1 0 0 1 72 78.425 cm
0 0 m
468 0 l
S
Q
/Span <</Lang (en-US)/MCID 83 >>BDC 
BT
/T1_1 1 Tf
8 0 0 8 237.0609 64.8 Tm
[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ
/Span<</ActualText<FEFF0009>>> BDC 
( )Tj
endstream

I need to replace the link inside the stream with another link. This one:

[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ
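The operand of TJ is an array that mixes string fragments with numeric kerning adjustments, so the visible text only appears once the string elements are concatenated. A minimal sketch of that idea on the textual form of the operand follows; the class TjText is hypothetical, it assumes a simple font whose string bytes read as plain characters, and it ignores hex strings, octal escapes, and nested parentheses.

```java
public class TjText {
    // Concatenate the literal-string fragments "( ... )" of a TJ array and skip the numbers.
    // Simplification: an escaped character is copied literally, which is only
    // correct for \( \) and \\ (not for \n, octal codes, and so on).
    public static String joinStrings(String tjArray) {
        StringBuilder text = new StringBuilder();
        boolean inString = false;
        for (int i = 0; i < tjArray.length(); i++) {
            char c = tjArray.charAt(i);
            if (!inString) {
                if (c == '(') {
                    inString = true;
                }
            } else if (c == '\\' && i + 1 < tjArray.length()) {
                text.append(tjArray.charAt(++i));
            } else if (c == ')') {
                inString = false;
            } else {
                text.append(c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String operand = "[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]";
        System.out.println(joinStrings(operand));
    }
}
```

Any comparison against a target link has to use this concatenated form, not the individual fragments.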

EDIT 2: code

public static void replaceLinksInPdf(String filePath) {
        PDDocument document = null;
        try {
            document = PDDocument.load(new File(filePath));
            if (document.isEncrypted()) {
                document.setAllSecurityToBeRemoved(true);
                System.out.println(filePath + " Doc was decrypted");
            }

            // COSBase cosb = document.getDocument().getObjects().get(27);
            // e.g. this object holds a <stream> bytes <endstream> section of the PDF file.
            // In the debugger it looks like this:
            // document -> getDocument() -> objectPool #27 -> baseObject -> randomAccess -> bufferList (size 10) holds data that I can't open or work with
            // document -> getDocument() -> objectPool #27 -> baseObject -> items -> all of the object's PDF entries, but no stream section

            int pageNum = 0;
            for (PDPage page : document.getPages()) {
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                List<Object> tokens = parser.getTokens();
                List<Object> newTokens = new ArrayList<Object>();

                for (Object token : tokens) {
                    if (token instanceof Operator) {
                        COSDictionary dictionary = ((Operator) token).getImageParameters();
                        if (dictionary != null) {
                            System.out.println(dictionary.toString());
                        }
                    }
                    if (token instanceof Operator) {
                        Operator op = (Operator) token;
                        if (op.getName().equals("Tj")) {
                            // Tj contains 1 COSString
                            COSString previous = (COSString) newTokens.get(newTokens.size() - 1);
                            String string = previous.getString();
                            // check if string contains a necessary link
                            if (string.equals("www.linkhouldbehere.com")) {
                                // Tj takes a single string operand (an array is only valid for TJ),
                                // so the replacement must be a COSString
                                newTokens.set(newTokens.size() - 1, new COSString("test2.test2.com"));
                            }
                        } else if (op.getName().equals("TJ")) {
                            // TJ contains a COSArray with COSStrings and COSFloat (padding)
                            COSArray previous = (COSArray) newTokens.get(newTokens.size() - 1);
                            String string = "";
                            for (int k = 0; k < previous.size(); k++) {
                                Object arrElement = previous.getObject(k);
                                if (arrElement instanceof COSString) {
                                    COSString cosString = (COSString) arrElement;
                                    String content = cosString.getString();
                                    string += content;
                                }
                            }
                            // check if string contains a necessary link
                            if (string.equals("www.linkhouldbehere.com")) {
                                COSArray newLink = new COSArray();
                                newLink.add(new COSString("test.test.com"));
                                newTokens.set(newTokens.size() - 1, newLink);
                            } else if (string.startsWith("www.linkhouldbehere.com")) {
                                // some magic here to remove all indents and show new link from beginning.
                                // no rules. Just for test and it works here
                                COSArray newLink = (COSArray) newTokens.get(newTokens.size() - 1);
                                int size = newLink.size();
                                float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                for (int i = 0; i < size - 4; i++) {
                                    newLink.remove(0);
                                }
                                newLink.set(0, new COSString("test.test.com"));
                                // kerning offset that pushes the date further to the right; the value is a guess and should be checked
                                newLink.set(1, new COSFloat(f - 8000));
                                newTokens.set(newTokens.size() - 1, newLink);
                            }
                        }
                    }
                    newTokens.add(token);
                }

                // save replaced content inside a page
                PDStream newContents = new PDStream(document);
                OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
                ContentStreamWriter writer = new ContentStreamWriter(out);
                writer.writeTokens(newTokens);
                out.close();
                page.setContents(newContents);

                // replace all links that have a pop-up line
                pageNum++;
                List<PDAnnotation> annotations = page.getAnnotations();
                for (PDAnnotation annotation : annotations) {
                    PDAnnotation annot = annotation;
                    if (annot instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annot;
                        PDAction action = link.getAction();
                        if (action instanceof PDActionURI) {
                            PDActionURI uri = (PDActionURI) action;
                            String newURI = "www.test1.test1.com";
                            uri.setURI(newURI);
                        }
                    }
                }
            }
            // save file
            document.save(filePath.replace("file", "file_result"));
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

EDIT 3.

The PDF contains 660 0 obj, which has the necessary link inside:

660 0 obj
<<
/BBox [0.0 792.0 612.0 0.0]
/Length 792
/Matrix [1.0 0.0 0.0 1.0 0.0 0.0]
/Resources <<
/ColorSpace <<
/CS0 [/ICCBased 21 0 R]
>>
/ExtGState <<
/GS0 22 0 R
>>
/Font <<
/T1_0 834 0 R
/T1_1 835 0 R
/T1_2 836 0 R
>>
/ProcSet [/PDF /Text]
>>
/Subtype /Form
>>
stream
/CS0 CS 0.412 0.416 0.423  SCN
0.25 w 
/GS0 gs
q 1 0 0 1 72 78.425 cm
0 0 m
468 0 l
S
Q
/Artifact <</O /Layout >>BDC 
BT
/CS0 cs 0.412 0.416 0.423  scn
/T1_0 1 Tf
0 Tc 0 Tw 0 Ts 100 Tz 0 Tr 8 0 0 8 72 64.8 Tm
[(Visit )35(O)7(ur site R)23.1(esear)15.1(ch Manager )20.1(on )20(the )12(web at )]TJ
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_1 1 Tf
8 0 0 8 237.0609 64.8 Tm
[(www)11(.lin)-14.9(kshou)-10(ldbeh)-8(ere)-7.9(ninechars)15.1(.com)]TJ
/Span<</ActualText<FEFF0009>>> BDC 
( )Tj
EMC 
31.954 0 Td
[(A)15(ugust 7)45.1(,)-5( 2015)]TJ
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_0 1 Tf
8 0 0 8 540 64.8 Tm
( )Tj
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_2 1 Tf
7 0 0 7 72 55.3 Tm
[(1 2015 )29(CCH Incorporated and its af7liates. )38.3(All rights r)12(eserv)8.1(ed.)]TJ
ET
EMC 

endstream

I found only one place in the PDF file that references it. It is in 45 0 obj:

/XObject <<
    /Fm0 660 0 R
    /Fm1 661 0 R
>>

The full text of that object:

45 0 obj
<<
/ArtBox [0.0 0.0 612.0 792.0]
/BleedBox [0.0 0.0 612.0 792.0]
/Contents 658 0 R
/CropBox [0.0 0.0 612.0 792.0]
/Group 659 0 R
/MediaBox [0.0 0.0 612.0 792.0]
/Parent 13 0 R
/Resources <<
/ColorSpace <<
/CS0 [/ICCBased 21 0 R]
>>
/ExtGState <<
/GS0 22 0 R
/GS1 23 0 R
>>
/Font <<
/T1_0 597 0 R
/T1_1 26 0 R
/T1_2 28 0 R
/T1_3 25 0 R
>>
/ProcSet [/PDF /Text]
/XObject <<
/Fm0 660 0 R
/Fm1 661 0 R
>>
>>
/Rotate 0
/StructParents 22
/Tabs /W
/Thumb 662 0 R
/TrimBox [0.0 0.0 612.0 792.0]
/Type /Page
/Annots []
>>
endobj

So the question is: can I get this 660 0 obj and process it with PDFBox? It looks like PDFStreamParser knows nothing about this 660 0 object. Thanks.

This is for PDFBox 2.0.0-SNAPSHOT. Here is my code; it works fine for me for replacing links.

Many thanks to Tilman Hausherr for his help.

String filePath = "d:\\pdf\\file1.pdf";

...

public static void replaceLinksInPdf(String filePath) {
        PDDocument document = null;
        try {
            document = PDDocument.load(new File(filePath));
            // Decrypt a document
            if (document.isEncrypted()) {
                document.setAllSecurityToBeRemoved(true);
                System.out.println(filePath + " Doc was decrypted");
            }

            // Replace all links in footers and headers that live in Form XObjects with /ProcSet [/PDF /Text].
            // Note: these forms (and pattern objects too) can have their own resources,
            // i.e. they can contain further Form XObjects or patterns.
            // If so, they have to be processed recursively.
            for (int pageNum = 0; pageNum < document.getPages().getCount(); pageNum++) {
                List<Object> newPdxTokens = new ArrayList<Object>();
                // Get all XObjects from the page
                Iterable<COSName> xobjs = document.getPage(pageNum).getResources().getXObjectNames();
                for (COSName xobj : xobjs) {
                    boolean isHasTextStream = false;
                    PDXObject pdxObject = document.getPage(pageNum).getResources().getXObject(xobj);
                    // Skip the XObject unless its dictionary contains a '/ProcSet' entry with '/Text'.
                    // isXobjectHasTextFieldInPdf searches nested dictionaries recursively.
                    if (pdxObject.getCOSObject() instanceof COSDictionary) {
                        isHasTextStream = isXobjectHasTextFieldInPdf((COSDictionary) pdxObject.getCOSObject());
                    }

                    if (pdxObject instanceof PDFormXObject && isHasTextStream) {
                        // Set stream from pdxObject
                        PDStream stream = pdxObject.getStream();
                        PDFStreamParser streamParser = new PDFStreamParser(stream.toByteArray());
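                        streamParser.parse();
                        // start from an empty list so tokens of a previous XObject are not written into this one
                        newPdxTokens.clear();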
                        for (Object token : streamParser.getTokens()) {
                            if (token instanceof Operator) {
                                Operator op = (Operator) token;
                                if (op.getName().equals("Tj")) {
                                    // Tj contains 1 COSString
                                    COSString previous = (COSString) newPdxTokens.get(newPdxTokens.size() - 1);
                                    String string = previous.getString();
                                    // here can be any filters for checking a necessary string
                                    if (string.equals("www.testlink.com")) {
                                        // Tj takes a single string operand (an array is only valid for TJ),
                                        // so the replacement must be a COSString
                                        newPdxTokens.set(newPdxTokens.size() - 1, new COSString("test.test.com"));
                                    }
                                } else if (op.getName().equals("TJ")) {
                                    // TJ contains a COSArray with COSStrings and COSFloat (padding)
                                    COSArray previous = (COSArray) newPdxTokens.get(newPdxTokens.size() - 1);
                                    String string = "";
                                    for (int k = 0; k < previous.size(); k++) {
                                        Object arrElement = previous.getObject(k);
                                        if (arrElement instanceof COSString) {
                                            COSString cosString = (COSString) arrElement;
                                            String content = cosString.getString();
                                            string += content;
                                        }
                                    }
                                    // here can be any filters for checking a necessary string
                                    // check if string contains a necessary link
                                    if (string.equals("www.testlink.com")) {
                                        COSArray newLink = new COSArray();
                                        newLink.add(new COSString("test.test.com"));
                                        newPdxTokens.set(newPdxTokens.size() - 1, newLink);
                                    } else if (string.startsWith("www.testlink.com")) {
                                        // this code should be changed. It can have some indenting problems that depend on COSFloat values
                                        COSArray newLink = (COSArray) newPdxTokens.get(newPdxTokens.size() - 1);
                                        int size = newLink.size();
                                        float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                        for (int i = 0; i < size - 4; i++) {
                                            newLink.remove(0);
                                        }
                                        newLink.set(0, new COSString("test.test.com"));
                                        // number for indenting from right place. Should be checked.
                                        newLink.set(1, new COSFloat(f - 8000));
                                        newPdxTokens.set(newPdxTokens.size() - 1, newLink);
                                    }
                                }
                            }
                            // save tokens to a temporary List
                            newPdxTokens.add(token);
                        }
                        // save the replaced data back to the stream
                        OutputStream out = stream.createOutputStream();
                        ContentStreamWriter writer = new ContentStreamWriter(out);
                        writer.writeTokens(newPdxTokens);
                        out.close();
                    }
                }
            }

            // replace data from any text stream from pdf. XObjects not included.
            int pageNum = 0;
            for (PDPage page : document.getPages()) {
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                // Get all tokens from the page
                List<Object> tokens = parser.getTokens();
                // Create a temporary List
                List<Object> newTokens = new ArrayList<Object>();

                for (Object token : tokens) {
                    if (token instanceof Operator) {
                        COSDictionary dictionary = ((Operator) token).getImageParameters();
                        if (dictionary != null) {
                            System.out.println(dictionary.toString());
                        }
                    }
                    if (token instanceof Operator) {
                        Operator op = (Operator) token;
                        if (op.getName().equals("Tj")) {
                            // Tj contains 1 COSString
                            COSString previous = (COSString) newTokens.get(newTokens.size() - 1);
                            String string = previous.getString();
                            // here can be any filters for checking a necessary string
                            if (string.equals("www.testlink.com")) {
                                // Tj takes a single string operand (an array is only valid for TJ),
                                // so the replacement must be a COSString
                                newTokens.set(newTokens.size() - 1, new COSString("test2.test2.com"));
                            }
                        } else if (op.getName().equals("TJ")) {
                            // TJ contains a COSArray with COSStrings and COSFloat (padding)
                            COSArray previous = (COSArray) newTokens.get(newTokens.size() - 1);
                            String string = "";
                            for (int k = 0; k < previous.size(); k++) {
                                Object arrElement = previous.getObject(k);
                                if (arrElement instanceof COSString) {
                                    COSString cosString = (COSString) arrElement;
                                    String content = cosString.getString();
                                    string += content;
                                }
                            }
                            // here can be any filters for checking a necessary string
                            if (string.equals("www.testlink.com")) {
                                COSArray newLink = new COSArray();
                                newLink.add(new COSString("test.test.com"));
                                newTokens.set(newTokens.size() - 1, newLink);
                            } else if (string.startsWith("www.testlink.com")) {
                                // this code should be changed. It can have some indenting problems that depend on COSFloat values
                                COSArray newLink = (COSArray) newTokens.get(newTokens.size() - 1);
                                int size = newLink.size();
                                float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                for (int i = 0; i < size - 4; i++) {
                                    newLink.remove(0);
                                }
                                newLink.set(0, new COSString("test.test.com"));
                                // number for padding from right place. Should be checked.
                                newLink.set(1, new COSFloat(f - 8000));
                                newTokens.set(newTokens.size() - 1, newLink);
                            }
                        }
                    }
                    // save tokens to a temporary List
                    newTokens.add(token);
                }
                // save the replaced data back to the document's stream
                PDStream newContents = new PDStream(document);
                OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
                ContentStreamWriter writer = new ContentStreamWriter(out);
                writer.writeTokens(newTokens);
                out.close();

                // save content
                page.setContents(newContents);

                // replace all links that have a pop-up line (It does not affect the visible text)
                pageNum++;
                List<PDAnnotation> annotations = page.getAnnotations();
                for (PDAnnotation annotation : annotations) {
                    PDAnnotation annot = annotation;
                    if (annot instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annot;
                        PDAction action = link.getAction();
                        if (action instanceof PDActionURI) {
                            PDActionURI uri = (PDActionURI) action;
                            String newURI = "www.test1.test1.com";
                            uri.setURI(newURI);
                        }
                    }
                }
            }

            // save document
            document.save(filePath.replace("file", "file_result"));
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

An additional method that lets the code process only text streams and skip image streams. It is called from the main method "replaceLinksInPdf(String filePath)":

        // Check if COSDictionary has '/ProcSet [/PDF /Text]' string in the stream
        private static boolean isXobjectHasTextFieldInPdf(COSDictionary dictionary) {
            boolean isHasTextField = false;
            for (COSBase cosBase : dictionary.getValues()) {
                // go to a recursion because COSDictionary can have COSDictionaries inside
                if (cosBase instanceof COSDictionary) {
                    COSDictionary cosDictionaryNew = (COSDictionary) cosBase;
                    // check if '/ProcSet' has '/Text' param
                    if (cosDictionaryNew.containsKey(COSName.PROC_SET)) {
                        COSBase procSet = cosDictionaryNew.getDictionaryObject(COSName.PROC_SET);
                        if (procSet instanceof COSArray) {
                            for (COSBase procSetIterator : ((COSArray) procSet)) {
                                if (procSetIterator instanceof COSName
                                        && ((COSName) procSetIterator).getName().equals("Text")) {
                                    return true;
                                }
                            }
                        } else if (procSet instanceof COSString && ((COSString) procSet).getString().equals("Text")) {
                            return true;
                        }
                    }
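                    // go to the COSDictionary children; return on the first match so that a later
                    // sibling dictionary without '/Text' cannot overwrite a positive result
                    if (isXobjectHasTextFieldInPdf(cosDictionaryNew)) {
                        return true;
                    }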
                }
            }
            return isHasTextField;
        }
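
A note on the "f - 8000" value used in the TJ branches of replaceLinksInPdf: as far as I understand the PDF specification, a number inside a TJ array is given in thousandths of a text space unit and is subtracted from the current horizontal position, so a large negative number moves the following text to the right. The sketch below only does that arithmetic; the class and method names are hypothetical, and it assumes horizontal writing with zero character and word spacing.

```java
public class TjAdjustment {
    // Shift in text space caused by one numeric TJ element (horizontal writing).
    // The number is in thousandths of a text space unit and is subtracted from the
    // position, scaled by the font size (Tf) and the horizontal scaling (Tz, percent).
    public static double textSpaceShift(double adjustment, double fontSize,
                                        double horizontalScalingPercent) {
        return -(adjustment / 1000.0) * fontSize * (horizontalScalingPercent / 100.0);
    }

    public static void main(String[] args) {
        // Values taken from the footer stream: "/T1_1 1 Tf", "100 Tz", "8 0 0 8 ... Tm".
        double shift = textSpaceShift(-8000, 1, 100);
        // The text matrix scales x by 8, so multiply to get user space units.
        double userSpace = shift * 8;
        System.out.println(shift + " text space units = " + userSpace + " user space units");
    }
}
```

Seen this way, the constant depends on the font size and text matrix of each file, which is why it should be computed rather than hard-coded.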

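This is just a test variant for my project. I will refactor this code according to my project's rules. You should change the replacement parts to fit your needs. Also, I have been using the PDFBox 2.0.0 library for only about a week, so maybe someone can find a simpler way to write some of this code. Feel free to do a code review and post a more suitable variant. Thanks.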

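P.S. I have tested 40 PDFs, and only 2 of them needed the deep, recursive processing. All 40 files can be opened, are readable, and look like the previous versions except for the links.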