Add form XObject content from resources to the page content stream using PDFBox?

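I have form XObjects under page1 -> Resources -> XObject -> Fm0, Fm1, Fm2, ...

So the content is not available as a direct content stream under Contents. I therefore want to move the content stream from Fm0 to page1 -> Contents.

When moving the content stream like this, we also have to transfer or copy the resources Fm0 uses to the page-level resources:

1. The content stream needs to be copied under the page-level Contents.

2. The color space objects need to be copied to page1 -> Resources -> ColorSpace.

3. The ExtGState objects need to be copied to page1 -> Resources -> ExtGState.

4. The properties need to be copied under page1 -> Resources (this entry has to be created from scratch).

I tried some code: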

private PDDocument parseFormXobject(PDDocument document) throws IOException {
PDDocument newdocument = new PDDocument();
for (int pg_ind = 0; pg_ind < document.getNumberOfPages(); pg_ind++) {
    List<Object> tokens1 = (List<Object>) (getTokens(document)).get(pg_ind);
    PDStream newContents = new PDStream(document);
    OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
    ContentStreamWriter writer = new ContentStreamWriter(out);

    PDPage pageinner = document.getPage(pg_ind);
    PDResources resources = pageinner.getResources();
    PDResources new_resources = new PDResources();
    new_resources = resources;

    COSDictionary fntdict = new COSDictionary();
    COSDictionary imgdict = new COSDictionary();
    COSDictionary extgsdict = new COSDictionary();
    COSDictionary colordict = new COSDictionary();
    int img_count = 0;
    for (COSName xObjectName : resources.getXObjectNames()) {
        PDXObject  xObject = resources.getXObject(xObjectName);
        if (xObject instanceof PDFormXObject) {

            PDFStreamParser parser = new PDFStreamParser(((PDFormXObject) xObject).getContentStream());
            parser.parse();
            List<Object>  tokens3 = parser.getTokens();
            int ind =0;
            System.out.println(xObjectName.getName());
                for (COSName colorname :((PDFormXObject) xObject).getResources().getColorSpaceNames())
                {
                    COSName new_name = COSName.getPDFName(colorname.getName()+"_Fm"+img_count);
                    PDColorSpace pdcolor = ((PDFormXObject) xObject).getResources().getColorSpace(colorname);
                    colordict.setItem(new_name,pdcolor);
                }
                for (COSName fontName :((PDFormXObject) xObject).getResources().getFontNames() )
                {
                    COSName new_name = COSName.getPDFName(fontName.getName()+"_Fm"+img_count);
                    PDFont font =((PDFormXObject) xObject).getResources().getFont(fontName);
                    font.getCOSObject().setItem(COSName.NAME, new_name);
                    fntdict.setItem(new_name,font);
                }
                for (COSName ExtGSName :((PDFormXObject) xObject).getResources().getExtGStateNames() )
                {
                    COSName new_name = COSName.getPDFName(ExtGSName.getName()+"_Fm"+img_count);
                    PDExtendedGraphicsState ExtGState =((PDFormXObject) xObject).getResources().getExtGState(ExtGSName);
                    ExtGState.getCOSObject().setItem(COSName.NAME, new_name);
                    extgsdict.setItem(new_name,ExtGState);
                }
                imgdict.setItem(xObjectName, xObject);
                for (COSName Imgname :((PDFormXObject) xObject).getResources().getXObjectNames() )
                {
                    COSName new_name = COSName.getPDFName(Imgname.getName()+"_Fm"+img_count);
                    xObject.getCOSObject().setItem(COSName.NAME, new_name);
                    PDXObject img =((PDFormXObject) xObject).getResources().getXObject(Imgname);
                    imgdict.setItem(new_name, img);
                }

                    for (int k=0; k< tokens1.size(); k++) {
                        if ( ((tokens1.get(k) instanceof Operator) && ((Operator)tokens1.get(k)).getName().toString().equals("Do"))
                                && ((COSName)tokens1.get(k-1)).getName().toString().equals(xObjectName.getName().toString()) ) {
                            System.out.println(tokens1.get(k).toString());
                            tokens1.remove(k-1);
                            tokens1.remove(k-1);
                            ind =k-1;
                            break;
                        }
                    }
                for (int k=0; k< tokens3.size(); k++) {
                    if ( (tokens3.size() > k+1) && (tokens3.get(k+1) instanceof Operator) && (((Operator)tokens3.get(k+1)).getName().toString().equals("Do")
                            || ((Operator)tokens3.get(k+1)).getName().toString().equals("gs")
                            || ((Operator)tokens3.get(k+1)).getName().toString().equals("cs")  ) ) {
                        COSName new_name = COSName.getPDFName( ((COSName) tokens3.get(k)).getName()+"_Fm"+img_count );
                        tokens1.add(ind+k, new_name );
                    }else if ( (tokens3.size() > k+2) && (tokens3.get(k+2) instanceof Operator)
                            && ((Operator)tokens3.get(k+2)).getName().toString().equals("Tf") ) {
                        COSName new_name = COSName.getPDFName( ((COSName) tokens3.get(k)).getName()+"_Fm"+img_count );
                        tokens1.add(ind+k, new_name );
                    }else
                        tokens1.add(ind+k,tokens3.get(k));
                }

                img_count +=1;
        }else
            imgdict.setItem(xObjectName, xObject);
    }
    for (COSName fontName :new_resources.getFontNames() )
    {
        PDFont font =new_resources.getFont(fontName);
        fntdict.setItem(fontName,font);
    }
    for (COSName ExtGSName :new_resources.getExtGStateNames() )
    {
        PDExtendedGraphicsState extg =new_resources.getExtGState(ExtGSName);
        extgsdict.setItem(ExtGSName,extg);
    }
    for (COSName colorname :new_resources.getColorSpaceNames() )
    {
        PDColorSpace color =new_resources.getColorSpace(colorname);
        colordict.setItem(colorname,color);
    }
    resources.getCOSObject().setItem(COSName.EXT_G_STATE,extgsdict);
    resources.getCOSObject().setItem(COSName.FONT,fntdict);
    resources.getCOSObject().setItem(COSName.XOBJECT,imgdict);
    resources.getCOSObject().setItem(COSName.COLORSPACE, colordict);

    writer.writeTokens(tokens1);
    out.close();
    document.getPage(pg_ind).setContents(newContents);
    document.getPage(pg_ind).setMediaBox(PDFUtils.Media_box);
    document.getPage(pg_ind).setResources(resources);
    newdocument.addPage(document.getPage(pg_ind));
}
newdocument.save("D:/Testfiles/stu.pdf");
return newdocument;
}

But I cannot get the exact page graphics. I am losing something.

input pdf

output pdf

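There are multiple issues, some in the details, some conceptual.

Wrapping in a save-graphics-state/restore-graphics-state envelope

When you draw an XObject, graphics state changes made inside that XObject do not change your current graphics state. To make sure this is still the case after copying the XObject instructions into the page content stream, you have to wrap that block in a save-graphics-state/restore-graphics-state envelope (q ... Q). You can do so by adding these two lines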

tokens1.add(ind++, Operator.getOperator("q"));
tokens1.add(ind, Operator.getOperator("Q"));

right before your instruction copying loop

for (int k=0; k< tokens3.size(); k++) {
    ...
}
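
To see what those two lines plus the copy loop do to the token list, here is a minimal, PDFBox-independent sketch. It is only an illustration: tokens are modelled as plain strings instead of PDFBox Operator/COSName objects, and the class and method names (EnvelopeSketch, inline) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class EnvelopeSketch {

    /**
     * Replaces the "/Name Do" pair starting at nameIndex with
     * "q <form tokens> Q". Tokens are plain strings here (an assumption
     * of this sketch, not the PDFBox token model).
     */
    static List<String> inline(List<String> pageTokens, int nameIndex, List<String> formTokens) {
        List<String> result = new ArrayList<>(pageTokens);
        result.remove(nameIndex);   // the name operand, e.g. /Fm0
        result.remove(nameIndex);   // the Do operator that followed it
        int ind = nameIndex;
        result.add(ind++, "q");     // save graphics state
        result.add(ind, "Q");       // restore graphics state
        // Every form token is inserted in front of the Q added above.
        for (int k = 0; k < formTokens.size(); k++) {
            result.add(ind + k, formTokens.get(k));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> page = List.of("1", "w", "/Fm0", "Do", "0", "g");
        List<String> form = List.of("1", "0", "0", "rg");
        System.out.println(inline(page, 2, form));
    }
}
```

Because ind is incremented after adding q, the copied instructions land between q and Q, so color or line-width changes made by the form no longer leak into the page instructions that follow.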

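Coordinate system

You assume that the coordinate system inside the XObject equals that of the page. This need not be the case. XObjects may have a Matrix entry indicating a transformation to apply.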
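
One way to honor the Matrix entry is to emit an equivalent cm instruction right after the opening q. The following is a rough sketch under stated assumptions: the six matrix values are passed in as a double array (in PDFBox 2.x you could read them from PDFormXObject.getMatrix()), tokens are plain strings, and MatrixSketch and cmTokens are hypothetical names.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MatrixSketch {

    private static final double[] IDENTITY = {1, 0, 0, 1, 0, 0};

    /** Formats a number without exponent notation, which PDF does not allow. */
    static String num(double v) {
        return BigDecimal.valueOf(v).stripTrailingZeros().toPlainString();
    }

    /**
     * Returns the tokens "a b c d e f cm" for the given form matrix,
     * or an empty list if the matrix is absent or the identity.
     */
    static List<String> cmTokens(double[] matrix) {
        List<String> tokens = new ArrayList<>();
        if (matrix == null || Arrays.equals(matrix, IDENTITY)) {
            return tokens; // nothing to concatenate
        }
        if (matrix.length != 6) {
            throw new IllegalArgumentException("a PDF matrix has six entries");
        }
        for (double v : matrix) {
            tokens.add(num(v));
        }
        tokens.add("cm");
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(cmTokens(new double[] {0.5, 0, 0, 0.5, 10, 20}));
    }
}
```

The returned tokens would be inserted directly after q and before the copied form instructions, so that the transformation is undone again by the closing Q.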

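Bounding box

You do not restrict the area the XObject instructions draw in. XObjects, however, have a BBox entry indicating a box to which the output is to be clipped.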
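
The clipping can be reproduced with a rectangle path that is turned into the clip path (re W n), emitted inside the q ... Q envelope and after any Matrix has been applied, because the BBox is given in the form's coordinate system. A minimal sketch, assuming the four BBox values are available as a double array (PDFBox 2.x exposes them via PDFormXObject.getBBox()); tokens are plain strings and BBoxSketch and clipTokens are hypothetical names:

```java
import java.math.BigDecimal;
import java.util.List;

final class BBoxSketch {

    /** Formats a number without exponent notation. */
    static String num(double v) {
        return BigDecimal.valueOf(v).stripTrailingZeros().toPlainString();
    }

    /**
     * Builds "x y width height re W n" from a BBox given as two
     * diagonally opposite corners [x1 y1 x2 y2] in form space.
     */
    static List<String> clipTokens(double[] bbox) {
        if (bbox == null || bbox.length != 4) {
            throw new IllegalArgumentException("BBox needs four numbers");
        }
        double x = Math.min(bbox[0], bbox[2]);
        double y = Math.min(bbox[1], bbox[3]);
        double width = Math.abs(bbox[2] - bbox[0]);
        double height = Math.abs(bbox[3] - bbox[1]);
        // re appends the rectangle, W marks it as clip path, n ends the path without painting
        return List.of(num(x), num(y), num(width), num(height), "re", "W", "n");
    }

    public static void main(String[] args) {
        System.out.println(clipTokens(new double[] {0, 0, 200, 100}));
    }
}
```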

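Optional content

XObjects may also have an OC entry indicating their optional content membership. Such a membership needs to be transformed into equivalent optional content markup.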
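
In a content stream, optional content is marked as a marked-content sequence of the form /OC /name BDC ... EMC, where name is a key in the Properties resource dictionary that points to the optional content group or membership dictionary. The sketch below only shows the token wrapping; registering the name (here the made-up MC_Fm0) in the page's Properties dictionary is left out, tokens are plain strings, and OptionalContentSketch and wrap are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class OptionalContentSketch {

    /**
     * Wraps already inlined form instructions in "/OC /<propertyName> BDC ... EMC".
     * propertyName must also be added to the page's Properties resources,
     * mapped to the object the form's OC entry referenced (not shown here).
     */
    static List<String> wrap(String propertyName, List<String> inlinedTokens) {
        List<String> out = new ArrayList<>();
        out.add("/OC");
        out.add("/" + propertyName);
        out.add("BDC");
        out.addAll(inlinedTokens);
        out.add("EMC");
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wrap("MC_Fm0", List.of("q", "0", "0", "10", "10", "re", "f", "Q")));
    }
}
```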

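Marked content, structure tree

XObjects may also reference the structure parent tree via their StructParent or StructParents entries. To keep the structural integrity of the document, you may have to update the structure tree considerably.

Grouping

XObjects may contain a Group entry indicating that their content is to be treated as a group. In particular in the case of transparency groups, this makes transparency-related features behave differently from the same instructions copied into the page content.

Unless you fully analyze the effect of every bit of drawn content that has some transparency and rewrite the instructions drawing it case by case, copying instructions from an XObject into the page content stream will cause considerable differences in the displayed content.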

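Usage

Your code assumes that each XObject is used exactly once in the page content stream. This need not be the case; it may be used multiple times or not at all.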
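
Before replacing anything, it can help to count how often each XObject name is actually invoked. A small sketch, again with tokens modelled as plain strings rather than PDFBox objects; UsageSketch and countDoUsages are hypothetical names:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class UsageSketch {

    /**
     * Counts "/Name Do" pairs per name. Names that never occur are simply
     * absent from the map, which corresponds to an unused XObject.
     */
    static Map<String, Integer> countDoUsages(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Start at 1 so that tokens.get(k - 1) is always valid.
        for (int k = 1; k < tokens.size(); k++) {
            String operand = tokens.get(k - 1);
            if ("Do".equals(tokens.get(k)) && operand.startsWith("/")) {
                counts.merge(operand.substring(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> page = List.of("/Fm0", "Do", "q", "/Fm1", "Do", "Q", "/Fm0", "Do");
        System.out.println(countDoUsages(page));
    }
}
```

A form used twice has to be inlined at both places, which the single-match loop with break in the question does not do, and a form that is never invoked must not be inlined at all.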


References

In a comment you asked for references. It is all in the PDF specification ISO 32000, and already in the publicly available ISO 32000-1:

8.10 Form XObjects

A form XObject is a PDF content stream that is a self-contained description of any sequence of graphics objects (including path objects, text objects, and sampled images). A form XObject may be painted multiple times—either on several pages or at several locations on the same page—and produces the same results each time, subject only to the graphics state at the time it is invoked.

Thus, a form XObject may be used any number of times on a given page.

When the Do operator is applied to a form XObject, a conforming reader shall perform the following tasks:

a) Saves the current graphics state, as if by invoking the q operator (see 8.4.4, "Graphics State Operators")

b) Concatenates the matrix from the form dictionary’s Matrix entry with the current transformation matrix (CTM)

c) Clips according to the form dictionary’s BBox entry

d) Paints the graphics objects specified in the form’s content stream

e) Restores the saved graphics state, as if by invoking the Q operator (see 8.4.4, "Graphics State Operators")

Thus, when copying into the page content stream, you should equivalently use a q/Q envelope and respect the Matrix and BBox entries.

8.11.3.3 Optional Content in XObjects and Annotations

In addition to marked content within content streams, form XObjects and image XObjects (see 8.8, "External Objects") and annotations (see 12.5, "Annotations") may contain an OC entry, which shall be an optional content group or an optional content membership dictionary.

A form or image XObject's visibility shall be determined by the state of the group or those of the groups referenced by the membership dictionary in conjunction with its P (or VE) entry, along with the current visibility state in the context in which the XObject is invoked (that is, whether objects are visible in the contents stream at the place where the Do operation occurred).

Thus, when copying into the page content, respect this optional content information.

11.6.6 Transparency Group XObjects

A transparency group is represented in PDF as a special type of group XObject (see “Group XObjects”) called a transparency group XObject. A group XObject is in turn a type of form XObject, distinguished by the presence of a Group entry in its form dictionary (see “Form Dictionaries”). The value of this entry is a subsidiary group attributes dictionary defining the properties of the group. The format and meaning of the dictionary’s contents shall be determined by its group subtype, which is specified by the dictionary’s S entry. The entries for a transparency group (subtype Transparency) are shown in Table 147.

...

Annex L

Thus, copying from a transparency group may change the appearance considerably.

14.7.4.3 PDF Objects as Content Items

When a structure element’s content includes an entire PDF object, such as an XObject or an annotation, that is associated with a page but not directly included in the page’s content stream, the object shall be identified in the structure element’s K entry by an object reference dictionary (see Table 325).

...

14.7.4.4 Finding Structure Elements from Content Items

...

To locate the relevant parent tree entry, each object or content stream that is represented in the tree shall contain a special dictionary entry, StructParent or StructParents (see Table 326). Depending on the type of content item, this entry may appear in the page object of a page containing marked-content sequences, in the stream dictionary of a form or image XObject, in an annotation dictionary, or in any other type of object dictionary that is included as a content item in a structure element.

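This and further information in the same chapter should make clear that the structure information has to be overhauled after copying from an XObject into the page content.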