How to extract label text from a push button using Apache PDFBox?

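Suppose I have managed to cast a PDTerminalField to an instance of PDPushButton. Looking at the provided API, though, I cannot figure out how to extract the label of that button.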

I have not added code because the application is too lengthy. Here is a sample PDF.

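(Thanks to @Tilman for the correction.) There is indeed such a property: you can access it on the field's widget annotation via getAppearanceCharacteristics().getNormalCaption(). This property is optional, however, and its content is not guaranteed to match the visual appearance of the button, because the appearance stream may contain different information. A combined strategy of querying the property and reading the appearance stream may therefore be necessary.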

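The appearance stream of a button in a PDF can contain any number of graphics and text drawing instructions that paint the button, but this stream is not necessarily easy to read or parse. For example, for the sample file provided by the OP, the stream looks like this: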

1 0.75 0.666656 rg
0 0 72 20 re
f
q
1 1 70 18 re
W
n
0 g
BT
/HeBo 12 Tf
0 g
6.696 5.857 Td
(My ) Tj
19.992 0 Td
(Button) Tj
ET
Q

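The button text, "My Button", is already visible here, but clearly some parsing is required to retrieve it. In particular, the text encoding need not be ASCII-derived as it happens to be in this case, so text extraction has to be applied to the stream.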
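To make the structure of that stream concrete, here is a deliberately naive sketch that pulls the operands of the `Tj` operator out of the stream text with a regular expression. It is not a general solution: it assumes literal strings in an ASCII-compatible encoding, ignores `TJ`, `'`, `"` and hex strings, and only undoes the simplest escapes. The class name `NaiveTjExtractor` is made up for this illustration and is not part of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a PDFBox class.
final class NaiveTjExtractor {
    // A literal string "( ... )" followed by the Tj operator.
    // Inside the string, either an escaped character or any character
    // other than a backslash or a parenthesis is accepted.
    private static final Pattern SHOW_TEXT =
            Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj");

    static String extract(String contentStream) {
        StringBuilder text = new StringBuilder();
        Matcher matcher = SHOW_TEXT.matcher(contentStream);
        while (matcher.find()) {
            // Undo the escapes \( \) and \\ ; octal and other escapes are out of scope.
            text.append(matcher.group(1).replaceAll("\\\\([()\\\\])", "$1"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The text object from the appearance stream shown above.
        String stream = String.join("\n",
                "BT", "/HeBo 12 Tf", "0 g", "6.696 5.857 Td",
                "(My ) Tj", "19.992 0 Td", "(Button) Tj", "ET");
        System.out.println(extract(stream)); // prints "My Button"
    }
}
```

As soon as the font uses a custom or multi-byte encoding, the bytes inside the parentheses no longer correspond to readable characters, which is why a real content stream processor with font support is the safer route.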

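Unfortunately, the main text extraction tool in PDFBox, the PDFTextStripper class, is hard to apply to anything other than page content. I therefore took the base class the text stripper is derived from, added only a minimum of text handling, and applied it to the button appearance stream.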

import java.io.IOException;

import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.DrawObject;
import org.apache.pdfbox.contentstream.operator.state.Concatenate;
import org.apache.pdfbox.contentstream.operator.state.Restore;
import org.apache.pdfbox.contentstream.operator.state.Save;
import org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters;
import org.apache.pdfbox.contentstream.operator.state.SetMatrix;
import org.apache.pdfbox.contentstream.operator.text.BeginText;
import org.apache.pdfbox.contentstream.operator.text.EndText;
import org.apache.pdfbox.contentstream.operator.text.MoveText;
import org.apache.pdfbox.contentstream.operator.text.MoveTextSetLeading;
import org.apache.pdfbox.contentstream.operator.text.NextLine;
import org.apache.pdfbox.contentstream.operator.text.SetCharSpacing;
import org.apache.pdfbox.contentstream.operator.text.SetFontAndSize;
import org.apache.pdfbox.contentstream.operator.text.SetTextHorizontalScaling;
import org.apache.pdfbox.contentstream.operator.text.SetTextLeading;
import org.apache.pdfbox.contentstream.operator.text.SetTextRenderingMode;
import org.apache.pdfbox.contentstream.operator.text.SetTextRise;
import org.apache.pdfbox.contentstream.operator.text.SetWordSpacing;
import org.apache.pdfbox.contentstream.operator.text.ShowText;
import org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted;
import org.apache.pdfbox.contentstream.operator.text.ShowTextLine;
import org.apache.pdfbox.contentstream.operator.text.ShowTextLineAndSpace;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class SimpleXObjectTextStripper extends PDFStreamEngine {
    public SimpleXObjectTextStripper() {
        addOperator(new BeginText());
        addOperator(new Concatenate());
        addOperator(new DrawObject()); // special text version
        addOperator(new EndText());
        addOperator(new SetGraphicsStateParameters());
        addOperator(new Save());
        addOperator(new Restore());
        addOperator(new NextLine());
        addOperator(new SetCharSpacing());
        addOperator(new MoveText());
        addOperator(new MoveTextSetLeading());
        addOperator(new SetFontAndSize());
        addOperator(new ShowText());
        addOperator(new ShowTextAdjusted());
        addOperator(new SetTextLeading());
        addOperator(new SetMatrix());
        addOperator(new SetTextRenderingMode());
        addOperator(new SetTextRise());
        addOperator(new SetWordSpacing());
        addOperator(new SetTextHorizontalScaling());
        addOperator(new ShowTextLine());
        addOperator(new ShowTextLineAndSpace());
    }

    public String getText(PDFormXObject form) throws IOException {
        stringBuilder.setLength(0);

        processChildStream(form, new PDPage()); 

        return stringBuilder.toString();
    }

    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
            throws IOException {
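        // unicode is null when the font provides no Unicode mapping for this glyph
        if (unicode != null) {
            stringBuilder.append(unicode);
        }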
    }

    final StringBuilder stringBuilder = new StringBuilder();
}

(SimpleXObjectTextStripper)

(I included the import statements because PDFBox contains several classes with similar names here.)

With this simple custom stripper class, the text content of the field appearances can be extracted like this:

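import java.io.IOException;
import java.util.Collections;
import java.util.Map;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAppearanceDictionary;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAppearanceEntry;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAppearanceStream;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import org.apache.pdfbox.pdmodel.interactive.form.PDTerminalField;

// Prints the text found in the normal appearance stream(s) of every terminal field.
public void showNormalFieldAppearanceTexts(PDDocument document) throws IOException {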
    PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();

    if (acroForm != null) {
        SimpleXObjectTextStripper stripper = new SimpleXObjectTextStripper();

        for (PDField field : acroForm.getFieldTree()) {
            if (field instanceof PDTerminalField) {
                PDTerminalField terminalField = (PDTerminalField) field;
                System.out.println();
                System.out.println("* " + terminalField.getFullyQualifiedName());
                for (PDAnnotationWidget widget : terminalField.getWidgets()) {
                    PDAppearanceDictionary appearance = widget.getAppearance();
                    if (appearance != null) {
                        PDAppearanceEntry normal = appearance.getNormalAppearance();
                        if (normal != null) {
                            Map<COSName, PDAppearanceStream> streams = normal.isSubDictionary() ? normal.getSubDictionary() :
                                Collections.singletonMap(COSName.DEFAULT, normal.getAppearanceStream());
                            for (Map.Entry<COSName, PDAppearanceStream> entry : streams.entrySet()) {
                                String text = stripper.getText(entry.getValue());
                                System.out.printf("  * %s: %s\n", entry.getKey().getName(), text);
                            }
                        }
                    }
                }
            }
        }
    }
}

(ExtractAppearanceText helper method)