Why does PDFBox return image dimension of size 0 x 0

Question

To find the actual size of the images in a PDF, I use PDFBox and follow the approach described in this SO answer. So basically I call

 // Computes the actual location and dimensions of each image
 PrintImageLocations renderer = new PrintImageLocations();

 for (int i = 0; i < pageLimit; ++i) {
     PDPage page = pdf.getPage(i);

     renderer.processPage(page);
 }

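and PrintImageLocations() is taken from this PDFBox code example.

However, for the PDF document I use for testing (produced by GPL Ghostscript 910 (ps2write) from an image found on Wikipedia), the reported image size is 0 x 0 (although the PDF can be imported into Gimp or Libre Office Draw).

So I wonder whether the code I am currently using is reliable for finding image sizes, and what could cause it to miss the correct image size?

The PDF used for this test can be found here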

==========

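Edit: Following @Itai's comment, it appears that the condition if ("Do".equals(operation)) never evaluates to true because no such operation is ever invoked. Hence only processOperator in the superclass is called.

The only operations invoked are (I added System.err.println("Processing " + operation); before the condition in the overridden processOperator method):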

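Processing q
Processing cm
Processing gs
Processing q
Processing re
Processing W
Processing n
Processing g
Processing re
Processing f
Processing cs
Processing scn
Processing re
Processing f
Processing Q
Processing Q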
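
The same observation can be checked programmatically by collecting the logged operator names and testing for Do. The sketch below is plain Java with no PDFBox dependency; the class name OperatorLogCheck and the hard-coded, abbreviated operator list are illustrative assumptions, not the actual log capture.

```java
import java.util.Arrays;
import java.util.List;

public class OperatorLogCheck {
    // Returns true if the XObject-drawing operator "Do" appears among the logged names.
    static boolean containsImageDraw(List<String> operatorNames) {
        return operatorNames.contains("Do");
    }

    // Counts fill operators; the set mirrors the usual PDF fill operator names.
    static long countFills(List<String> operatorNames) {
        List<String> fills = Arrays.asList("f", "F", "f*", "b", "b*", "B", "B*");
        return operatorNames.stream().filter(fills::contains).count();
    }

    public static void main(String[] args) {
        // Abbreviated sample of logged operator names (an assumption for illustration).
        List<String> logged = Arrays.asList(
                "q", "cm", "gs", "q", "re", "W", "n", "re", "f", "cs", "scn", "re", "f", "Q", "Q");
        System.out.println("Do present: " + containsImageDraw(logged));
        System.out.println("fill operators: " + countFills(logged));
    }
}
```

With a list like this, the check reports that Do never occurs at page level while fill operators do, which matches what the log suggests.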

==========

Thanks for any hints,

Answer 1

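As you have already found out yourself, the reason for the 0x0 output is that the PrintImageLocations code as-is does not find the image at all.

PrintImageLocations does not find the image because it only looks for images used in the page content and in form XObjects (including nested ones) used in the page content. In the document at hand, on the other hand, the image is drawn inside the content of a tiling pattern which is used to fill an area in the page content.

To let PDFBox find this image, we have to extend the PrintImageLocations class a bit so that it also descends into pattern content streams, e.g. like this: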

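// Imports as packaged in PDFBox 2.0.x; PrintImageLocations is the PDFBox example class
// and must additionally be imported or live in the same package.
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColor;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorN;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceCMYKColor;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceGrayColor;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceRGBColor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
import org.apache.pdfbox.pdmodel.graphics.pattern.PDTilingPattern;

class PrintImageLocationsImproved extends PrintImageLocations {
    public PrintImageLocationsImproved() throws IOException {
        super();

        // Track the non-stroking (fill) color so the graphics state knows the current pattern.
        addOperator(new SetNonStrokingColor());
        addOperator(new SetNonStrokingColorN());
        addOperator(new SetNonStrokingDeviceCMYKColor());
        addOperator(new SetNonStrokingDeviceGrayColor());
        addOperator(new SetNonStrokingDeviceRGBColor());
        addOperator(new SetNonStrokingColorSpace());
    }

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
        String operation = operator.getName();
        if (fillOperations.contains(operation)) {
            // On a fill, look up the current fill color; if it is a tiling pattern, analyze its content too.
            PDColor color = getGraphicsState().getNonStrokingColor();
            PDAbstractPattern pattern = getResources().getPattern(color.getPatternName());
            if (pattern instanceof PDTilingPattern) {
                processTilingPattern((PDTilingPattern) pattern, null, null);
            }
        }
        super.processOperator(operator, operands);
    }

    // Path-painting operators that fill (some of them also stroke).
    final List<String> fillOperations = Arrays.asList("f", "F", "f*", "b", "b*", "B", "B*");
}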

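(ExtractImageLocations inner class PrintImageLocationsImproved)

The tiling pattern in the document at hand is used as a pattern color for filling, not for stroking. Therefore, PrintImageLocationsImproved has to register operator listeners for the non-stroking color operators so that the fill color is properly updated in the graphics state.

Before delegating to the PrintImageLocations implementation, processOperator now first checks whether the operator is a fill operation. If so, it inspects the current fill color. If that is a pattern color, processOperator starts the processTilingPattern handling defined in PDFStreamEngine, which begins a nested analysis of the pattern content stream and eventually lets PrintImageLocationsImproved find the image.

Use PrintImageLocationsImproved like this: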
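
The control flow just described can be illustrated without PDFBox. The following is a toy model, not PDFBox code: ToyEngine, its string-based "streams", and the invented token forms scn:<pattern> and Do:<image> are assumptions made purely for illustration. It only shows why an engine that reacts to Do alone misses an image drawn inside a pattern, and how descending into the pattern on a fill operator finds it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyEngine {
    // Fill operator names, mirroring the list used in the class above.
    static final List<String> FILL_OPS = Arrays.asList("f", "F", "f*", "b", "b*", "B", "B*");

    final Map<String, List<String>> patterns = new HashMap<>(); // pattern name -> pattern content tokens
    final List<String> foundImages = new ArrayList<>();
    final boolean descendIntoPatterns;
    private String currentPattern; // toy stand-in for the fill color held in the graphics state

    ToyEngine(boolean descendIntoPatterns) {
        this.descendIntoPatterns = descendIntoPatterns;
    }

    // Processes tokens of the invented forms "scn:<pattern>", "Do:<image>", or a plain operator name.
    // Simplifications: q/Q do not save or restore state, and there is no protection against
    // a pattern that refers to itself.
    void process(List<String> tokens) {
        for (String token : tokens) {
            if (token.startsWith("scn:")) {
                currentPattern = token.substring(4);
            } else if (token.startsWith("Do:")) {
                foundImages.add(token.substring(3));
            } else if (descendIntoPatterns && FILL_OPS.contains(token) && currentPattern != null) {
                List<String> patternContent = patterns.get(currentPattern);
                if (patternContent != null) {
                    String saved = currentPattern;
                    currentPattern = null;      // nested stream starts without a pattern selected
                    process(patternContent);    // nested analysis of the pattern content
                    currentPattern = saved;
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList("q", "cs", "scn:P1", "re", "f", "Q");
        for (boolean descend : new boolean[] {false, true}) {
            ToyEngine engine = new ToyEngine(descend);
            engine.patterns.put("P1", Arrays.asList("q", "cm", "Do:R8", "Q"));
            engine.process(page);
            System.out.println("descend=" + descend + " -> " + engine.foundImages);
        }
    }
}
```

Without descending, the page-level tokens contain no Do and nothing is found; with descending enabled, the fill triggers processing of the pattern content and the image name is recorded.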

try (   PDDocument document = PDDocument.load(...)    )
{
    PrintImageLocations printer = new PrintImageLocationsImproved();
    int pageNum = 0;
    for( PDPage page : document.getPages() )
    {
        pageNum++;
        System.out.println( "Processing page: " + pageNum );
        printer.processPage(page);
    }
}

(ExtractImageLocations test testExtractLikeHelloWorldImprovedFromTopSecret)

As a result, the image is found for your PDF file:

Processing page: 1
*******************************************************************
Found image [R8]
position in PDF = 39.0, 102.48 in user space units
raw image size  = 1209, 1640 in pixels
displayed size  = 516.3119, 700.3752 in user space units
displayed size  = 7.1709986, 9.727433 in inches at 72 dpi rendering
displayed size  = 182.14336, 247.0768 in millimeters at 72 dpi rendering
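
The inch and millimeter figures in this output follow from the user space sizes, assuming the default of 72 user space units per inch and 25.4 mm per inch. Here is a minimal sketch of that arithmetic; the class and method names are made up for illustration, and the effective resolution is a derived estimate that the tool itself does not print.

```java
public class DisplayedSize {
    static final float UNITS_PER_INCH = 72f;  // default PDF user space: 1 unit = 1/72 inch
    static final float MM_PER_INCH = 25.4f;

    static float toInches(float userSpaceUnits) {
        return userSpaceUnits / UNITS_PER_INCH;
    }

    static float toMillimeters(float userSpaceUnits) {
        return toInches(userSpaceUnits) * MM_PER_INCH;
    }

    // Raw pixels divided by displayed inches gives the effective resolution of the placed image.
    static float effectiveDpi(int rawPixels, float userSpaceUnits) {
        return rawPixels / toInches(userSpaceUnits);
    }

    public static void main(String[] args) {
        float width = 516.3119f;   // displayed width in user space units, from the output above
        float height = 700.3752f;  // displayed height in user space units
        System.out.println("inches: " + toInches(width) + ", " + toInches(height));
        System.out.println("millimeters: " + toMillimeters(width) + ", " + toMillimeters(height));
        System.out.println("effective dpi: " + effectiveDpi(1209, width) + ", " + effectiveDpi(1640, height));
    }
}
```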

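Beware:

This is not a perfect fix; it is more of a proof of concept and a workaround, as it neither properly restricts the pattern to the area actually filled nor returns multiple finds for areas large enough to require multiple pattern tiles to fill. Nonetheless, it returns the image match for the document at hand.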
