Apache PDFBox: Confusion about coordinates

Question

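I am trying to extract some text from a PDF. To do so, I have to define a rectangle that contains the text.

When I compared the coordinates used for text extraction with the coordinates used for drawing, I realized that the coordinates may have different meanings.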

package MyTest.MyTest;

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.PDPageContentStream.*;
import org.apache.pdfbox.text.*;
import java.awt.*;
import java.io.*;

public class MyTest 
{   
  public static void main (String [] args) throws Exception
  { 
    PDDocument pd = PDDocument.load (new File ("my.pdf"));  
    PDFTextStripperByArea st = new PDFTextStripperByArea ();
    PDPage pg = pd.getPage (0);

    float h = pg.getMediaBox ().getHeight ();
    float w = pg.getMediaBox ().getWidth ();
    System.out.println (h + " x " + w + " in internal units");
    h = h / 72 * 2.54f * 10;
    w = w / 72 * 2.54f * 10;
    System.out.println (h + " x " + w + " in mm");



    int X = 85;
    int Y = 175;
    int dX = 250;
    int dY = 15;

    // extract some text
    st.addRegion ("a", new Rectangle (X, Y, dX, dY));
    st.extractRegions (pg);
    String text = st.getTextForRegion ("a");
    System.out.println("text="+text);


    // fill a rectangle
    PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false);
    contents.setNonStrokingColor (Color.RED);  
    contents.addRect (X, Y, dX, dY);
    contents.fill ();
    contents.close ();
    pd.save ("x.pdf");
  }
}

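The text I extract (the text= output in the console) is not the text that I cover with the red rectangle (in the generated x.pdf).

Why?

To test this, try any PDF you already have. To avoid a lot of trial and error aiming at a rectangle that contains text, use a file with a lot of text.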

Answer 1

There are (at least) two problems with your approach:

Different coordinate systems

You use st.addRegion. Its JavaDoc comment tells us:

/**
 * Add a new region to group text by.
 *
 * @param regionName The name of the region.
 * @param rect The rectangle area to retrieve the text from. The y-coordinates are java
 * coordinates (y == 0 is top), not PDF coordinates (y == 0 is bottom).
 */
public void addRegion( String regionName, Rectangle2D rect )

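(Actually, PDFBox's entire text extraction facility uses its own coordinate system, and there are already many questions on Stack Overflow that arose from the confusion this causes.)

contents.addRect, on the other hand, does not use those "java coordinates". Thus, you have to subtract the y coordinate you use in text extraction from the maximum crop box y coordinate to get a coordinate for addRect.

Furthermore, the anchor of the region rectangle is its upper left corner, while the anchor of a regular PDF rectangle (like the one you define with contents.addRect) is its lower left corner. Thus, you additionally have to subtract the rectangle height from the resulting y coordinate.

Actually, you may also need to change the x coordinate. It is not mirrored, but there may be an offset: the PDFBox text extraction coordinate system uses x = 0 for the left page border, which need not be the case in PDF user space. Thus, you may have to add the x coordinate of the left border of the crop box to your text extraction x coordinate.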
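The three adjustments above (flip y against the top of the crop box, move the anchor from the upper left to the lower left corner, shift x by the left crop box border) can be combined into one small conversion. The following is only a minimal sketch in plain Java: the class RegionToPdfRect and the method toPdfRect are hypothetical names, not part of PDFBox, the page is assumed to be unrotated, and the crop box values in main are made-up example numbers for an A4 page.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper, not part of PDFBox. Assumes an unrotated page. */
final class RegionToPdfRect {

    /**
     * Converts a text extraction region (upper left anchor, y grows downward,
     * x = 0 at the left page border) into a PDF user space rectangle
     * (lower left anchor, y grows upward).
     *
     * @param region          the rectangle as passed to addRegion
     * @param cropLowerLeftX  x coordinate of the left border of the crop box
     * @param cropUpperRightY maximum y coordinate of the crop box
     */
    static Rectangle2D.Float toPdfRect(Rectangle2D region,
                                       float cropLowerLeftX,
                                       float cropUpperRightY) {
        // x is not mirrored, only shifted by the left border of the crop box
        float x = cropLowerLeftX + (float) region.getX();
        // flip y against the top of the crop box, then move the anchor
        // from the upper left corner down to the lower left corner
        float y = cropUpperRightY - (float) region.getY() - (float) region.getHeight();
        return new Rectangle2D.Float(x, y,
                (float) region.getWidth(), (float) region.getHeight());
    }

    public static void main(String[] args) {
        // The region from the question; crop box assumed to span (0,0) to (595,842)
        Rectangle2D region = new Rectangle2D.Float(85, 175, 250, 15);
        Rectangle2D.Float r = toPdfRect(region, 0f, 842f);
        // prints "85.0 652.0 250.0 15.0"
        System.out.println(r.x + " " + r.y + " " + r.width + " " + r.height);
    }
}
```

In the code from the question, the two crop box values would presumably come from pg.getCropBox().getLowerLeftX() and pg.getCropBox().getUpperRightY(), and the result would be passed on as contents.addRect(r.x, r.y, r.width, r.height).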

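Possibly changed coordinate system

In the page content stream, the coordinate system may already have been changed by applying a transformation to the current transformation matrix. Thus, the coordinates in the instructions you append to it may have a meaning different from the one outlined above.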
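To illustrate the effect, here is a minimal sketch that uses java.awt.geom.AffineTransform as a stand-in for the current transformation matrix. It does not touch PDFBox at all; the class name CtmEffectDemo and the leftover matrix (a scaling by 0.5 plus a translation) are made up for the example.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical demo class; AffineTransform merely stands in for the PDF CTM. */
final class CtmEffectDemo {

    /** Maps a point given in the current coordinate system to default user space. */
    static Point2D toDefaultUserSpace(AffineTransform ctm, double x, double y) {
        return ctm.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // Untouched graphics state: coordinates mean what you expect
        AffineTransform identity = new AffineTransform();
        // Made-up leftover from earlier page content,
        // comparable to the PDF instruction "0.5 0 0 0.5 100 50 cm"
        AffineTransform leftover = new AffineTransform(0.5, 0, 0, 0.5, 100, 50);

        // stays at (85.0, 652.0)
        System.out.println(toDefaultUserSpace(identity, 85, 652));
        // ends up at (142.5, 376.0), so the rectangle is drawn somewhere else
        System.out.println(toDefaultUserSpace(leftover, 85, 652));
    }
}
```

The same numbers passed to addRect therefore end up at a different place on the page whenever the existing content leaves a non-identity matrix behind.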

To rule out such an effect, you should use a different PDPageContentStream constructor with an additional boolean resetContext parameter:

/**
 * Create a new PDPage content stream.
 *
 * @param document The document the page is part of.
 * @param sourcePage The page to write the contents to.
 * @param appendContent Indicates whether content will be overwritten, appended or prepended.
 * @param compress Tell if the content stream should compress the page contents.
 * @param resetContext Tell if the graphic context should be reset. This is only relevant when
 * the appendContent parameter is set to {@link AppendMode#APPEND}. You should use this when
 * appending to an existing stream, because the existing stream may have changed graphic
 * properties (e.g. scaling, rotation).
 * @throws IOException If there is an error writing to the page contents.
 */
public PDPageContentStream(PDDocument document, PDPage sourcePage, AppendMode appendContent,
                           boolean compress, boolean resetContext) throws IOException

I.e. replace

PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false);

by

PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false, true);
