Why are there invisible characters in my PDF and how do I filter them out with PDFBox?

Question

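I'm using PDFBox to extract text from documents by extending PDFTextStripper. I've noticed that some of these documents contain invisible characters that are being extracted. I'd like to filter these invisible characters out.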

I see there are already some Stack Overflow posts about this, for example:

I tried the PDFVisibleTextStripper class found here:

https://github.com/mkl-public/testarea-pdfbox2/blob/master/src/main/java/mkl/testarea/pdfbox2/extract/PDFVisibleTextStripper.java

However, I found that it filters out text that is actually visible. I used it as a drop-in replacement for PDFTextStripper:

package com.example.foo;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;

public class ExtractChars extends PDFVisibleTextStripper {
  Processor processor;

  public static void extract(PDDocument document, Processor processor) throws IOException {
    ExtractChars instance = new ExtractChars();

    instance.processor = processor;
    instance.setSortByPosition(true);
    instance.setStartPage(0);
    instance.setEndPage(document.getNumberOfPages());

    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    Writer streamWriter = new OutputStreamWriter(stream);

    instance.writeText(document, streamWriter);
  }

  ExtractChars() throws IOException {}

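  // Called by PDFTextStripper for each run of text; forwards every character
  // and its bounding box to the processor instead of writing to the output.
  @Override
  protected void writeString(String _string, List<TextPosition> textPositions) throws IOException {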
    for (TextPosition text: textPositions) {
      float height = text.getHeightDir();
      String character = text.getUnicode();

      int pageIndex = getCurrentPageNo() - 1;
      float left = text.getXDirAdj();
      float right = left + text.getWidthDirAdj();
      float bottom = text.getYDirAdj();
      float top = bottom - height;

      BoundingBox box = new BoundingBox(pageIndex, left, right, top, bottom);

      this.processor.process(character, box);
    }
  }

  public interface Processor {
    void process(String character, BoundingBox box);
  }
}

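I don't know whether anything needs to change in my subclass to make this work. If it helps, I can provide a PDF that exhibits this behavior, although it contains sensitive content, so I'd need to strip that out first.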

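Instead, I've created a minimal example (below) that exhibits the 'invisible text' behavior I'm seeing. After the list item '24.' there is an item 'a.' that can be highlighted and copy-pasted in a PDF viewer (e.g. macOS Preview).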

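This 'a.' is currently being extracted by PDFTextStripper, and I don't want it to be. I don't really understand why it is there. My guess is that it has something to do with clipping, but I'd appreciate it if someone could explain what's going on.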

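My ultimate goal is to filter these characters out, so any suggestions on the simplest way to handle this particular case would be appreciated. I don't think I need all of the general-purpose machinery in PDFVisibleTextStripper.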

Thanks very much!

%PDF-1.3

1 0 obj
<<
  /Type /Catalog
  /Pages 2 0 R
>>
endobj

2 0 obj
<<
  /Type /Pages
  /Kids [3 0 R]
  /Count 1
  /MediaBox [0 0 612 792]
>>
endobj

3 0 obj
<<
  /Type /Page
  /Parent 2 0 R
  /Resources 4 0 R
  /Contents 6 0 R
  /MediaBox [0 0 612 792]
>>
endobj

4 0 obj
<<
  /Font <<
    /TT2 5 0 R
  >>
>>
endobj

5 0 obj
<<
  /BaseFont
  /OXRDVC+Helvetica
  /Subtype /TrueType
  /Type /Font
>>
endobj

6 0 obj
<<
>>
stream
q 0 54 612 648 re W n /Cs1 cs 0 0 0 sc
q 1 0 0 0.8181818 0 54 cm Q
q 48 93.30545 516 569.4218 re W n /Cs1 cs 1 1 1 sc 48 93.30545 516 569.4218 re f 0 0 0 sc
q 1 0 0 0.8181818 0 54 cm BT 7.99 0 0 7.99 66.86 589.28 Tm /TT2 1 Tf (24.  ) Tj ET Q
q 1 0 0 0.8181818 0 54 cm BT 7.99 0 0 7.99 96.86 40.39 Tm /TT2 1 Tf (a.  ) Tj ET Q 
endstream
endobj

trailer
<<
  /Root 1 0 R
>>

%%EOF

Answer 1

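I figured out what was going on. The PDF contains a clipping rectangle that does not include the 'a.'. I tried using PDFVisibleTextStripper, but it removed text that was actually visible elsewhere in other documents.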
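To make the clipping explanation concrete, here is a rough sketch that redoes the arithmetic from the minimal PDF in the question using only java.awt.geom (no PDFBox). The class and method names (ClipCheck, toPage, originVisible) are made up for illustration. It only tests the text origin (the start of the baseline) against the clip, not the full glyph box, so treat it as a sanity check rather than a real visibility test.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: replays the numbers from the
// content stream of the minimal PDF in the question.
class ClipCheck {
  // Effective clip: intersection of the two "re W n" rectangles, which are
  // both set up before any "cm" is in effect (page coordinates).
  static Rectangle2D clip() {
    Rectangle2D outer = new Rectangle2D.Double(0, 54, 612, 648);
    Rectangle2D inner = new Rectangle2D.Double(48, 93.30545, 516, 569.4218);
    return outer.createIntersection(inner);
  }

  // Maps a text origin (the translation part of "Tm") through the
  // "1 0 0 0.8181818 0 54 cm" transform that wraps each BT ... ET block.
  static Point2D toPage(double tx, double ty) {
    AffineTransform cm = new AffineTransform(1, 0, 0, 0.8181818, 0, 54);
    return cm.transform(new Point2D.Double(tx, ty), null);
  }

  // True if the text origin lands inside the effective clip rectangle.
  static boolean originVisible(double tx, double ty) {
    return clip().contains(toPage(tx, ty));
  }

  public static void main(String[] args) {
    // "24." has origin (66.86, 589.28): y maps to about 536.14, inside the clip.
    System.out.println("24. visible: " + originVisible(66.86, 589.28));
    // "a." has origin (96.86, 40.39): y maps to about 87.05, which is below
    // the clip's lower edge at 93.30545, so the origin is outside the clip.
    System.out.println("a.  visible: " + originVisible(96.86, 40.39));
  }
}
```

The 'a.' origin ends up roughly 6 points below the bottom edge of the clip rectangle, which matches what the viewer shows: the text is in the content stream and selectable, but never painted.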

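In the end, I wrote a class that extends PageDrawer and overrides the showGlyph method to get access to the characters being drawn on the page. This method checks whether the character's bounding box lies outside getGraphicsState().getCurrentClippingPath().getBounds2D().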

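Unfortunately, this means I'm no longer using PDFTextStripper, so I had to re-implement some of its behavior, such as sorting characters by position (I was using setSortByPosition(true)). Computing the correct character bounding box from the font size and displacement was also a bit tricky.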

ExtractChars.java

package com.example.foo;

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.font.*;
import org.apache.pdfbox.rendering.*;
import org.apache.pdfbox.util.*;
import org.apache.pdfbox.util.Vector;
import java.awt.geom.*;
import java.io.*;

// This class effectively renders the PDF document in order to extract its
// text. It intercepts the showGlyph function provided by PageDrawer. We used to
// use PDFTextStripper but that has no way to exclude clipped characters.

public class ExtractChars extends PageDrawerHelper {
  // Skip erroneous characters smaller than this height. This might never happen
  // but there are places in the code that divide by height, so guard against it.
  static final float MIN_CHARACTER_HEIGHT = 0.01f;

  Processor processor;

  ExtractChars(PageDrawerParameters params, float pageHeight, int pageIndex, Processor processor) throws IOException {
    super(params, pageHeight, pageIndex);
    this.processor = processor;
  }

  // We can't move this method up to the superclass because the Renderer is
  // different each time. It needs to build an instance of the current class.
  public static void extract(PDDocument document, Processor processor) throws IOException {
    Renderer renderer = new Renderer(document);
    renderer.processor = processor;

    for (int i = 0; i < document.getNumberOfPages(); i += 1) {
      PDPage page = document.getPage(i);

      renderer.pageHeight = page.getMediaBox().getHeight();
      renderer.pageIndex = i;
      renderer.renderImage(i);
    }
  }

  @Override
  public void showGlyph(Matrix matrix, PDFont font, int _code, String unicode, Vector displacement) throws IOException {
    if (unicode == null) { return; }

    // Get the width and height of the character relative to font size.
    // The height does not change but the width does, e.g. 'M' is wider than 'I'.
    float width = displacement.getX();
    float height = fontHeight(font) / 2;

    BoundingBox charBox = clippedBoundingBox(matrix, width, height);

    // Skip the character if it is outside the clipping region and not visible.
    if (charBox == null) { return; }

    float boxHeight = charBox.bottom - charBox.top;
    if (boxHeight < MIN_CHARACTER_HEIGHT) { return; }

    // We need the text direction so we can sort text in separate buckets based on this.
    int direction = textDirection(matrix);

    processor.process(unicode, charBox, direction);
  }

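  // Height of the font's bounding box in text space units. This assumes the
  // usual 1000-units-per-em glyph space and that the font has a descriptor.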
  float fontHeight(PDFont font) {
    return font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000;
  }

  int textDirection(Matrix matrix) {
    float a = matrix.getValue(0, 0);
    float b = matrix.getValue(0, 1);
    float c = matrix.getValue(1, 0);
    float d = matrix.getValue(1, 1);

    // This logic is copied from:
    // https://github.com/atsuoishimoto/pdfbox-ja/blob/master/src/main/java/org/apache/pdfbox/util/TextPosition.java
    if ((a > 0) && (Math.abs(b) < d) && (Math.abs(c) < a) && (d > 0)) {
      return 0;
    } else if ((a < 0) && (Math.abs(b) < Math.abs(d)) && (Math.abs(c) < Math.abs(a)) && (d < 0)) {
      return 180;
    } else if ((Math.abs(a) < Math.abs(c)) && (b > 0) && (c < 0) && (Math.abs(d) < b)) {
      return 90;
    } else if ((Math.abs(a) < c) && (b < 0) && (c > 0) && (Math.abs(d) < Math.abs(b))) {
      return 270;
    }

    return 0;
  }

  // We can't construct an instance of ExtractChars directly because its
  // constructor requires PageDrawerParameters which is private to the package.
  // Instead, make an instance via a renderer and forward the fields to it.
  static class Renderer extends PDFRenderer {
    Processor processor;
    float pageHeight;
    int pageIndex;

    Renderer(PDDocument document) {
      super(document);
    }

    protected PageDrawer createPageDrawer(PageDrawerParameters params) throws IOException {
      return new ExtractChars(params, pageHeight, pageIndex, processor);
    }
  }

  public interface Processor {
    void process(String character, BoundingBox box, int direction);
  }
}

PageDrawerHelper.java

package com.example.foo;

import org.apache.pdfbox.rendering.*;
import org.apache.pdfbox.util.*;
import java.awt.geom.*;
import java.io.*;

// This class provides utility methods to subclasses, mostly so they can check
// if the current content is being clipped and therefore should be skipped.
//
// We shouldn't really use inheritance for sharing code but this has the
// advantage of being able to call some methods of the PageDrawer superclass.

public class PageDrawerHelper extends PageDrawer {
  float pageHeight;
  int pageIndex;

  PageDrawerHelper(PageDrawerParameters params, float pageHeight, int pageIndex) throws IOException {
    super(params);

    this.pageHeight = pageHeight;
    this.pageIndex = pageIndex;
  }

  // Gets the bounding box for a matrix by transforming corner points and taking the
  // min/max values in the x- and y-directions. This ensures rotation and skew
  // are taken into account. This method can return null if content is clipped.
  BoundingBox clippedBoundingBox(Matrix matrix, float width, float height) {
    Point2D p0 = matrix.transformPoint(0, 0);
    Point2D p1 = matrix.transformPoint(0, height);
    Point2D p2 = matrix.transformPoint(width, 0);
    Point2D p3 = matrix.transformPoint(width, height);

    BoundingBox contentBox = boundingBox(p0, p1, p2, p3);
    BoundingBox clippedBox = applyClipping(contentBox);

    return clippedBox;
  }

  BoundingBox boundingBox(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
    Point2D topLeft = topLeft(p0, p1, p2, p3);
    Point2D botRight = botRight(p0, p1, p2, p3);

    float left = (float)topLeft.getX();
    float right = (float)botRight.getX();
    float top = pageHeight - (float)botRight.getY();
    float bottom = pageHeight - (float)topLeft.getY();

    return new BoundingBox(pageIndex, left, right, top, bottom);
  }

  Point2D topLeft(Point2D... points) {
    double minX = points[0].getX();
    double minY = points[0].getY();

    for (int i = 1; i < points.length; i += 1) {
      minX = Math.min(minX, points[i].getX());
      minY = Math.min(minY, points[i].getY());
    }

    return new Point2D.Double(minX, minY);
  }

  Point2D botRight(Point2D... points) {
    double maxX = points[0].getX();
    double maxY = points[0].getY();

    for (int i = 1; i < points.length; i += 1) {
      maxX = Math.max(maxX, points[i].getX());
      maxY = Math.max(maxY, points[i].getY());
    }

    return new Point2D.Double(maxX, maxY);
  }

  BoundingBox applyClipping(BoundingBox box) {
    Rectangle2D clip = getGraphicsState().getCurrentClippingPath().getBounds2D();

    float clipLeft = (float)clip.getMinX();
    float clipRight = (float)clip.getMaxX();
    float clipTop = pageHeight - (float)clip.getMaxY();
    float clipBottom = pageHeight - (float)clip.getMinY();

    float left = Math.max(box.left, clipLeft);
    float right = Math.min(box.right, clipRight);
    float top = Math.max(box.top, clipTop);
    float bottom = Math.min(box.bottom, clipBottom);

    if (left >= right || top >= bottom) {
      return null;
    } else {
      return new BoundingBox(pageIndex, left, right, top, bottom);
    }
  }
}

CharacterSorter.java

package com.example.foo;

import java.util.*;

public class CharacterSorter {
  ArrayList<String> characters;
  ArrayList<BoundingBox> boxes;
  ArrayList<Integer> directions;

  public CharacterSorter(ArrayList<String> characters, ArrayList<BoundingBox> boxes, ArrayList<Integer> directions) {
    this.characters = characters;
    this.boxes = boxes;
    this.directions = directions;
  }

  public void sortByDirectionThenPosition() {
    List<Tuple> tuples = new ArrayList<>();

    for (int i = 0; i < characters.size(); i += 1) {
      tuples.add(new Tuple(characters.get(i), boxes.get(i), directions.get(i)));
    }

    Collections.sort(tuples);
    characters.clear(); boxes.clear(); directions.clear();

    for (Tuple tuple: tuples) {
      characters.add(tuple.character);
      boxes.add(tuple.box);
      directions.add(tuple.direction);
    }
  }

  // This helper class wraps the three fields associated with a single character
  // and provides a comparator function which mimics how PDFTextStripper orders
  // its characters when #setSortByPosition(true) is set.
  class Tuple implements Comparable<Tuple> {
    String character;
    BoundingBox box;
    Integer direction;

    Tuple(String character, BoundingBox box, Integer direction) {
      this.character = character;
      this.box = box;
      this.direction = direction;
    }

    @Override
    public int compareTo(Tuple other) {
      // Order by page first.
      int primary = Integer.compare(box.pageIndex, other.box.pageIndex);
      if (primary != 0) { return primary; }

      // The remainder of this logic is copied and adapted from:
      // https://github.com/apache/pdfbox/blob/a78f4a2ea058181e5ed05d6367ba7556948331b8/pdfbox/src/main/java/org/apache/pdfbox/text/TextPositionComparator.java#L29-L70

      // Only compare text that is in the same direction.
      int secondary = Integer.compare(direction, other.direction);
      if (secondary != 0) { return secondary; }

      // Get the text direction adjusted coordinates.
      float x1 = box.left;
      float x2 = other.box.left;

      float pos1YBottom = box.bottom;
      float pos2YBottom = other.box.bottom;

      // Note that the coordinates have been adjusted so (0, 0) is in upper left.
      float pos1YTop = pos1YBottom - (box.bottom - box.top);
      float pos2YTop = pos2YBottom - (other.box.bottom - other.box.top);

      float yDifference = Math.abs(pos1YBottom - pos2YBottom);

      // We will do a simple tolerance comparison.
      if (yDifference < .1 ||
          pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
          pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
      {
          return Float.compare(x1, x2);
      } else if (pos1YBottom < pos2YBottom) {
          return -1;
      } else {
          return 1;
      }
    }
  }
}
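
The listings above all use a BoundingBox class that isn't shown. Here is a minimal sketch of what it might look like, inferred only from how it is used: a constructor taking (pageIndex, left, right, top, bottom) and direct field access. The width() and height() helpers are my own additions and aren't used by the other listings. Coordinates are assumed to have the origin in the top-left corner of the page, so top is smaller than bottom, which is how PageDrawerHelper.boundingBox builds them.

```java
// Hypothetical value class, reconstructed from how the listings use it.
// In the project it would sit in the same package as the other classes.
final class BoundingBox {
  final int pageIndex;  // zero-based page number
  final float left;
  final float right;
  final float top;      // distance from the top of the page, so top < bottom
  final float bottom;

  BoundingBox(int pageIndex, float left, float right, float top, float bottom) {
    this.pageIndex = pageIndex;
    this.left = left;
    this.right = right;
    this.top = top;
    this.bottom = bottom;
  }

  // Convenience helpers (assumptions, not required by the other listings).
  float width() { return right - left; }
  float height() { return bottom - top; }

  @Override
  public String toString() {
    return "BoundingBox[page=" + pageIndex + ", left=" + left + ", right=" + right
        + ", top=" + top + ", bottom=" + bottom + "]";
  }
}
```

Making the fields final keeps the boxes immutable, which is convenient because CharacterSorter moves them between lists while sorting.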

java

pdfbox