Remove invisible text from a PDF using PDFBox

Link to pdf

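When I try to extract text from the PDF above, I get a mix of text that is invisible in the evince viewer and text that is visible. In addition, some of the desired text is missing characters that are not missing in the viewer, such as the 'S' in 'FALCONS' and many "½" characters. I believe this is caused by interference from the invisible text: when the PDF is highlighted in the viewer, the invisible text can be seen overlapping the visible text.

Is there a way to remove the invisible text? Or is there another solution?

Code: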

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;


public class App {

    public static String getPdfText(String pdfPath) throws IOException {
        File file = new File(pdfPath);
        PDDocument document = null;
        PDFTextStripper textStripper = null;
        String text = null;

        try {
            document = PDDocument.load(file);
            textStripper = new PDFTextStripper();
            textStripper.setEndPage(1);
            text =  textStripper.getText(document);
        } catch (IOException e) {
            throw new IOException("Could not load file and strip text.", e);
        } finally {
            try {
                if (document != null)
                    document.close();
            } catch (IOException e) {
                System.out.println("Could not close document");
            }
        }

        return text;
    }

    public static void main(String[] args) {
        String filename = "RevTeaser09072016.pdf";
        String text = null;

        try {
            text = getPdfText(filename);
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }

        System.out.println(text);
    }
}

Output (the bold text is the desired text):

145
143
159
144
160
141
157155 156154150 153149 152148 151147
142
158
500
146
Selections
Number of Teams
Amount Bet
REVERSE TEASER CARD
MARK BOX AS SHOWN
 DENOTES HOME TEAM
PRO FOOTBALL - THURSDAY, NOVEMBER 15, 2012
1 BILLS ★ NFL PM8:25 2 DOLPHINS 7– ½ 6– ½
PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012
3 REDSKINS ★ PM1:00 4 EAGLES 10– ½ 3– ½
5 PACKERS PM1:00 6 LIONS ★10– ½ 3– ½
7 FALCONS ★ PM1:00 8 CARDINALS 17– ½ 3+ ½
9 BUCCANEERS PM1:00 10 PANTHERS ★7– ½ 6– ½
11 COWBOYS ★ PM1:00 12 BROWNS 14– ½ + ½
13 RAMS ★ PM1:00 14 JETS10– ½ 3– ½
15 PATRIOTS ★ PM4:25 16 COLTS 17– ½ 3+ ½
17 TEXANS ★ PM1:00 18 JAGUARS 23– ½ 9+ ½
19 BENGALS PM1:00 20 CHIEFS ★10– ½ 3– ½
21 SAINTS PM4:05 22 RAIDERS ★12– ½ 1– ½
23 BRONCOS ★ PM4:25 24 CHARGERS14– ½ + ½
25 RAVENS NBC PM8:30 26 STEELERS ★7– ½ 6– ½
PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012
27 49ERS ★ ESPN PM8:40 28 BEARS10– ½ 3– ½
1,000
145
143
159
144
160
141
157155 156154150 153149 152148 151147
142
158
500
146
Selections
Number of Teams
Amount Bet
REVERSE TEASER CARD
MARK BOX AS HOWN
 DENOTES HOME TEAM
PRO FOOTBALL - THURSDAY, NOVEMBER 15, 2012
1 BILLS ★ NFL PM8:25 2 DOLPHINS 7– ½ 6– ½
PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012
3 REDSKINS ★ PM1:00 4 EAGLES 10– ½ 3– ½
5 PACKERS PM1:00 6 LIONS ★10– ½ 3– ½
7 FALCONS ★ PM1:00 8 CARDINALS 17– ½ 3+ ½
9 BUCCANEERS PM1:00 10 PANTHERS ★7– ½ 6– ½
11 COWBOYS ★ PM1:00 12 BROWNS 14– ½ + ½
13 RAMS ★ PM1:00 14 JETS10– ½ 3– ½
15 PATRIOTS ★ PM4:25 16 COLTS 17– ½ 3+ ½
17 TEXANS ★ PM1:00 18 JAGUARS 23– ½ 9+ ½
19 BENGALS PM1:00 20 CHIEFS ★10– ½ 3– ½
21 SAINTS PM4:05 22 RAIDERS ★12– ½ 1– ½
23 BRONCOS ★ PM4:25 24 CHARGERS14– ½ + ½
25 RAVENS NBC PM8:30 26 STEEL RS ★7– ½ 6– ½
PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012
27 49ERS ★ ESPN PM8:40 28 BEARS10– ½ 3– ½
1,000
145
143
159
14
160
41
15715 156154150 153149 152148 51147
142
158
50
146
Selection
Number of Teams
Amount Bet

ark box as sho n
 DENOTES HOME TEAM
PRO FOOTBALL - THURSDAY, NOVEMBER 15, 2012
1 BILLS ★ NFL PM8:25 2 DOLPHINS 7– ½ 6– ½
PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012
3 REDSKINS ★ PM1:0 4 EAGLES 10– ½ 3– ½
5 PACKERS PM1:0 6 LIONS ★10– ½ 3– ½
7 FALCONS ★ PM1:0 8 CARDINALS 17– ½ 3+ ½
9 BU CANEERS PM1:0 10 PANTHERS ★7– ½ 6– ½
11 COWBOYS ★ PM1:0 12 BROWNS 14– ½ + ½
13 RAMS ★ PM1:0 14 JETS10– ½ 3– ½
15 PATRIOTS ★ PM4:25 16 COLTS 17– ½ 3+ ½
17 TEXANS ★ PM1:0 18 JAGUARS 23– ½ 9+ ½
19 BENGALS PM1:0 20 CHIEFS ★10– ½ 3– ½
21 SAINTS PM4:05 22 RAIDERS ★12– ½ 1– ½
23 BRONCOS ★ PM4:25 24 CHARGERS14– ½ + ½
25 RAVENS NBC PM8:30 26 STEELERS ★7– ½ 6– ½
PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012
27 49ERS ★ ESPN PM8:40 28 BEARS10– ½ 3– ½
1,0
MARK BOX AS SHOWN 
DENOTES HOME TEAM
PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016
 1 PANTHERS nbc - 10½ 8:30p 2 BRONCOS  - 3½
 PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016
  FALCON  - 9 1:00p 4 BUCCANEERS - 4½
 5 VIKINGS - 9½ 1:00p 6 TITANS  - 4½
 7 EAGLES  - 10½ 1:00p 8 BROWNS - 3½
 9 BENGALS - 9½ 1:00p 10 JETS  - 4½
 11 SAINTS  - 7½ 1:00p 12 RAIDERS - 6½
 13 CHIEFS  - 14½ 1:00p 14 CHARGERS + ½
 15 RAVENS  - 10½ 1:00p 16 BILLS - 3½
 17 TEXANS  - 14 1:00p 18 BEARS + ½
 19 PACKERS - 12 1:00p 20 JAGUARS  - 1½
 21 SEAHAWKS  - 17½ 4:05p 22 DOLPHINS + 3½
 23 COWBOYS  - 7½ 4:25p 24 GIANTS - 6½
 25 COLTS  - 10½ 4:25p 26 LIONS - 3½
 27 CARDINALS  nbc - 14½ 8:30p 28 PATRIOTS + ½
 PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016
 29 STEELERS espn - 10½ 7:10p 30 REDSKINS  - 3½
 31 RAMS espn - 9 10:20p 32 49ERS  - 4½

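The invisible text in the OP's sample PDF is mostly made invisible in two ways: by defining clip paths (with the text lying outside their bounds) and by filling paths (which hides the text underneath). Thus, we have to take path-related instructions into account during text extraction in order to ignore the invisible text.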
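As a rough illustration of these two hiding mechanisms, here is a minimal sketch that uses only `java.awt.geom` and no PDFBox classes. The class `HiddenTextSketch` and its method `isVisible` are hypothetical names introduced for this example. It assumes that the glyph origin, the clip area, and the filled areas are all expressed in the same coordinate space.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of PDFBox) modelling the two ways text gets hidden.
final class HiddenTextSketch {

    /**
     * @param x          glyph origin x
     * @param y          glyph origin y
     * @param clip       clip area in effect when the glyph was drawn (null means unclipped)
     * @param laterFills areas that were filled after the glyph was drawn
     */
    static boolean isVisible(double x, double y, Area clip, List<Area> laterFills) {
        // Mechanism 1: the glyph lies outside the current clip path.
        if (clip != null && !clip.contains(x, y)) {
            return false;
        }
        // Mechanism 2: a path filled later covers the glyph.
        for (Area fill : laterFills) {
            if (fill.contains(x, y)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 100));
        List<Area> fills = Collections.singletonList(
                new Area(new Rectangle2D.Double(40, 40, 20, 20)));

        System.out.println(isVisible(10, 10, clip, fills));   // true: inside clip, not covered
        System.out.println(isVisible(150, 10, clip, fills));  // false: outside the clip area
        System.out.println(isVisible(50, 50, clip, fills));   // false: covered by a later fill
    }
}
```

The stripper shown further down follows the same idea, but it checks the clip path while each character is processed and removes already collected characters whenever a fill operator is encountered.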

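Unfortunately, callbacks for these instructions are not declared in PDFTextStripper or its parent classes LegacyPDFStreamEngine and PDFStreamEngine.

They are declared, however, in the other major PDFStreamEngine subclass, PDFGraphicsStreamEngine, and they are reasonably implemented in PageDrawer.

Thus, to make use of this, we can copy, paste, and adapt the PageDrawer implementation into a subclass of PDFTextStripper, e.g. like this: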

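import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.MissingOperandException;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class PDFVisibleTextStripper extends PDFTextStripper {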
    public PDFVisibleTextStripper() throws IOException {
        addOperator(new AppendRectangleToPath());
        addOperator(new ClipEvenOddRule());
        addOperator(new ClipNonZeroRule());
        addOperator(new ClosePath());
        addOperator(new CurveTo());
        addOperator(new CurveToReplicateFinalPoint());
        addOperator(new CurveToReplicateInitialPoint());
        addOperator(new EndPath());
        addOperator(new FillEvenOddAndStrokePath());
        addOperator(new FillEvenOddRule());
        addOperator(new FillNonZeroAndStrokePath());
        addOperator(new FillNonZeroRule());
        addOperator(new LineTo());
        addOperator(new MoveTo());
        addOperator(new StrokePath());
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        Matrix textMatrix = text.getTextMatrix();
        Vector start = textMatrix.transform(new Vector(0, 0));
        Vector end = new Vector(start.getX() + text.getWidth(), start.getY());

        PDGraphicsState gs = getGraphicsState();
        Area area = gs.getCurrentClippingPath();
        if (area == null || (area.contains(start.getX(), start.getY()) && area.contains(end.getX(), end.getY())))
            super.processTextPosition(text);
    }

    private GeneralPath linePath = new GeneralPath();

    void deleteCharsInPath() {
        for (List<TextPosition> list : charactersByArticle) {
            List<TextPosition> toRemove = new ArrayList<>();
            for (TextPosition text : list) {
                Matrix textMatrix = text.getTextMatrix();
                Vector start = textMatrix.transform(new Vector(0, 0));
                Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
                if (linePath.contains(start.getX(), start.getY()) || linePath.contains(end.getX(), end.getY())) {
                    toRemove.add(text);
                }
            }
            if (!toRemove.isEmpty()) {
                // these characters are covered by the path being filled
                list.removeAll(toRemove);
            }
        }
    }

    public final class AppendRectangleToPath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x = (COSNumber) operands.get(0);
            COSNumber y = (COSNumber) operands.get(1);
            COSNumber w = (COSNumber) operands.get(2);
            COSNumber h = (COSNumber) operands.get(3);

            float x1 = x.floatValue();
            float y1 = y.floatValue();

            // create a pair of coordinates for the transformation
            float x2 = w.floatValue() + x1;
            float y2 = h.floatValue() + y1;

            Point2D p0 = context.transformedPoint(x1, y1);
            Point2D p1 = context.transformedPoint(x2, y1);
            Point2D p2 = context.transformedPoint(x2, y2);
            Point2D p3 = context.transformedPoint(x1, y2);

            // to ensure that the path is created in the right direction, we have to create
            // it by combining single lines instead of creating a simple rectangle
            linePath.moveTo((float) p0.getX(), (float) p0.getY());
            linePath.lineTo((float) p1.getX(), (float) p1.getY());
            linePath.lineTo((float) p2.getX(), (float) p2.getY());
            linePath.lineTo((float) p3.getX(), (float) p3.getY());

            // close the subpath instead of adding the last line so that a possible set line
            // cap style isn't taken into account at the "beginning" of the rectangle
            linePath.closePath();
        }

        @Override
        public String getName() {
            return "re";
        }
    }

    public final class StrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.reset();
        }

        @Override
        public String getName() {
            return "S";
        }
    }

    public final class FillEvenOddRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "f*";
        }
    }

    public class FillNonZeroRule extends OperatorProcessor {
        @Override
        public final void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "f";
        }
    }

    public final class FillEvenOddAndStrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "B*";
        }
    }

    public class FillNonZeroAndStrokePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            deleteCharsInPath();
            linePath.reset();
        }

        @Override
        public String getName() {
            return "B";
        }
    }

    public final class ClipEvenOddRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
            getGraphicsState().intersectClippingPath(linePath);
        }

        @Override
        public String getName() {
            return "W*";
        }
    }

    public class ClipNonZeroRule extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
            getGraphicsState().intersectClippingPath(linePath);
        }

        @Override
        public String getName() {
            return "W";
        }
    }

    public final class MoveTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 2) {
                throw new MissingOperandException(operator, operands);
            }
            COSBase base0 = operands.get(0);
            if (!(base0 instanceof COSNumber)) {
                return;
            }
            COSBase base1 = operands.get(1);
            if (!(base1 instanceof COSNumber)) {
                return;
            }
            COSNumber x = (COSNumber) base0;
            COSNumber y = (COSNumber) base1;
            Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
            linePath.moveTo(pos.x, pos.y);
        }

        @Override
        public String getName() {
            return "m";
        }
    }

    public class LineTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 2) {
                throw new MissingOperandException(operator, operands);
            }
            COSBase base0 = operands.get(0);
            if (!(base0 instanceof COSNumber)) {
                return;
            }
            COSBase base1 = operands.get(1);
            if (!(base1 instanceof COSNumber)) {
                return;
            }
            // append straight line segment from the current point to the point
            COSNumber x = (COSNumber) base0;
            COSNumber y = (COSNumber) base1;

            Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());

            linePath.lineTo(pos.x, pos.y);
        }

        @Override
        public String getName() {
            return "l";
        }
    }

    public class CurveTo extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 6) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x1 = (COSNumber) operands.get(0);
            COSNumber y1 = (COSNumber) operands.get(1);
            COSNumber x2 = (COSNumber) operands.get(2);
            COSNumber y2 = (COSNumber) operands.get(3);
            COSNumber x3 = (COSNumber) operands.get(4);
            COSNumber y3 = (COSNumber) operands.get(5);

            Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
            Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo(point1.x, point1.y, point2.x, point2.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "c";
        }
    }

    public final class CurveToReplicateFinalPoint extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x1 = (COSNumber) operands.get(0);
            COSNumber y1 = (COSNumber) operands.get(1);
            COSNumber x3 = (COSNumber) operands.get(2);
            COSNumber y3 = (COSNumber) operands.get(3);

            Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo(point1.x, point1.y, point3.x, point3.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "y";
        }
    }

    public class CurveToReplicateInitialPoint extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.class)) {
                return;
            }
            COSNumber x2 = (COSNumber) operands.get(0);
            COSNumber y2 = (COSNumber) operands.get(1);
            COSNumber x3 = (COSNumber) operands.get(2);
            COSNumber y3 = (COSNumber) operands.get(3);

            Point2D currentPoint = linePath.getCurrentPoint();

            Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
            Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

            linePath.curveTo((float) currentPoint.getX(), (float) currentPoint.getY(), point2.x, point2.y, point3.x, point3.y);
        }

        @Override
        public String getName() {
            return "v";
        }
    }

    public final class ClosePath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.closePath();
        }

        @Override
        public String getName() {
            return "h";
        }
    }

    public final class EndPath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            linePath.reset();
        }

        @Override
        public String getName() {
            return "n";
        }
    }
}

(PDFVisibleTextStripper)

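Please make sure that you use the inner operator classes in the PDFVisibleTextStripper constructor, not the classes of the same name that PageDrawer uses. To be sure, simply follow the link underneath the code.

This reduces the output to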

REVERSE tEaSER caRd
500
elections
er of Teams
t Bet
1,000
MARK BOX AS SHOWN 
DENOTES HOME TEAM
PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016
 1 PANTHERS    nbc  - 10½ 8:30p 2 BRONCOS   - 3½
 PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016
 3 FALCONS     - 9½ 1:00p 4 BUCCANEERS  - 4½
 5 VIKINGS   - 9½ 1:00p 6 TITANS  - 4½
 7 EAGLES  - 10½ 1:00p 8 BROWNS  - 3½
 9 BENGALS - 9½ 1:00p 10 JETS  - 4½
 11 SAINTS    - 7½ 1:00p 12 RAIDERS   - 6½
 13 CHIEFS  - 14½ 1:00p 14 CHARGERS  + ½
 15 RAVENS  - 10½ 1:00p 16 BILLS - 3½
 17 TEXANS  - 14½ 1:00p 18 BEARS + ½
 19 PACKERS - 12½ 1:00p 20 JAGUARS  - 1½
 21 SEAHAWKS    - 17½ 4:05p 22 DOLPHINS + 3½
 23 COWBOYS    - 7½ 4:25p 24 GIANTS - 6½
 25 COLTS     - 10½ 4:25p 26 LIONS - 3½
 27 CARDINALS   nbc  - 14½ 8:30p 28 PATRIOTS + ½
 PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016
 29 STEELERS  espn  - 10½ 7:10p 30 REDSKINS  - 3½
 31 RAMS  espn  - 9½ 10:20p 32 49ERS  - 4½

This discards most of the unwanted data.


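In the context of another question, it became clear that the way processTextPosition and deleteCharsInPath calculate the end of the character baseline implicitly assumes horizontal text on a page without rotation. If one relaxes the criterion for "visibility", though, one can consider a character visible as soon as the start of its baseline is visible. In that case one no longer needs to calculate the Vector end, and the code also works for rotated pages.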
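The difference between the two criteria can be sketched with plain `java.awt.geom` types. The class `BaselineStartCheck` and its method names are hypothetical and are not taken from the linked code. The sketch assumes a glyph whose baseline runs vertically (as on a rotated page), so adding its width to the x coordinate yields a point that has nothing to do with the real baseline end.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Hypothetical comparison of the strict and the relaxed visibility criteria.
final class BaselineStartCheck {

    // Strict criterion as in the stripper above: the start and a horizontally
    // offset "end" must both lie inside the clip area.
    static boolean startAndHorizontalEndVisible(Area clip, float x, float y, float width) {
        return clip == null || (clip.contains(x, y) && clip.contains(x + width, y));
    }

    // Relaxed criterion: only the start of the baseline has to be inside the clip area.
    static boolean startVisible(Area clip, float x, float y) {
        return clip == null || clip.contains(x, y);
    }

    public static void main(String[] args) {
        // A tall, narrow clip area, such as one column of vertically running text.
        Area clip = new Area(new Rectangle2D.Float(0, 0, 20, 100));

        // Glyph origin at (10, 10), width 30, but the text actually runs upwards.
        System.out.println(startAndHorizontalEndVisible(clip, 10, 10, 30)); // false: (40, 10) is outside
        System.out.println(startVisible(clip, 10, 10));                     // true
    }
}
```

Applied to the stripper, this would mean dropping the `end` vector from both processTextPosition and deleteCharsInPath and testing only `start`.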


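In the context of another question, it became clear that, due to floating-point calculation errors, the coordinates of glyph origins lying exactly on the clip path border may wander outside the clip path. Switching to "fat point coordinate checks" turned out to be an acceptable workaround.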
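One possible reading of such a "fat point" check is to test a small square around the glyph origin against the clip area instead of the exact point. The following is a sketch under that assumption, not the code from the linked answer. The class `FatPointCheck`, the method `containsFat`, and the tolerance of 0.5 units are all hypothetical choices.

```java
import java.awt.Shape;
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Hypothetical "fat point" test: a point counts as inside if a small square
// centred on it touches the clip shape.
final class FatPointCheck {

    static boolean containsFat(Shape clip, double x, double y, double tolerance) {
        if (clip == null) {
            return true; // no clip path, so nothing is clipped away
        }
        if (clip.contains(x, y)) {
            return true; // the exact point is already inside
        }
        // Otherwise accept the point if a square of side 2 * tolerance around it
        // overlaps the interior of the clip shape.
        return clip.intersects(x - tolerance, y - tolerance, 2 * tolerance, 2 * tolerance);
    }

    public static void main(String[] args) {
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 100));

        // An origin that drifted just past the right border through rounding.
        double x = 100.0001;
        double y = 50;

        System.out.println(clip.contains(x, y));            // false: the exact check rejects it
        System.out.println(containsFat(clip, x, y, 0.5));   // true: the fat check accepts it
        System.out.println(containsFat(clip, 150, y, 0.5)); // false: clearly outside
    }
}
```

The tolerance trades precision for robustness: a larger value rescues more border glyphs but may also let through text that sits just outside the clip path.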