Extract checkbox value out of PDF 1.7 using PDFBox
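I recently started using PDFBox to extract text from PDFs. Along with the text, though, I also need to extract the check box values shown in the image. I tried different ways to locate the check box elements and extract their values.
After inspecting the PDF text with this tool, I found that the check boxes are not images or anything like that, but some kind of graphics represented by the following content.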
ET
Q
q
BT
/F2 6 Tf
481.3 653.29 Td
( ) Tj
ET
Q
q
1 1 1 rg
484.3 653.29 9 9 re
f
Q
q
0.87059 0.87059 0.87059 rg
485.05 661.54 m
492.55 661.54 l
493.3 662.29 l
484.3 662.29 l
485.05 661.54 l
f
Q
q
0.87059 0.87059 0.87059 rg
492.55 661.54 m
492.55 654.04 l
493.3 653.29 l
493.3 662.29 l
492.55 661.54 l
f
Q
q
0.87059 0.87059 0.87059 rg
492.55 654.04 m
485.05 654.04 l
484.3 653.29 l
493.3 653.29 l
492.55 654.04 l
f
Q
q
0.87059 0.87059 0.87059 rg
485.05 654.04 m
485.05 661.54 l
484.3 662.29 l
484.3 653.29 l
485.05 654.04 l
f
Q
q
BT
/F2 6 Tf
495.55 653.29 Td
(Yes) Tj
ET
Q
q
BT
/F2 6 Tf
504.88 653.29 Td
( ) Tj
ET
Q
q
1 1 1 rg
507.88 653.29 9 9 re
f
Q
q
0.87059 0.87059 0.87059 rg
508.63 661.54 m
516.13 661.54 l
516.88 662.29 l
507.88 662.29 l
508.63 661.54 l
f
Q
q
0.87059 0.87059 0.87059 rg
516.13 661.54 m
516.13 654.04 l
516.88 653.29 l
516.88 662.29 l
516.13 661.54 l
f
Q
q
0.87059 0.87059 0.87059 rg
516.13 654.04 m
508.63 654.04 l
507.88 653.29 l
516.88 653.29 l
516.13 654.04 l
f
Q
q
0.87059 0.87059 0.87059 rg
508.63 654.04 m
508.63 661.54 l
507.88 662.29 l
507.88 653.29 l
508.63 654.04 l
f
Q
q
BT
/F2 6 Tf
519.13 653.29 Td
(No) Tj
ET
Q
q
BT
/F2 6 Tf
36.75 642.95 Td
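I am not sure how to extract this from the PDF. I have seen the different parsers PDFBox provides, but it looks like I need to learn more about how a PDF is constructed. Any pointers would be appreciated.
In a comment you confirmed that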
all check boxes and check marks are drawn identically
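in your input documents.
Thus, to extract the check boxes and their checked state from your documents, you can search the page content for exactly the instruction sequences that draw the boxes and marks in your sample document.
How the boxes and check marks are drawn
As you already found out yourself, a box is drawn by filling one path per edge (top, right, bottom, left) separately, in the case of the "Yes" box of question 1: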
485.05 661.54 m
492.55 661.54 l
493.3 662.29 l
484.3 662.29 l
485.05 661.54 l
f
...
492.55 661.54 m
492.55 654.04 l
493.3 653.29 l
493.3 662.29 l
492.55 661.54 l
f
...
492.55 654.04 m
485.05 654.04 l
484.3 653.29 l
493.3 653.29 l
492.55 654.04 l
f
...
485.05 654.04 m
485.05 661.54 l
484.3 662.29 l
484.3 653.29 l
485.05 654.04 l
f
Inspecting all boxes in the document, one sees that their drawing instructions follow this pattern:
A B m
(A+7.5) B l
(A+8.25) (B+0.75) l
(A-0.75) (B+0.75) l
A B l
f
...
C B m
C (B-7.5) l
(C+0.75) (B-8.25) l
(C+0.75) (B+0.75) l
C B l
f
...
C D m
(C-7.5) D l
(C-8.25) (D-0.75) l
(C+0.75) (D-0.75) l
C D l
f
...
A D m
A (D+7.5) l
(A-0.75) (D+8.25) l
(A-0.75) (D-0.75) l
A D l
f
Here A and C are the left and right x coordinates of the box, and B and D are its top and bottom y coordinates.
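To make the pattern concrete, here is a minimal sketch in plain Java (no PDFBox needed) that takes the top-edge path of the "Yes" box of question 1 and computes the differences between consecutive points. The class name EdgeDeltas and its helper are hypothetical and exist only for illustration. The point is that these differences do not depend on where the box sits on the page, so they can serve as a fingerprint for this kind of path.

```java
import java.util.Arrays;

// Illustrative helper, not part of PDFBox: turns a path into position-independent deltas.
public class EdgeDeltas {
    // points: x0, y0, x1, y1, ... in the order of the m / l instructions
    static double[] deltas(double[] points) {
        double[] result = new double[points.length - 2];
        for (int i = 0; i < result.length; i++) {
            // difference to the same coordinate of the previous point,
            // rounded to two decimals to hide floating point noise
            result[i] = Math.round((points[i + 2] - points[i]) * 100.0) / 100.0;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] topEdge = {
            485.05, 661.54,  // m
            492.55, 661.54,  // l
            493.3, 662.29,   // l
            484.3, 662.29,   // l
            485.05, 661.54   // l
        };
        System.out.println(Arrays.toString(deltas(topEdge)));
    }
}
```

Running the same helper on the top edge of any other box in the document should give the same eight numbers, because only A and B change from box to box.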
Similarly, a check mark is drawn by filling two paths (its left and right halves) separately, for the mark in the "Yes" box of question 1:
0.70711 -0.70711 0.70711 0.70711 -323.79 536.88 cm
...
489.55 661.54 m
489.55 657.79 l
490.3 657.04 l
490.3 661.54 l
489.55 661.54 l
f
...
489.55 657.79 m
488.05 657.79 l
488.05 657.04 l
490.3 657.04 l
489.55 657.79 l
f
Inspecting all check marks in the document, one sees that their drawing instructions follow this pattern:
0.70711 -0.70711 0.70711 0.70711 X Y cm
...
A B m
A (B-3.75) l
(A+0.75) (B-4.5) l
(A+0.75) B l
A B l
f
...
A C m
(A-1.5) C l
(A-1.5) (C-0.75) l
(A+0.75) (C-0.75) l
A C l
f
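The first line transforms the coordinate system, rotating it by 45° around some point; this allows the check mark to be drawn mainly with horizontal and vertical lines.
In this rotated coordinate system, (A, B) is the upper left corner of the longer check mark arm, and (A, C) is the topmost point of the line along which the two arms of the check mark join.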
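To see what the rotation does to the numbers, the following sketch applies the cm matrix of the sample check mark with java.awt.geom.AffineTransform. It assumes that no other transformation is in effect when the cm instruction is executed; the class name CheckMarkTransform and its two helpers are hypothetical.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Illustrative only: maps check mark coordinates from the rotated system to page coordinates.
public class CheckMarkTransform {
    // The six operands of the cm instruction, in PDF order a b c d e f.
    // AffineTransform takes them in the same order (m00 m10 m01 m11 m02 m12).
    static final AffineTransform CM =
            new AffineTransform(0.70711, -0.70711, 0.70711, 0.70711, -323.79, 536.88);

    // Maps a point from the rotated coordinate system to page coordinates.
    static Point2D toPage(double x, double y) {
        return CM.transform(new Point2D.Double(x, y), null);
    }

    // Maps a direction vector only, ignoring the translation part of the matrix.
    static Point2D deltaToPage(double dx, double dy) {
        return CM.deltaTransform(new Point2D.Double(dx, dy), null);
    }

    public static void main(String[] args) {
        System.out.println("start of the long arm: " + toPage(489.55, 661.54));
        System.out.println("first segment:         " + deltaToPage(0, -3.75));
    }
}
```

Under that assumption the start of the long arm ends up at approximately (490.16, 658.50), which lies inside the "Yes" box, and the vertical segment of length 3.75 becomes a diagonal step of about -2.65 in both x and y. This matters for the search, because as far as I know a PDFGraphicsStreamEngine subclass in PDFBox 2.x receives path coordinates after the current transformation matrix has been applied.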
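How to search for those instruction sequences
A related task has already been implemented in the PdfBoxFinder class, which collects lines drawn as thin, elongated rectangles that form a grid.
Thus, in our case we can use the same foundation, the PDFBox PDFGraphicsStreamEngine class. We merely have to look at a different kind of path (built from move-to and line-to instructions instead of rectangle instructions) and of course process the paths differently (instead of recognizing a grid, we have to recognize our specific check boxes and check marks).
Such a check box finder class can be implemented like this: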
// imports for PDFBox 2.x
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

public class PdfCheckBoxFinder extends PDFGraphicsStreamEngine {
    public class CheckBox {
        public Point2D getLowerLeft() { return lowerLeft; }
        public Point2D getUpperRight() { return upperRight; }
        public boolean isChecked() { return checked; }

        CheckBox(Point2D lowerLeft, Point2D upperRight, boolean checked) {
            this.lowerLeft = lowerLeft;
            this.upperRight = upperRight;
            this.checked = checked;
        }

        final Point2D lowerLeft;
        final Point2D upperRight;
        final boolean checked;
    }

    public PdfCheckBoxFinder(PDPage page) {
        super(page);
        for (int i = 0; i < pathAnchorsByType.length; i++)
            pathAnchorsByType[i] = new ArrayList<Point2D>();
    }

    // A box is reported where all four edge paths share the same anchor;
    // it counts as checked if both check mark halves have an anchor inside it.
    public List<CheckBox> getBoxes() {
        if (checkBoxes.isEmpty()) {
            for (Point2D anchor : pathAnchorsByType[PathType.boxBottom.index]) {
                if (containsApproximatly(pathAnchorsByType[PathType.boxLeft.index], anchor) &&
                        containsApproximatly(pathAnchorsByType[PathType.boxRight.index], anchor) &&
                        containsApproximatly(pathAnchorsByType[PathType.boxTop.index], anchor)) {
                    Point2D upperRight = new Point2D.Float(7.5f + (float)anchor.getX(), 7.5f + (float)anchor.getY());
                    boolean checked = containsInRectangle(pathAnchorsByType[PathType.checkLeft.index], anchor, upperRight) &&
                            containsInRectangle(pathAnchorsByType[PathType.checkRight.index], anchor, upperRight);
                    checkBoxes.add(new CheckBox(anchor, upperRight, checked));
                }
            }
        }
        return Collections.unmodifiableList(checkBoxes);
    }

    boolean containsApproximatly(List<Point2D> points, Point2D anchor) {
        for (Point2D point : points) {
            if (approximatelyEquals(point.getX(), anchor.getX()) && approximatelyEquals(point.getY(), anchor.getY()))
                return true;
        }
        return false;
    }

    boolean containsInRectangle(List<Point2D> points, Point2D lowerLeft, Point2D upperRight) {
        for (Point2D point : points) {
            if (lowerLeft.getX() < point.getX() && point.getX() < upperRight.getX() &&
                    lowerLeft.getY() < point.getY() && point.getY() < upperRight.getY())
                return true;
        }
        return false;
    }

    //
    // PDFGraphicsStreamEngine overrides
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        moveTo((float) p0.getX(), (float) p0.getY());
        path.add(new Rectangle(p0, p1, p2, p3));
    }

    @Override
    public void moveTo(float x, float y) throws IOException {
        currentPoint = new Point2D.Float(x, y);
        currentStartPoint = currentPoint;
    }

    @Override
    public void lineTo(float x, float y) throws IOException {
        Point2D point = new Point2D.Float(x, y);
        path.add(new Line(currentPoint, point));
        currentPoint = point;
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {
        Point2D point1 = new Point2D.Float(x1, y1);
        Point2D point2 = new Point2D.Float(x2, y2);
        Point2D point3 = new Point2D.Float(x3, y3);
        path.add(new Curve(currentPoint, point1, point2, point3));
        currentPoint = point3;
    }

    @Override
    public Point2D getCurrentPoint() throws IOException {
        return currentPoint;
    }

    @Override
    public void closePath() throws IOException {
        path.add(new Line(currentPoint, currentStartPoint));
        currentPoint = currentStartPoint;
    }

    @Override
    public void endPath() throws IOException {
        clearPath();
    }

    @Override
    public void strokePath() throws IOException {
        clearPath();
    }

    // Only filled paths are inspected; boxes and marks in the sample are drawn with "f".
    @Override
    public void fillPath(int windingRule) throws IOException {
        processPath();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        clearPath();
    }

    @Override public void drawImage(PDImage pdImage) throws IOException { }
    @Override public void clip(int windingRule) throws IOException { }
    @Override public void shadingFill(COSName shadingName) throws IOException { }

    //
    // internal representation of a path
    //
    interface PathElement {
    }

    class Rectangle implements PathElement {
        final Point2D p0, p1, p2, p3;

        Rectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }
    }

    class Line implements PathElement {
        final Point2D p0, p1;

        Line(Point2D p0, Point2D p1) {
            this.p0 = p0;
            this.p1 = p1;
        }
    }

    class Curve implements PathElement {
        final Point2D p0, p1, p2, p3;

        Curve(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }
    }

    Point2D currentPoint = null;
    Point2D currentStartPoint = null;

    void clearPath() {
        path.clear();
        currentPoint = null;
        currentStartPoint = null;
    }

    // Records the anchor of the current path for every path type it matches.
    void processPath() {
        for (PathType pathType : PathType.values()) {
            if (pathType.matches(path)) {
                pathAnchorsByType[pathType.index].add(pathType.getAnchor(path));
            }
        }
        clearPath();
    }

    // diffs: expected x/y differences of the consecutive line segments of a path;
    // offsetToAnchor: offset from the first path point to the anchor of the path.
    enum PathType {
        boxTop(new float[] {7.5f, 0f, .75f, .75f, -9f, 0f, .75f, -.75f}, new float[] {0f, -7.5f}, 0),
        boxRight(new float[] {0f, -7.5f, .75f, -.75f, 0f, 9f, -.75f, -.75f}, new float[] {-7.5f, -7.5f}, 1),
        boxBottom(new float[] {-7.5f, 0f, -.75f, -.75f, 9f, 0f, -.75f, .75f}, new float[] {-7.5f, 0f}, 2),
        boxLeft(new float[] {0f, 7.5f, -.75f, .75f, 0f, -9f, .75f, .75f}, new float[] {0f, 0f}, 3),
        checkRight(new float[] {-2.65165f, -2.65165f, 0f, -1.06066f, 3.18198f, 3.18198f, -.53033f, .53033f}, new float[] {-2.65165f, -2.65165f/*-5.1072f, -4.4559f*/}, 4),
        checkLeft(new float[] {-1.06066f, 1.06066f, -.53033f, -.53033f, 1.59099f, -1.59099f, 0f, 1.06066f}, new float[] {0f, 0f/*-2.4556f, -1.8042f*/}, 5)
        ;

        PathType(float[] diffs, float[] offsetToAnchor, int index) {
            this.diffs = diffs;
            this.offsetToAnchor = offsetToAnchor;
            this.index = index;
        }

        boolean matches(List<PathElement> path) {
            if (path != null && path.size() * 2 == diffs.length) {
                for (int i = 0; i < path.size(); i++) {
                    PathElement element = path.get(i);
                    if (!(element instanceof Line))
                        return false;
                    Line line = (Line) element;
                    if (!approximatelyEquals(line.p1.getX() - line.p0.getX(), diffs[i*2]))
                        return false;
                    if (!approximatelyEquals(line.p1.getY() - line.p0.getY(), diffs[i*2+1]))
                        return false;
                }
                return true;
            }
            return false;
        }

        Point2D getAnchor(List<PathElement> path) {
            if (path != null && path.size() > 0) {
                PathElement element = path.get(0);
                if (element instanceof Line) {
                    Line line = (Line) element;
                    Point2D p = line.p0;
                    return new Point2D.Float((float)p.getX() + offsetToAnchor[0], (float)p.getY() + offsetToAnchor[1]);
                }
            }
            return null;
        }

        final float[] diffs;
        final float[] offsetToAnchor;
        final int index;
    }

    static boolean approximatelyEquals(double f, double g) {
        return Math.abs(f - g) < 0.001;
    }

    //
    // members
    //
    final List<PathElement> path = new ArrayList<>();
    final List<Point2D>[] pathAnchorsByType = new List[PathType.values().length];
    final List<CheckBox> checkBoxes = new ArrayList<>();
}
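The offsetToAnchor values deserve a word: each edge type shifts the first point of its path so that all four edges of one box yield the same anchor, the lower left inner corner of the box. The following sketch, again plain Java with a hypothetical class name BoxAnchors, checks this for the "Yes" box of question 1, using the first points from the content stream and the offsets from the PathType entries above.

```java
import java.util.Locale;

// Illustrative only: shows that the four edge paths of one box map to one common anchor.
public class BoxAnchors {
    // Adds the per-edge offset to the first point of an edge path.
    static double[] anchor(double startX, double startY, double offsetX, double offsetY) {
        return new double[] { startX + offsetX, startY + offsetY };
    }

    public static void main(String[] args) {
        double[][] anchors = {
            anchor(485.05, 661.54,  0.0, -7.5),  // boxTop
            anchor(492.55, 661.54, -7.5, -7.5),  // boxRight
            anchor(492.55, 654.04, -7.5,  0.0),  // boxBottom
            anchor(485.05, 654.04,  0.0,  0.0)   // boxLeft
        };
        for (double[] a : anchors)
            System.out.printf(Locale.ROOT, "%.2f %.2f%n", a[0], a[1]);
    }
}
```

All four lines show the same point, which is why getBoxes can start from the boxBottom anchors and merely look for approximately equal anchors in the other three lists.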
You can use the PdfCheckBoxFinder like this to find the check boxes of a document and their checked state:
PDDocument document = ...
for (PDPage page : document.getPages())
{
    PdfCheckBoxFinder finder = new PdfCheckBoxFinder(page);
    finder.processPage(page);
    for (CheckBox checkBox : finder.getBoxes()) {
        Point2D ll = checkBox.getLowerLeft();
        Point2D ur = checkBox.getUpperRight();
        String checked = checkBox.isChecked() ? "checked" : "not checked";
        System.out.printf(Locale.ROOT, "* (%4.3f, %4.3f) - (%4.3f, %4.3f) - %s\n", ll.getX(), ll.getY(), ur.getX(), ur.getY(), checked);
    }
}
(ExtractCheckBoxes test testExtractFromUpdatedForm)
For your sample PDF, one gets
* (485.050, 654.040) - (492.550, 661.540) - checked
* (508.630, 654.040) - (516.130, 661.540) - not checked
* (485.050, 641.760) - (492.550, 649.260) - checked
* (508.630, 641.760) - (516.130, 649.260) - not checked
* (485.050, 629.490) - (492.550, 636.990) - not checked
* (508.630, 629.490) - (516.130, 636.990) - checked
* (485.050, 617.220) - (492.550, 624.720) - checked
* (508.630, 617.220) - (516.130, 624.720) - not checked
* (485.050, 593.700) - (492.550, 601.200) - checked
* (508.630, 593.700) - (516.130, 601.200) - not checked
* (485.050, 581.420) - (492.550, 588.920) - checked
* (508.630, 581.420) - (516.130, 588.920) - not checked
* (485.050, 569.150) - (492.550, 576.650) - checked
* (508.630, 569.150) - (516.130, 576.650) - not checked
* (91.330, 553.500) - (98.830, 561.000) - not checked
* (125.570, 553.500) - (133.070, 561.000) - not checked
* (200.150, 553.500) - (207.650, 561.000) - not checked
* (286.220, 553.500) - (293.720, 561.000) - not checked
* (77.190, 331.430) - (84.690, 338.930) - not checked
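(The coordinates are in the natural coordinate system given by the crop box of the PDF page in question. To relate them to coordinates from the PDFTextStripper, a transformation into the proprietary coordinate system of the text stripper may be necessary.)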
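As a rough idea of what such a transformation can look like, here is a minimal sketch. It assumes that the page is not rotated and that the text stripper measures from the upper left corner of the crop box with y growing downwards; both are assumptions you should check against your PDFBox version. The class name TopLeftCoordinates, the helper, and the crop box values in main (a US Letter page starting at the origin) are hypothetical and not taken from your sample PDF.

```java
import java.util.Locale;

// Illustrative only: flips PDF user space coordinates into a top-left based system.
public class TopLeftCoordinates {
    // x, y: point in PDF user space (origin lower left, y up)
    // cropLowerLeftX, cropUpperRightY: the relevant crop box bounds of the page
    static double[] toTopLeft(double x, double y, double cropLowerLeftX, double cropUpperRightY) {
        return new double[] { x - cropLowerLeftX, cropUpperRightY - y };
    }

    public static void main(String[] args) {
        // Assumed crop box: lower left x = 0, upper right y = 792 (hypothetical values)
        double[] p = toTopLeft(485.05, 661.54, 0, 792);
        System.out.printf(Locale.ROOT, "%.2f %.2f%n", p[0], p[1]);
    }
}
```

In real code the two bounds would come from the crop box of the page instead of constants.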
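Please be aware, though, that as mentioned at the start, the code above only works for check boxes and check marks built exactly as in your sample PDF. You confirmed that this will be the case, but you might be in for a surprise.
If you do encounter a (very!) few variants, you can add PathType entries matching those variants and enhance getBoxes accordingly to recognize all of them.
If you happen to encounter more than just a few variants, you should use OCR instead.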
How to combine the check boxes with text extraction
In a comment you asked
is there a possibility if I can remove the graphics and replate it with some text for an example C or 'N' then I can do text extraction of the newly generated pdf
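Indeed, one can simply add text markers for checked and unchecked check boxes to the page and then apply text extraction to get the text including those markers. I would propose using Dingbats like ✔ and ✗, though. This can be done like this: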
PDDocument document = ...;
PDType1Font font = PDType1Font.ZAPF_DINGBATS;
for (PDPage page : document.getPages())
{
    PdfCheckBoxFinder finder = new PdfCheckBoxFinder(page);
    finder.processPage(page);
    for (CheckBox checkBox : finder.getBoxes()) {
        Point2D ll = checkBox.getLowerLeft();
        Point2D ur = checkBox.getUpperRight();
        String checkBoxString = checkBox.isChecked() ? "\u2714" : "\u2717";
        try ( PDPageContentStream canvas = new PDPageContentStream(document, page, AppendMode.APPEND, false, true)) {
            canvas.beginText();
            canvas.setNonStrokingColor(1, 0, 0);
            canvas.setFont(font, (float)(ur.getY()-ll.getY()));
            canvas.newLineAtOffset((float)ll.getX(), (float)ll.getY());
            canvas.showText(checkBoxString);
            canvas.endText();
        }
    }
}
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
String text = stripper.getText(document);
(ExtractCheckBoxes test testExtractInlinedInTextFromUpdatedForm)
For your sample PDF, one gets
1. Have you met or discussed with principal life to be assured? ✔ Yes ✗ No
2. Is the principal life to be assured an existing bank customer? ✔ Yes ✗ No
3. Are you related to the proposed Life to be Assured? If yes, please state your relationship with applicant ✗ Yes ✔ No
4. Are you satisfied with the financial standing of the proposed Life to be Assured? ✔ Yes ✗ No
What is the estimated annual income of the Life to be Assured? 600000
...
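Once the markers are part of the extracted text, reading the selected option per line becomes a plain string task. Here is a minimal sketch, assuming the markers come out of the text stripper as U+2714 and U+2717 as in the output above and that the option label directly follows its marker; the class name CheckedAnswer and its helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: picks the option label that follows the check mark in a text line.
public class CheckedAnswer {
    // Matches the check mark U+2714, optional whitespace, and the following word.
    private static final Pattern CHECKED = Pattern.compile("\u2714\\s*(\\S+)");

    // Returns the label right after the first check mark, or null if the line has none.
    static String checkedOption(String line) {
        Matcher matcher = CHECKED.matcher(line);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "3. Are you related to the proposed Life to be Assured? \u2717 Yes \u2714 No";
        System.out.println(checkedOption(line));
    }
}
```

Lines without any marker, like the one about the estimated annual income, simply yield null and can be handled as ordinary text.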