How to find table border lines in pdf using PDFBox?
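I am trying to find the table border lines in a PDF. I used PDFBox's PrintTextLocations class to build the words. Now I am looking for the coordinates of the different lines that form the table. I tried using org.apache.pdfbox.pdfviewer.PageDrawer, but I could not find any character/graphic containing those lines. I tried two approaches:
First: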
Graphics g = null;
Dimension d = new Dimension();
d.setSize(700, 700);
PageDrawer pageDrawer = new PageDrawer();
pageDrawer.drawPage(g, myPage, d);
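It gave me a NullPointerException. So, second, I tried to override the processStream function, but I could not get any strokes. Please help me. I am open to using any other library that gives me the coordinates of the lines in the table. Another quick question: what type of object are those table border lines in PDFBox? Are they graphics or characters?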
Here is the link to the sample PDF I am trying to parse:
http://stats.bls.gov/news.release/pdf/empsit.pdf
I am trying to get the table lines on page 8.
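Edit: I ran into another problem. When parsing page 1 of this PDF, I cannot get any lines because the pathIterator in the printPath() function is empty, even though the strokePath() function is called for each line. How can I work with this PDF?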
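In the 1.8.* versions, the PDFBox parsing functionality is not implemented in a very generic way. In particular, the OperatorProcessor implementations are tightly coupled to specific parser classes; for example, the implementations dealing with path drawing operations assume that they interact with a PageDrawer instance.
Thus, unless one wants to copy and paste all those OperatorProcessor classes with minor changes, one has to derive from such a specific parser class.
In your case, therefore, we also derive our parser from PageDrawer, since we are, after all, interested in the path drawing operations: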
import java.awt.Image;
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.PathIterator;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.pdfviewer.PageDrawer;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.TextPosition;

public class PrintPaths extends PageDrawer
{
    //
    // constructor
    //
    public PrintPaths() throws IOException
    {
        super();
    }

    //
    // method overrides for mere path observation
    //
    // ignore text
    @Override
    protected void processTextPosition(TextPosition text) { }

    // ignore bitmaps
    @Override
    public void drawImage(Image awtImage, AffineTransform at) { }

    // ignore shadings
    @Override
    public void shFill(COSName shadingName) throws IOException { }

    // set the page size from the crop box before the content stream is processed
    @Override
    public void processStream(PDPage aPage, PDResources resources, COSStream cosStream) throws IOException
    {
        PDRectangle cropBox = aPage.findCropBox();
        this.pageSize = cropBox.createDimension();
        super.processStream(aPage, resources, cosStream);
    }

    // print the current path instead of filling it, then discard it
    @Override
    public void fillPath(int windingRule) throws IOException
    {
        printPath();
        System.out.printf("Fill; windingrule: %s\n\n", windingRule);
        getLinePath().reset();
    }

    // print the current path instead of stroking it, then discard it
    @Override
    public void strokePath() throws IOException
    {
        printPath();
        System.out.printf("Stroke; unscaled width: %s\n\n", getGraphicsState().getLineWidth());
        getLinePath().reset();
    }

    // walk the segments of the current path and print each one
    void printPath()
    {
        GeneralPath path = getLinePath();
        PathIterator pathIterator = path.getPathIterator(null);
        double x = 0, y = 0; // current point, needed for the direction of each line
        double coords[] = new double[6];
        while (!pathIterator.isDone()) {
            switch (pathIterator.currentSegment(coords)) {
            case PathIterator.SEG_MOVETO:
                System.out.printf("Move to (%s %s)\n", coords[0], fixY(coords[1]));
                x = coords[0];
                y = coords[1];
                break;
            case PathIterator.SEG_LINETO:
                double width = getEffectiveWidth(coords[0] - x, coords[1] - y);
                System.out.printf("Line to (%s %s), scaled width %s\n", coords[0], fixY(coords[1]), width);
                x = coords[0];
                y = coords[1];
                break;
            case PathIterator.SEG_QUADTO:
                System.out.printf("Quad along (%s %s) and (%s %s)\n", coords[0], fixY(coords[1]), coords[2], fixY(coords[3]));
                x = coords[2];
                y = coords[3];
                break;
            case PathIterator.SEG_CUBICTO:
                System.out.printf("Cubic along (%s %s), (%s %s), and (%s %s)\n", coords[0], fixY(coords[1]), coords[2], fixY(coords[3]), coords[4], fixY(coords[5]));
                x = coords[4];
                y = coords[5];
                break;
            case PathIterator.SEG_CLOSE:
                System.out.println("Close path");
                break;
            }
            pathIterator.next();
        }
    }

    // line width after the current transformation matrix is applied,
    // measured perpendicular to the direction (dirX, dirY) of the line
    double getEffectiveWidth(double dirX, double dirY)
    {
        if (dirX == 0 && dirY == 0)
            return 0;
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        // vector perpendicular to the line direction
        double widthX = dirY;
        double widthY = -dirX;
        double widthXTransformed = widthX * ctm.getValue(0, 0) + widthY * ctm.getValue(1, 0);
        double widthYTransformed = widthX * ctm.getValue(0, 1) + widthY * ctm.getValue(1, 1);
        double factor = Math.sqrt((widthXTransformed*widthXTransformed + widthYTransformed*widthYTransformed) / (widthX*widthX + widthY*widthY));
        return getGraphicsState().getLineWidth() * factor;
    }
}
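Since we do not want to actually draw the page but merely extract the paths that would be drawn, we have to strip down PageDrawer like this.
This sample parser prints the path drawing operations to show how it is done. Obviously, you can collect them instead for automated processing...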
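As a rough illustration of collecting instead of printing, here is a minimal sketch that turns the straight segments of a java.awt.geom.GeneralPath into a list of Line2D objects. PathSegmentCollector and collectLines are hypothetical names, not part of PDFBox; the sketch uses only the JDK and simply skips curves, on the assumption that table borders are built from straight lines.

```java
import java.awt.geom.GeneralPath;
import java.awt.geom.Line2D;
import java.awt.geom.PathIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFBox): collects the straight segments of a path.
class PathSegmentCollector {

    /** Returns every straight segment of the path, including the one implied by a close. */
    static List<Line2D.Double> collectLines(GeneralPath path) {
        List<Line2D.Double> lines = new ArrayList<Line2D.Double>();
        double[] c = new double[6];
        double startX = 0, startY = 0; // first point of the current subpath
        double x = 0, y = 0;           // current point
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            switch (it.currentSegment(c)) {
            case PathIterator.SEG_MOVETO:
                startX = x = c[0];
                startY = y = c[1];
                break;
            case PathIterator.SEG_LINETO:
                lines.add(new Line2D.Double(x, y, c[0], c[1]));
                x = c[0];
                y = c[1];
                break;
            case PathIterator.SEG_QUADTO:
                // curves are skipped; only the current point is tracked
                x = c[2];
                y = c[3];
                break;
            case PathIterator.SEG_CUBICTO:
                x = c[4];
                y = c[5];
                break;
            case PathIterator.SEG_CLOSE:
                // closing a subpath draws a line back to its start point
                if (x != startX || y != startY) {
                    lines.add(new Line2D.Double(x, y, startX, startY));
                }
                x = startX;
                y = startY;
                break;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        GeneralPath path = new GeneralPath();
        path.moveTo(10, 10);
        path.lineTo(110, 10);
        path.lineTo(110, 60);
        path.closePath();
        for (Line2D.Double line : collectLines(path)) {
            System.out.println(line.getP1() + " -> " + line.getP2());
        }
    }
}
```

Inside PrintPaths, such a helper could presumably be called with getLinePath() in strokePath() before the path is reset. The y values would then still need the same fixY treatment as in printPath().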
You can use the parser like this:
PDDocument document = PDDocument.load(resource);
List<?> allPages = document.getDocumentCatalog().getAllPages();
int i = 7; // page 8
System.out.println("\n\nPage " + (i+1));
PrintPaths printPaths = new PrintPaths();
PDPage page = (PDPage) allPages.get(i);
PDStream contents = page.getContents();
if (contents != null)
{
    printPaths.processStream(page, page.findResources(), page.getContents().getStream());
}
The output is:
Page 8
Move to (35.92070007324219 724.6490478515625)
Line to (574.72998046875 724.6490478515625), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981
Move to (35.92070007324219 694.4660034179688)
Line to (574.72998046875 694.4660034179688), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981
Move to (292.2610168457031 468.677001953125)
Line to (292.8590087890625 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (348.9360046386719 468.677001953125)
Line to (349.53399658203125 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (405.6090087890625 468.677001953125)
Line to (406.2070007324219 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (462.281982421875 468.677001953125)
Line to (462.8799743652344 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (518.9549560546875 468.677001953125)
Line to (519.553955078125 468.677001953125), scaled width 512.9430076434463
Stroke; unscaled width: 5129.43
Move to (35.92070007324219 725.447998046875)
Line to (574.72998046875 725.447998046875), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981
Move to (35.92070007324219 212.5050048828125)
Line to (574.72998046875 212.5050048828125), scaled width 0.5981000089123845
Stroke; unscaled width: 5.981
Quite peculiar: the vertical lines are actually drawn as very short (about 0.6 units), very thick (about 513 units) horizontal lines...
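Because of this, the direction of a segment alone does not say whether a border is visually horizontal or vertical; the area the stroke covers does. The following is a minimal sketch of that idea using only the JDK. StrokeBox, coveredArea and looksVertical are hypothetical names, and the sketch assumes axis-parallel segments with butt line caps, so the stroke covers a plain rectangle.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper (not part of PDFBox): area covered by an axis-parallel stroke.
// Assumes butt line caps, i.e. the stroke does not extend beyond the segment ends.
class StrokeBox {

    /** Rectangle covered by stroking the segment (x1,y1)-(x2,y2) with the given effective width. */
    static Rectangle2D.Double coveredArea(double x1, double y1, double x2, double y2, double width) {
        double half = width / 2;
        if (y1 == y2) { // horizontal segment: the line width extends vertically
            return new Rectangle2D.Double(Math.min(x1, x2), y1 - half, Math.abs(x2 - x1), width);
        }
        if (x1 == x2) { // vertical segment: the line width extends horizontally
            return new Rectangle2D.Double(x1 - half, Math.min(y1, y2), width, Math.abs(y2 - y1));
        }
        throw new IllegalArgumentException("segment is not axis-parallel");
    }

    /** True if the covered area is taller than it is wide, i.e. it looks like a vertical rule. */
    static boolean looksVertical(Rectangle2D area) {
        return area.getHeight() > area.getWidth();
    }

    public static void main(String[] args) {
        // third stroke from the output above: a short, very thick horizontal segment
        Rectangle2D.Double thick = coveredArea(292.2610168457031, 468.677001953125,
                292.8590087890625, 468.677001953125, 512.9430076434463);
        // first stroke from the output above: a long, thin horizontal segment
        Rectangle2D.Double thin = coveredArea(35.92070007324219, 724.6490478515625,
                574.72998046875, 724.6490478515625, 0.5981000089123845);
        System.out.println(thick + " vertical: " + looksVertical(thick));
        System.out.println(thin + " vertical: " + looksVertical(thin));
    }
}
```

With the numbers from the output above, the thick stroke covers roughly y = 212.2 to y = 725.1, which is close to the span between the lowest and highest horizontal lines of the table.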