Get images by rectangle
I have this method, which extracts the text at a specific location in a PDF:
public static void getTextByRectangle(PDDocument doc, Rectangle rect) throws IOException {
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition(true);
    stripper.addRegion("class1", rect);
    PDPage firstPage = doc.getPage(0);
    stripper.extractRegions(firstPage);
    System.out.println("Text in the area:" + rect);
    System.out.println(stripper.getTextForRegion("class1"));
}
Is it possible to do the same thing, but extract images instead?
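Yes. You can walk through all the images on the page and compare each image's position with your rect. The PDFBox example below shows how to obtain the image positions.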
You need to create a class that extends PDFStreamEngine, like this:
public class PrintImageLocations extends PDFStreamEngine
You should override processOperator. ctmNew gives you the image's position and displayed size; compare those with your rect and you will find the right image.
@Override
protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
    String operation = operator.getName();
    if ("Do".equals(operation)) {
        COSName objectName = (COSName) operands.get(0);
        PDXObject xobject = getResources().getXObject(objectName);
        if (xobject instanceof PDImageXObject) {
            PDImageXObject image = (PDImageXObject) xobject;
            Matrix ctmNew = getGraphicsState().getCurrentTransformationMatrix();
            float imageXScale = ctmNew.getScalingFactorX();
            float imageYScale = ctmNew.getScalingFactorY();
            // position in user space units. 1 unit = 1/72 inch at 72 dpi
            System.out.println("position in PDF = " + ctmNew.getTranslateX() + ", "
                    + ctmNew.getTranslateY() + " in user space units");
            // displayed size in user space units
            System.out.println("displayed size = " + imageXScale + ", " + imageYScale
                    + " in user space units");
        } else if (xobject instanceof PDFormXObject) {
            // a form XObject can contain further images, so recurse into it
            PDFormXObject form = (PDFormXObject) xobject;
            showForm(form);
        }
    } else {
        super.processOperator(operator, operands);
    }
}
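The comparison step itself is not shown above, so here is a rough sketch of one way to do it, using only java.awt.geom. It rests on a few assumptions that you should check against your own documents: that rect uses the top-left-origin coordinates of PDFTextStripperByArea regions while the CTM is in bottom-left-origin user space, that the page is not rotated, and that its crop box starts at (0, 0). The class and method names (ImageRegionFilter, imageInRegion and so on) are made up for illustration and are not part of PDFBox. Inside processOperator you could probably get the AffineTransform from ctmNew.createAffineTransform() and the page height from the page's crop box.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
public class ImageRegionFilter {

    // An image is painted into the unit square (0,0)-(1,1), which the CTM maps
    // onto the page. The bounding box of that mapped square is where the image
    // appears in user space (this also covers rotated or skewed images).
    public static Rectangle2D imageBoundsInUserSpace(AffineTransform ctm) {
        Rectangle2D unitSquare = new Rectangle2D.Double(0, 0, 1, 1);
        return ctm.createTransformedShape(unitSquare).getBounds2D();
    }

    // Assumption: the region has its origin at the top-left with y growing
    // downwards, so y is flipped to match bottom-left-origin user space.
    public static Rectangle2D toUserSpace(Rectangle2D region, double pageHeight) {
        double flippedY = pageHeight - region.getY() - region.getHeight();
        return new Rectangle2D.Double(region.getX(), flippedY,
                region.getWidth(), region.getHeight());
    }

    // True if the image overlaps the region at all.
    public static boolean imageInRegion(AffineTransform ctm, Rectangle2D region,
                                        double pageHeight) {
        return toUserSpace(region, pageHeight).intersects(imageBoundsInUserSpace(ctm));
    }

    public static void main(String[] args) {
        // Example values only: a 200 x 100 unit image whose lower-left corner
        // sits at (50, 600) on a page that is 792 units high.
        AffineTransform ctm = new AffineTransform(200, 0, 0, 100, 50, 600);
        Rectangle2D region = new Rectangle2D.Double(40, 80, 250, 150);
        System.out.println(imageInRegion(ctm, region, 792));
    }
}
```

If you only want images that lie entirely inside the rect, swap intersects for contains in imageInRegion.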
Thanks to mkl and FiReTiTi for their suggestions.