Unable to extract values from PDF for specific coordinates using java apache pdfbox

Question

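My task is to extract text from a PDF at specific coordinates.

I use the Apache PDFBox client for the data extraction.

To get the x, y, width and height coordinates from the PDF, I use the PDF-XChange tool, which reports them in millimetres. When I pass these values into the rectangle, I get an empty value.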

public String getTextUsingPositionsUsingPdf(String pdfLocation, int pageNumber, double x, double y, double width,
                double height) throws IOException {
            String extractedText = "";
            // PDDocument.load opens an existing PDF document from the given file.
            PDDocument document = null;
            PDPage page = null;
            try {
                if (pdfLocation.endsWith(".pdf")) {
                    document = PDDocument.load(new File(pdfLocation));
                    int getDocumentPageCount = document.getNumberOfPages();
                    System.out.println(getDocumentPageCount);

                    // Get the requested page. getPage takes a zero-based page index,
                    // so index 0 is the first page. Note that pageNumber + 1 is passed
                    // here, so pageNumber 0 selects the second page of the document.
                    if (getDocumentPageCount > 0) {
                        page = document.getPage(pageNumber + 1);
                    } else if (getDocumentPageCount == 0) {
                        page = document.getPage(0);
                    }
                    // To create a rectangle by passing the x axis, y axis, width and height 
                    Rectangle2D rect = new Rectangle2D.Double(x, y, width, height);
                    String regionName = "region1";

                    // Extract the text inside the rectangle with PDFTextStripperByArea.
                    // Each region must be registered under a name.
                    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                    stripper.setSortByPosition(true);
                    stripper.addRegion(regionName, rect);
                    stripper.extractRegions(page);
                    extractedText = stripper.getTextForRegion(regionName);
                    System.out.println("Region is " + extractedText);
                } else {
                    System.out.println("No data return");
                }
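            } catch (IOException e) {
                System.out.println("Could not read the PDF: " + e.getMessage());
            } finally {
                // document stays null if the path check or the load failed
                if (document != null) {
                    document.close();
                }
            }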
            // Return the extracted text and this can be used for assertion
            return extractedText;
        }

Please advise whether my approach is correct.

Answer 1

I have used this PDF: tutorialspoint.com/uipath/uipath_tutorial.pdf. I am trying to find the text "a part of contests", which has x = 55.6 mm, y = 168.8, width = 210.0 mm and height = 297.0, but I am getting an empty value.

I tested your method with these inputs:

System.out.println("Extracting like Venkatachalam Neelakantan from uipath_tutorial.pdf\n");
float MM_TO_UNITS = 1/(10*2.54f)*72;
String text = getTextUsingPositionsUsingPdf("src/test/resources/mkl/testarea/pdfbox2/extract/uipath_tutorial.pdf",
        0, 55.6 * MM_TO_UNITS, 168.8 * MM_TO_UNITS, 210.0 * MM_TO_UNITS, 297.0 * MM_TO_UNITS);
System.out.printf("\n---\nResult:\n%s\n", text);

(ExtractText test testUiPathTutorial)

and got the result

 part of contents of this e-book in any manner without written consent 

te the contents of our website and tutorials as timely and as precisely as 
, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. 
guarantee regarding the accuracy, timeliness or completeness of our 
tents including this tutorial. If you discover any errors on our website or 
ease notify us at contact@tutorialspoint.com 

i

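Assuming you were actually looking for "a part of contents" rather than "a part of contests", only the 'a' is missing. Probably you looked for the start of the visible letter drawing when measuring, but the actual glyph origin is slightly before that. If you choose a slightly smaller x, e.g. 54.6 mm, you will also get the 'a'.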
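A minimal sketch of that adjustment, using only java.awt.geom: it moves the left edge of a measured region outwards by a small margin so that glyph origins slightly left of the visible ink still fall inside. The class RegionPadding, the helper padLeft and the 1 mm margin are hypothetical names and values chosen for illustration. Unlike simply lowering x, this variant keeps the right edge where it was.

```java
import java.awt.geom.Rectangle2D;

class RegionPadding {
    // Millimetres to default PDF user space units (1/72 inch).
    static final double MM_TO_UNITS = 72.0 / 25.4;

    // Hypothetical helper: moves the left edge outwards by padMm millimetres
    // and widens the rectangle by the same amount, so the right edge stays put.
    static Rectangle2D.Double padLeft(Rectangle2D.Double region, double padMm) {
        double pad = padMm * MM_TO_UNITS;
        return new Rectangle2D.Double(region.x - pad, region.y,
                region.width + pad, region.height);
    }

    public static void main(String[] args) {
        // The rectangle from the question, converted to user space units.
        Rectangle2D.Double measured = new Rectangle2D.Double(
                55.6 * MM_TO_UNITS, 168.8 * MM_TO_UNITS,
                210.0 * MM_TO_UNITS, 297.0 * MM_TO_UNITS);
        // Assumed margin of 1 mm, matching the 55.6 -> 54.6 mm suggestion.
        Rectangle2D.Double padded = padLeft(measured, 1.0);
        System.out.printf("left edge: %.2f -> %.2f units%n", measured.x, padded.x);
    }
}
```

The padded rectangle can then be passed to stripper.addRegion in place of the measured one.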

Considering the width and height of your rectangle, it is clearly not surprising that you get more than "a part of contents".
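To put numbers on that: 210 mm x 297 mm is the size of a whole A4 page, so a rectangle of that size starting at (55.6 mm, 168.8 mm) reaches well beyond an A4 page on the right and at the bottom and captures everything from the start point onwards. The sketch below compares the two and builds a tighter region. Whether this PDF's pages are actually A4 is an assumption, the 40 mm x 5 mm size of the tight region is an illustrative guess rather than a measured value, and RegionSize and regionFromMm are hypothetical names.

```java
import java.awt.geom.Rectangle2D;

class RegionSize {
    // Millimetres to default PDF user space units (1/72 inch).
    static final double MM_TO_UNITS = 72.0 / 25.4;

    // Hypothetical helper: converts a rectangle given in millimetres
    // into a rectangle in user space units.
    static Rectangle2D.Double regionFromMm(double xMm, double yMm,
                                           double widthMm, double heightMm) {
        return new Rectangle2D.Double(xMm * MM_TO_UNITS, yMm * MM_TO_UNITS,
                widthMm * MM_TO_UNITS, heightMm * MM_TO_UNITS);
    }

    public static void main(String[] args) {
        Rectangle2D.Double asked = regionFromMm(55.6, 168.8, 210.0, 297.0);
        Rectangle2D.Double a4Page = regionFromMm(0, 0, 210.0, 297.0); // assumed page size
        System.out.printf("region right edge %.1f vs page width %.1f%n",
                asked.getMaxX(), a4Page.getMaxX());
        System.out.printf("region bottom edge %.1f vs page height %.1f%n",
                asked.getMaxY(), a4Page.getMaxY());

        // Assumed size: roughly one line of text, a few words wide.
        Rectangle2D.Double tight = regionFromMm(54.6, 168.8, 40.0, 5.0);
        System.out.printf("tight region: %.1f x %.1f units%n",
                tight.getWidth(), tight.getHeight());
    }
}
```

The width and height of the tight region would need to be measured for the phrase actually wanted.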

If you wonder about the MM_TO_UNITS constant, have a look at this answer.
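In short, the constant follows from the unit definitions: the default PDF user space unit is 1/72 inch, and one inch is 25.4 mm (10 x 2.54), which gives 72 / 25.4, about 2.83 units per millimetre. A small sketch of the conversion follows. It assumes the page does not override the default unit size, and the class MmToUnits and the method mmToUnits are hypothetical names.

```java
class MmToUnits {
    // Same value as 1 / (10 * 2.54f) * 72 above, written in double precision:
    // 72 units per inch divided by 25.4 mm per inch.
    static final double MM_TO_UNITS = 72.0 / 25.4;

    // Converts a length in millimetres to default PDF user space units.
    static double mmToUnits(double mm) {
        return mm * MM_TO_UNITS;
    }

    public static void main(String[] args) {
        // The x and y values measured in the question.
        System.out.printf("x = %.2f units, y = %.2f units%n",
                mmToUnits(55.6), mmToUnits(168.8));
    }
}
```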

java

pdfbox