How to search for a specific string or word and its coordinates in a PDF document in Java
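I am using PDFBox to search for a word (or string) in a PDF file, and I also want to know the coordinates of that word.
For example, the PDF file contains a string like "${abc}", and I want to know the coordinates of this string.
I tried a couple of examples but did not get the result I wanted.
The output shows the coordinates of the individual characters instead.
Here is the code: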
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    for (TextPosition text : textPositions) {
        System.out.println("String[" + text.getXDirAdj() + "," +
                text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
                text.getXScale() + " height=" + text.getHeightDir() + " space=" +
                text.getWidthOfSpace() + " width=" +
                text.getWidthDirAdj() + "]" + text.getUnicode());
    }
}
I am using PDFBox 2.0.
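As noted, this is not an answer to your question, but below is a skeleton example of how you could do this in iText. That is not to say that it is impossible in PDFBox.
Basically, you create a RenderListener that accepts "parse events". You pass this listener to PdfReaderContentParser.processContent. In the listener's renderText method you get all the information you need to reconstruct the layout, including the x/y coordinates and the text/image/... that makes up the content.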
RenderListener listener = new RenderListener() {
    @Override
    public void renderText(TextRenderInfo arg0) {
        LineSegment segment = arg0.getBaseline();
        int x = (int) segment.getStartPoint().get(Vector.I1);
        // smaller Y means closer to the BOTTOM of the page. So we negate the Y to get proper top-to-bottom ordering
        int y = -(int) segment.getStartPoint().get(Vector.I2);
        int endx = (int) segment.getEndPoint().get(Vector.I1);
        log.debug("renderText " + x + ".." + endx + "/" + y + ": " + arg0.getText());
        ...
    }
    ... // other overrides
};
PdfReaderContentParser p = new PdfReaderContentParser(reader);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    log.info("handling page " + i);
    p.processContent(i, listener);
}
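To make the "negate the Y" comment above concrete, here is a minimal sketch of how chunks collected in renderText could be put into reading order. TextChunk and ChunkOrdering are hypothetical names made up for this illustration, not iText classes; the sketch only assumes that each chunk stores the x and the already negated y computed in the listener.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical holder for what the listener logs; not part of iText. */
final class TextChunk {
    final int x;        // baseline start x
    final int y;        // NEGATED baseline y, as computed in renderText
    final String text;

    TextChunk(int x, int y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

final class ChunkOrdering {
    /** Returns a sorted copy: top of the page first, then left to right. */
    static List<TextChunk> inReadingOrder(List<TextChunk> chunks) {
        List<TextChunk> sorted = new ArrayList<>(chunks);
        // Because y is negated, ascending order means top-to-bottom.
        sorted.sort(Comparator.comparingInt((TextChunk c) -> c.y)
                .thenComparingInt(c -> c.x));
        return sorted;
    }

    public static void main(String[] args) {
        List<TextChunk> chunks = new ArrayList<>();
        chunks.add(new TextChunk(200, -700, "world"));
        chunks.add(new TextChunk(72, -650, "second line"));
        chunks.add(new TextChunk(72, -700, "hello"));
        for (TextChunk c : inReadingOrder(chunks)) {
            System.out.println(c.x + "/" + c.y + ": " + c.text);
        }
    }
}
```

Chunks on the same baseline only compare equal here if their rounded y values match exactly; a real document would probably need some tolerance.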
The last method in PDFBox' PDFTextStripper class that still has the text together with its positions (before it is reduced to plain text) is the method
/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
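One should intercept here because this method receives pre-processed, in particular sorted, TextPosition objects (if sorting was requested to start with).
(Actually I would have preferred to intercept in the calling method writeLine which, according to the names of its parameters and local variables, has all the TextPosition instances of a line and calls writeString once per word; unfortunately, the PDFBox developers have declared this method private... well, maybe that changes before the final 2.0.0 release... nudge, nudge. Update: unfortunately it has not changed in the release... sigh.)
Furthermore, it helps to use a helper class that wraps a sequence of TextPosition instances in a String-like class; this makes the code clearer.
With this in mind, one can search for the variables like this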
List<TextPositionSequence> findSubwords(PDDocument document, int page, String searchTerm) throws IOException
{
    final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            TextPositionSequence word = new TextPositionSequence(textPositions);
            String string = word.toString();

            int fromIndex = 0;
            int index;
            while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
            {
                hits.add(word.subSequence(index, index + searchTerm.length()));
                fromIndex = index + 1;
            }
            super.writeString(text, textPositions);
        }
    };

    stripper.setSortByPosition(true);
    stripper.setStartPage(page);
    stripper.setEndPage(page);
    stripper.getText(document);
    return hits;
}
with this helper class
public class TextPositionSequence implements CharSequence
{
    public TextPositionSequence(List<TextPosition> textPositions)
    {
        this(textPositions, 0, textPositions.size());
    }

    public TextPositionSequence(List<TextPosition> textPositions, int start, int end)
    {
        this.textPositions = textPositions;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length()
    {
        return end - start;
    }

    @Override
    public char charAt(int index)
    {
        TextPosition textPosition = textPositionAt(index);
        String text = textPosition.getUnicode();
        return text.charAt(0);
    }

    @Override
    public TextPositionSequence subSequence(int start, int end)
    {
        return new TextPositionSequence(textPositions, this.start + start, this.start + end);
    }

    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder(length());
        for (int i = 0; i < length(); i++)
        {
            builder.append(charAt(i));
        }
        return builder.toString();
    }

    public TextPosition textPositionAt(int index)
    {
        return textPositions.get(start + index);
    }

    public float getX()
    {
        return textPositions.get(start).getXDirAdj();
    }

    public float getY()
    {
        return textPositions.get(start).getYDirAdj();
    }

    public float getWidth()
    {
        if (end == start)
            return 0;
        TextPosition first = textPositions.get(start);
        TextPosition last = textPositions.get(end - 1);
        return last.getWidthDirAdj() + last.getXDirAdj() - first.getXDirAdj();
    }

    final List<TextPosition> textPositions;
    final int start, end;
}
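If you want to play with the windowing idea of the helper class without a PDF at hand, the following is a rough, PDFBox-free sketch of the same pattern. Glyph, GlyphSequence and the monospaced factory are hypothetical names invented for this illustration; Glyph merely stands in for TextPosition and carries a character, its x position and its width.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for TextPosition: one character with x position and width. */
final class Glyph {
    final char ch;
    final float x;
    final float width;

    Glyph(char ch, float x, float width) {
        this.ch = ch;
        this.x = x;
        this.width = width;
    }
}

/** Window [start, end) over a shared glyph list, analogous to TextPositionSequence. */
final class GlyphSequence implements CharSequence {
    private final List<Glyph> glyphs;
    private final int start;
    private final int end;

    GlyphSequence(List<Glyph> glyphs, int start, int end) {
        this.glyphs = glyphs;
        this.start = start;
        this.end = end;
    }

    /** Demo-only factory: lays out text with a fixed glyph width starting at x0. */
    static GlyphSequence monospaced(String text, float x0, float glyphWidth) {
        List<Glyph> list = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            list.add(new Glyph(text.charAt(i), x0 + i * glyphWidth, glyphWidth));
        }
        return new GlyphSequence(list, 0, list.size());
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        return glyphs.get(start + index).ch;
    }

    @Override
    public GlyphSequence subSequence(int from, int to) {
        // Indices are relative to this window, so shift them by this.start.
        return new GlyphSequence(glyphs, start + from, start + to);
    }

    @Override
    public String toString() {
        StringBuilder builder = new StringBuilder(length());
        for (int i = 0; i < length(); i++) {
            builder.append(charAt(i));
        }
        return builder.toString();
    }

    float getX() {
        return glyphs.get(start).x;
    }

    /** Right edge of the last glyph minus left edge of the first one. */
    float getWidth() {
        if (end == start) {
            return 0;
        }
        Glyph last = glyphs.get(end - 1);
        return last.x + last.width - getX();
    }
}
```

For example, GlyphSequence.monospaced("ab${x}cd", 100f, 10f).subSequence(2, 6) represents "${x}", starts at x = 120 and is 40 wide. No glyph is copied; a sub-sequence is just another pair of offsets into the same list, which is why the hits keep access to their positions.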
To merely output their positions, widths, final letters, and final letter positions, you can then use this
void printSubwords(PDDocument document, String searchTerm) throws IOException
{
    System.out.printf("* Looking for '%s'\n", searchTerm);
    for (int page = 1; page <= document.getNumberOfPages(); page++)
    {
        List<TextPositionSequence> hits = findSubwords(document, page, searchTerm);
        for (TextPositionSequence hit : hits)
        {
            TextPosition lastPosition = hit.textPositionAt(hit.length() - 1);
            System.out.printf("  Page %s at %s, %s with width %s and last letter '%s' at %s, %s\n",
                    page, hit.getX(), hit.getY(), hit.getWidth(),
                    lastPosition.getUnicode(), lastPosition.getXDirAdj(), lastPosition.getYDirAdj());
        }
    }
}
For testing I created a small test file using MS Word.
The output of this test
@Test
public void testVariables() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("Variables.pdf");
            PDDocument document = PDDocument.load(resource);    )
    {
        System.out.println("\nVariables.pdf\n-------------\n");
        printSubwords(document, "${var1}");
        printSubwords(document, "${var 2}");
    }
}
is
Variables.pdf
-------------
* Looking for '${var1}'
Page 1 at 164.39648, 158.06 with width 34.67856 and last letter '}' at 193.22, 158.06
Page 1 at 188.75699, 174.13995 with width 34.58806 and last letter '}' at 217.49, 174.13995
Page 1 at 167.49583, 190.21997 with width 38.000168 and last letter '}' at 196.22, 190.21997
Page 1 at 176.67009, 206.18 with width 35.667114 and last letter '}' at 205.49, 206.18
* Looking for '${var 2}'
Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81
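I was a bit surprised that ${var 2} was found at all when it sits on a single line; after all, the PDFBox code made me assume that the method writeString I overrode only receives words; it looks like it receives parts of a line that are longer than mere words...
If you need other data from the grouped TextPosition instances, simply enhance TextPositionSequence accordingly.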
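The search loop inside findSubwords can also be looked at in isolation. The following is a minimal sketch on plain strings; ChunkSearch and findAll are hypothetical names used only here, and the guard against an empty search term is an addition of mine that the code above does not have.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkSearch {
    /** Start index of every occurrence of term in chunk, overlapping ones included. */
    static List<Integer> findAll(String chunk, String term) {
        List<Integer> starts = new ArrayList<>();
        if (term.isEmpty()) {
            // Added guard: an empty term would match at every index.
            return starts;
        }
        int fromIndex = 0;
        int index;
        while ((index = chunk.indexOf(term, fromIndex)) > -1) {
            starts.add(index);
            // Advance by one character only, so overlapping hits are kept.
            fromIndex = index + 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findAll("${var1} and ${var1}", "${var1}"));
        System.out.println(findAll("aaa", "aa"));
        // Only the first half of the term is in this chunk, so nothing is found.
        System.out.println(findAll("text ${va", "${var1}"));
    }
}
```

The last call illustrates a limitation that follows from searching each writeString call on its own: a term that PDFBox happens to deliver split over two calls would not be found, so how well this works depends on how the text is chunked.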
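I was looking for a way to highlight different words in a PDF file. To do so, I need to know the word coordinates properly, so what I do is take the (x, y) coordinates of the top-left corner of the first letter and of the top-right corner of the last letter.
Later on, save the points in an array. Keep in mind that to get the y coordinate right you need the position relative to the page size: the value returned by getYDirAdj() is an absolute one and often does not match the position on the page, which is why the code below subtracts it from the page height.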
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    boolean isFound = false;
    float posXInit = 0,
          posXEnd = 0,
          posYInit = 0,
          posYEnd = 0,
          height = 0;
    String[] criteria = {"Word1", "Word2", "Word3", ....};
    for (int i = 0; i < criteria.length; i++) {
        if (string.contains(criteria[i])) {
            isFound = true;
            break;
        }
    }
    if (isFound) {
        TextPosition first = textPositions.get(0);
        TextPosition last = textPositions.get(textPositions.size() - 1);
        posXInit = first.getXDirAdj();
        posXEnd = last.getXDirAdj() + last.getWidth();
        // flip y: subtract the value from getYDirAdj() from the page height
        posYInit = first.getPageHeight() - first.getYDirAdj();
        posYEnd = first.getPageHeight() - last.getYDirAdj();
        height = first.getHeightDir();
        System.out.println(string + "X-Init = " + posXInit + "; Y-Init = " + posYInit + "; X-End = " + posXEnd + "; Y-End = " + posYEnd + "; Font-Height = " + height);
        float quadPoints[] = {posXInit, posYEnd + height + 2, posXEnd, posYEnd + height + 2, posXInit, posYInit - 2, posXEnd, posYEnd - 2};
        // document is assumed to be a field holding the PDDocument being processed
        List<PDAnnotation> annotations = document.getPage(this.getCurrentPageNo() - 1).getAnnotations();
        PDAnnotationTextMarkup highlight = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);
        PDRectangle position = new PDRectangle();
        position.setLowerLeftX(posXInit);
        position.setLowerLeftY(posYEnd);
        position.setUpperRightX(posXEnd);
        position.setUpperRightY(posYEnd + height);
        highlight.setRectangle(position);
        // quadPoints is array of x,y coordinates in Z-like order (top-left, top-right, bottom-left,bottom-right)
        // of the area to be highlighted
        highlight.setQuadPoints(quadPoints);
        PDColor yellow = new PDColor(new float[]{1, 1, 1 / 255F}, PDDeviceRGB.INSTANCE);
        highlight.setColor(yellow);
        annotations.add(highlight);
    }
}
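The coordinate arithmetic in the method above is easy to get wrong, so here is a small sketch of just that part with plain floats. HighlightGeometry and its two methods are hypothetical helpers made up for this illustration; unlike the code above they add no 2-unit padding, and they assume the whole word sits on one baseline.

```java
final class HighlightGeometry {
    /** Converts a y measured downward from the page top into one measured upward from the page bottom. */
    static float toBottomUp(float pageHeight, float yFromTop) {
        return pageHeight - yFromTop;
    }

    /**
     * Corner coordinates in the Z-like order used above:
     * top-left, top-right, bottom-left, bottom-right.
     * baselineY is already bottom-up; the box extends height upward from it.
     */
    static float[] quadPoints(float xStart, float xEnd, float baselineY, float height) {
        float top = baselineY + height;
        return new float[] {
            xStart, top,
            xEnd, top,
            xStart, baselineY,
            xEnd, baselineY
        };
    }

    public static void main(String[] args) {
        // Page height and positions are example values only.
        float baseline = toBottomUp(792f, 158.06f);
        float[] quad = quadPoints(164.4f, 199.1f, baseline, 11f);
        System.out.println(java.util.Arrays.toString(quad));
    }
}
```

Keeping the flip in one place makes it harder to mix the two y conventions when filling both the rectangle and the quad points.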
You can try this:
@Override
protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
    TextPosition startPos = textPositions.get(0);
    TextPosition endPos = textPositions.get(textPositions.size() - 1);
    System.out.println(str + " [(" + startPos.getXDirAdj() + "," + startPos.getYDirAdj() + ") ,("
            + endPos.getXDirAdj() + "," + endPos.getYDirAdj() + ")]");
}
The output looks like this: 'String [(54.0,746.08) ,(99.71,746.08)]'