Pdfclown: How to override the existing highlighted keyword in pdfclown
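I have a requirement in pdfclown: if some keywords are substrings of (or otherwise match) other keywords, then when highlighting, the shorter keywords must be overridden and only the complete keyword should be highlighted. For example, in the map below the keyword ETS is a substring of the keywords Just. ETS and Test. ETS. The expected result is that the complete keywords Just. ETS and Test. ETS are highlighted, with their popup values, instead of the ETS keyword and its popup value.
ActualPdf, actual result pdf, and jar path.
Map<String, String> map = new HashMap<String, String>();
map.put("ETS" , "Loss");
map.put("Just. ETS" , "Net ");
map.put("Test. ETS" , "Profit");
(Note: 1. If a longer keyword is already highlighted in the file, a shorter keyword that matches part of it should not be highlighted. 2. If a shorter keyword is already highlighted and it matches part of a longer keyword, the longer keyword should be highlighted and the shorter keyword should be ignored/unhighlighted.)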
import java.awt.Color;
import java.awt.Desktop;
import java.awt.geom.Rectangle2D;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.File;
import org.pdfclown.documents.Page;
import org.pdfclown.documents.contents.ITextString;
import org.pdfclown.documents.contents.TextChar;
import org.pdfclown.documents.contents.colorSpaces.DeviceRGBColor;
import org.pdfclown.documents.interaction.annotations.TextMarkup;
import org.pdfclown.documents.interaction.annotations.TextMarkup.MarkupTypeEnum;
import org.pdfclown.files.SerializationModeEnum;
import org.pdfclown.util.math.Interval;
import org.pdfclown.util.math.geom.Quad;
import org.pdfclown.tools.TextExtractor;
public class pdfclown2 {
    private static int count;
    public static void main(String[] args) throws IOException {
        highlight("C:\\Users\\uc23\\Desktop\\pdf\\80743064.pdf", "C:\\Users\\\\Downloads\\6.pdf");
        System.out.println("OK");
    }
    private static void highlight(String inputPath, String outputPath) throws IOException {
        org.pdfclown.files.File file = null;
        try {
            file = new org.pdfclown.files.File("C:\\Users\\uc239646\\Desktop\\test.pdf");
            List<Keyword> l = new ArrayList<Keyword>();
            Keyword k = new Keyword();
            Keyword k1 = new Keyword();
            k1.setKey("Just. ETS");
            k1.setValue("NET");
            l.add(k1);
            Keyword k2 = new Keyword();
            k2.setKey("Test. ETS");
            k2.setValue("PROFIT");
            l.add(k2);
            k.setKey("ETS");
            k.setValue("LOSS");
            l.add(k);
            long startTime = System.currentTimeMillis();
            // 2. Iterating through the document pages...
            TextExtractor textExtractor = new TextExtractor(true, true);
            for (final Page page : file.getDocument().getPages()) {
                Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);
                for (Keyword e : l) {
                    Pattern pattern;
                    String searchKey = e.getKey();
                    final String translationKeyword = e.getValue();
                    if ((searchKey.contains(")") && searchKey.contains("("))
                            || (searchKey.contains("(") && !searchKey.contains(")"))
                            || (searchKey.contains(")") && !searchKey.contains("(")) || searchKey.contains("?")
                            || searchKey.contains("*") || searchKey.contains("+")) {
                        pattern = Pattern.compile(Pattern.quote(searchKey), Pattern.CASE_INSENSITIVE);
                    }
                    else
                        pattern = Pattern.compile("\\b" + searchKey + "\\b", Pattern.CASE_INSENSITIVE);
                    // 2.1. Extract the page text!
                    //System.out.println(textStrings.toString().indexOf(entry.getKey()));
                    // 2.2. Find the text pattern matches!
                    final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings).toLowerCase());
                    // 2.3. Highlight the text pattern matches!
                    //System.out.println(textStrings);
                    textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter() {
                        public boolean hasNext() {
                            // if(key.getMatchCriteria() == 1){
                            if (matcher.find()) {
                                return true;
                            }
                            /*
                             * } else if(key.getMatchCriteria() == 2) { if (matcher.hitEnd()) { count++; return true; } }
                             */
                            return false;
                        }
                        public Interval<Integer> next() {
                            return new Interval<Integer>(matcher.start(), matcher.end());
                        }
                        public void process(Interval<Integer> interval, ITextString match) {
                            System.out.println(match);
                            // Defining the highlight box of the text pattern match...
                            /*List l=new ArrayList();
                            if(!l.contains(match)){
                            System.out.println("map.put("+match+","+translationKeyword+")");
                            }
                            */
                            List<Quad> highlightQuads = new ArrayList<Quad>();
                            {
                                Rectangle2D textBox = null;
                                for (TextChar textChar : match.getTextChars()) {
                                    Rectangle2D textCharBox = textChar.getBox();
                                    if (textBox == null) {
                                        textBox = (Rectangle2D) textCharBox.clone();
                                    } else {
                                        if (textCharBox.getY() > textBox.getMaxY()) {
                                            highlightQuads.add(Quad.get(textBox));
                                            textBox = (Rectangle2D) textCharBox.clone();
                                        } else {
                                            textBox.add(textCharBox);
                                        }
                                    }
                                    System.out.println(highlightQuads.contains(textBox));
                                    textBox.setRect(textBox.getX(), textBox.getY(), textBox.getWidth(), textBox.getHeight());
                                    highlightQuads.add(Quad.get(textBox));
                                }
                                /* List<Quad> highlightQuads = new ArrayList<Quad>();
                                List<TextChar> textChars = match.getTextChars();
                                Rectangle2D firstRect = textChars.get(0).getBox();
                                Rectangle2D lastRect = textChars.get(textChars.size()-1).getBox();
                                Rectangle2D rect = firstRect.createUnion(lastRect);
                                highlightQuads.add(Quad.get(rect));*/
                                // subtype can be Highlight, Underline, StrikeOut, Squiggly
                                new TextMarkup(page, highlightQuads, translationKeyword, MarkupTypeEnum.Highlight);
                            }
                        }
                        public void remove() {
                            throw new UnsupportedOperationException();
                        }
                    });
                }
            }
            SerializationModeEnum serializationMode = SerializationModeEnum.Standard;
            file.save(new java.io.File(outputPath), serializationMode);
            System.out.println("file created");
            long endTime = System.currentTimeMillis();
            System.out.println("seconds take for execution is:"+(endTime-startTime)/1000);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
As already mentioned in the comments (which have meanwhile been moved to chat):
Your issue only becomes a PDF Clown issue because you try to put the cart before the horse:
You have determined that you are creating too many highlights.
The obvious solution would be to stop making those surplus highlights from the start, and sorting that out is an issue unrelated to PDF Clown.
Your attempted solution, on the other hand, is to remove the surplus highlights after the fact, and only this makes it a PDF Clown issue for you because now you have to search the already existing highlights for overlaps. That solution is a possible one, too, but it unnecessarily wastes resources.
Here is an approach that sorts out the unwanted matches before any highlights are created for them. The contents of the loop over the pages are replaced as follows:
[...]
TextExtractor textExtractor = new TextExtractor(true, true);
for (final Page page : file.getDocument().getPages()) {
    Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);
    // Collect the matches of all keywords on this page first; nothing is highlighted yet.
    List<Match> matches = new ArrayList<>();
    for (Keyword e : l) {
        final String searchKey = e.getKey();
        final String translationKeyword = e.getValue();
        final Pattern pattern;
        if ((searchKey.contains(")") && searchKey.contains("("))
                || (searchKey.contains("(") && !searchKey.contains(")"))
                || (searchKey.contains(")") && !searchKey.contains("(")) || searchKey.contains("?")
                || searchKey.contains("*") || searchKey.contains("+")) {
            pattern = Pattern.compile(Pattern.quote(searchKey), Pattern.CASE_INSENSITIVE);
        } else
            pattern = Pattern.compile("\\b" + searchKey + "\\b", Pattern.CASE_INSENSITIVE);
        final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings).toLowerCase());
        textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter() {
            public boolean hasNext() {
                return matcher.find();
            }
            public Interval<Integer> next() {
                // Low end inclusive, high end exclusive.
                return new Interval<Integer>(matcher.start(), matcher.end(), true, false);
            }
            public void process(Interval<Integer> interval, ITextString match) {
                // Only remember the match; the highlight is created later.
                matches.add(new Match(interval, match, translationKeyword));
            }
            public void remove() {
                throw new UnsupportedOperationException();
            }
        });
    }
    // Drop overlapping matches, then highlight only the survivors.
    removeOverlaps(matches);
    for (Match match : matches) {
        List<Quad> highlightQuads = new ArrayList<Quad>();
        {
            Rectangle2D textBox = null;
            for (TextChar textChar : match.match.getTextChars()) {
                Rectangle2D textCharBox = textChar.getBox();
                if (textBox == null) {
                    textBox = (Rectangle2D) textCharBox.clone();
                } else {
                    if (textCharBox.getY() > textBox.getMaxY()) {
                        // The match continues on a new line: close the current box and start a new one.
                        highlightQuads.add(Quad.get(textBox));
                        textBox = (Rectangle2D) textCharBox.clone();
                    } else {
                        textBox.add(textCharBox);
                    }
                }
            }
            // Close the last box once, after all characters have been processed
            // (the question's code did this inside the loop, adding one quad per character).
            if (textBox != null) {
                textBox.setRect(textBox.getX(), textBox.getY(), textBox.getWidth(),
                        textBox.getHeight());
                highlightQuads.add(Quad.get(textBox));
            }
        }
        new TextMarkup(page, highlightQuads, match.tag, MarkupTypeEnum.Highlight);
    }
}
[...]
(ComplexHighlight test testMarkLikeSeshadriImproved)
using these helper methods/classes:
// Requires: import java.util.Collections;
static void removeOverlaps(List<Match> matches) {
    // Sort so that the preferred match of any overlapping group comes first.
    Collections.sort(matches, ComplexHighlight::compareLowLengthTag);
    for (int i = 0; i < matches.size() - 1; i++) {
        Interval<Integer> intervalI = matches.get(i).interval;
        for (int j = i + 1; j < matches.size(); j++) {
            Interval<Integer> intervalJ = matches.get(j).interval;
            // Two half-open intervals overlap if each starts before the other ends.
            if (intervalI.getLow() < intervalJ.getHigh() && intervalJ.getLow() < intervalI.getHigh()) {
                System.out.printf("Match %d removed as it overlaps match %d.\n", j, i);
                matches.remove(j--);
            }
        }
    }
}
(ComplexHighlight method removeOverlaps)
// Orders by start position, then longer matches first, then by tag.
static int compareLowLengthTag(Match a, Match b) {
    int compare = a.interval.getLow().compareTo(b.interval.getLow());
    if (compare == 0)
        compare = - a.interval.getHigh().compareTo(b.interval.getHigh());
    if (compare == 0)
        compare = a.tag.compareTo(b.tag);
    return compare;
}
(ComplexHighlight method compareLowLengthTag)
class Match {
    final Interval<Integer> interval;
    final ITextString match;
    final String tag;
    public Match(final Interval<Integer> interval, final ITextString match, final String tag) {
        this.interval = interval;
        this.match = match;
        this.tag = tag;
    }
}
(Match class)
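As you can see, the matches here are not immediately added as highlights but are instead collected in a list, matches. This list is then processed so that it no longer contains overlaps, and only the elements of the remaining, overlap-free list are added as highlights.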
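The collect-then-filter idea can be tried out without PDF Clown at all. The following is only a minimal sketch on a plain string: the names OverlapDemo, Hit and collect are hypothetical, plain character offsets stand in for Interval and ITextString, and the keywords are assumed to be literal text that starts and ends with a word character (so they are quoted as a whole and wrapped in \b).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    /** Hypothetical stand-in for the Match class, using plain offsets. */
    static final class Hit {
        final int low;    // inclusive start offset
        final int high;   // exclusive end offset
        final String tag; // popup value of the keyword
        Hit(int low, int high, String tag) {
            this.low = low;
            this.high = high;
            this.tag = tag;
        }
    }

    /** Collects the hits of all keywords; nothing is "highlighted" yet. */
    static List<Hit> collect(String text, Map<String, String> keywords) {
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : keywords.entrySet()) {
            // Assumption: keywords are literals, so the whole keyword is quoted.
            Pattern pattern = Pattern.compile("\\b" + Pattern.quote(e.getKey()) + "\\b",
                    Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                hits.add(new Hit(matcher.start(), matcher.end(), e.getValue()));
            }
        }
        return hits;
    }

    /** Earlier start first, then longer hit first, then tag. */
    static int compareLowLengthTag(Hit a, Hit b) {
        int compare = Integer.compare(a.low, b.low);
        if (compare == 0)
            compare = -Integer.compare(a.high, b.high);
        if (compare == 0)
            compare = a.tag.compareTo(b.tag);
        return compare;
    }

    /** Sorts by priority, then removes every later hit overlapping an earlier one. */
    static void removeOverlaps(List<Hit> hits) {
        hits.sort(OverlapDemo::compareLowLengthTag);
        for (int i = 0; i < hits.size() - 1; i++) {
            Hit a = hits.get(i);
            for (int j = i + 1; j < hits.size(); j++) {
                Hit b = hits.get(j);
                if (a.low < b.high && b.low < a.high) {
                    hits.remove(j--);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> keywords = new LinkedHashMap<>();
        keywords.put("ETS", "Loss");
        keywords.put("Just. ETS", "Net");
        keywords.put("Test. ETS", "Profit");
        List<Hit> hits = collect("Just. ETS rose, Test. ETS fell, ETS flat", keywords);
        removeOverlaps(hits);
        for (Hit hit : hits) {
            System.out.println(hit.low + ".." + hit.high + " -> " + hit.tag);
        }
    }
}
```

In this sample text the two ETS hits inside Just. ETS and Test. ETS are dropped, and only the standalone ETS at the end keeps its Loss tag.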
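As mentioned in the comments, you have to decide on the priority of the matches.
For example, with the search terms "AB" and "BCD" and the document text "ABCD", the comparison method compareLowLengthTag used above always prefers the AB match, whereas the following comparison method compareLengthLowTag prefers the longer match BCD and only picks the match that starts earlier when the lengths are equal: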
// Orders by length (longer matches first), then by start position, then by tag.
static int compareLengthLowTag(Match a, Match b) {
    int aLength = a.interval.getHigh() - a.interval.getLow();
    int bLength = b.interval.getHigh() - b.interval.getLow();
    int compare = - Integer.compare(aLength, bLength);
    if (compare == 0)
        compare = a.interval.getLow().compareTo(b.interval.getLow());
    if (compare == 0)
        compare = a.tag.compareTo(b.tag);
    return compare;
}
(ComplexHighlight method compareLengthLowTag)
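To make the AB/BCD example concrete, here is a small sketch that runs both orderings over the same two overlapping spans. It does not use PDF Clown: PriorityDemo, Span and resolve are hypothetical names, and the spans are entered by hand as character offsets into "ABCD" rather than found by a text search.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PriorityDemo {
    /** Hypothetical stand-in for Match: half-open offsets [low, high) plus a tag. */
    static final class Span {
        final int low;
        final int high;
        final String tag;
        Span(int low, int high, String tag) {
            this.low = low;
            this.high = high;
            this.tag = tag;
        }
    }

    /** Earlier start wins; on equal start the longer span wins. */
    static int compareLowLengthTag(Span a, Span b) {
        int compare = Integer.compare(a.low, b.low);
        if (compare == 0)
            compare = -Integer.compare(a.high, b.high);
        if (compare == 0)
            compare = a.tag.compareTo(b.tag);
        return compare;
    }

    /** Longer span wins; on equal length the earlier start wins. */
    static int compareLengthLowTag(Span a, Span b) {
        int compare = -Integer.compare(a.high - a.low, b.high - b.low);
        if (compare == 0)
            compare = Integer.compare(a.low, b.low);
        if (compare == 0)
            compare = a.tag.compareTo(b.tag);
        return compare;
    }

    /** Returns a copy sorted by the given priority with lower-priority overlaps removed. */
    static List<Span> resolve(List<Span> spans, Comparator<Span> priority) {
        List<Span> result = new ArrayList<>(spans);
        result.sort(priority);
        for (int i = 0; i < result.size() - 1; i++) {
            Span a = result.get(i);
            for (int j = i + 1; j < result.size(); j++) {
                Span b = result.get(j);
                if (a.low < b.high && b.low < a.high) {
                    result.remove(j--);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "AB" covers offsets 0..2 and "BCD" covers offsets 1..4 of "ABCD".
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(0, 2, "AB"));
        spans.add(new Span(1, 4, "BCD"));
        System.out.println(resolve(spans, PriorityDemo::compareLowLengthTag).get(0).tag);
        System.out.println(resolve(spans, PriorityDemo::compareLengthLowTag).get(0).tag);
    }
}
```

With the start-first ordering only AB survives, and with the length-first ordering only BCD survives. For the requirement in the question (longer keywords override the shorter ones they contain), both orderings keep Just. ETS and Test. ETS over ETS; they only differ when matches overlap partially.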