How to reduce the size of merged PDF/A-1b files with PDFBox or another Java library
Input: a list of (e.g. 14) PDF/A-1b files with embedded fonts.
Processing: a simple merge with Apache PDFBox (a minimal sketch of such a merge follows right after this list).
Result: one PDF/A-1b file that is far too big, almost the sum of the sizes of all source files.
Question: Is there a way to reduce the file size of the resulting PDF?
Idea: Remove the redundant embedded fonts. But how? And is that even the right approach?
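For reference, "a simple merge with Apache PDFBox" means something along the lines of the following minimal sketch (class name, method signature and file handling are illustrative, not the actual code used):

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

// Plain merge: every source document is copied as-is, so identical embedded
// fonts end up duplicated in the merged result.
static void mergeSources(List<File> sources, File target) throws IOException {
    PDFMergerUtility merger = new PDFMergerUtility();
    for (File source : sources) {
        merger.addSource(source);
    }
    merger.setDestinationFileName(target.getAbsolutePath());
    merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
}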
Unfortunately the following font-removal attempt does not do the job; it merely highlights the obvious problem.
try (PDDocument document = PDDocument.load(new File("E:/tmp/16189_ZU_20181121195111_5544_2008-12-31_Standardauswertung.pdf"))) {
    List<COSName> collectedFonts = new ArrayList<>();
    PDPageTree pages = document.getDocumentCatalog().getPages();
    int pageNr = 0;
    for (PDPage page : pages) {
        pageNr++;
        Iterable<COSName> names = page.getResources().getFontNames();
        System.out.println("Page " + pageNr);
        for (COSName name : names) {
            collectedFonts.add(name);
            System.out.print("\t" + name + " - ");
            PDFont font = page.getResources().getFont(name);
            System.out.println(font + ", embedded: " + font.isEmbedded());
            page.getCOSObject().removeItem(COSName.F);
            page.getResources().getCOSObject().removeItem(name);
        }
    }
    document.save("E:/tmp/output.pdf");
}
The code produces output like this:
Page 1
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 2
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 3
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 4
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 5
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 6
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 7
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 8
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 9
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 10
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 11
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 12
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 13
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 14
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Any help is appreciated...
The code in this answer attempts to optimize documents like the OP's example document, i.e. documents containing copies of exactly identical objects, in the case at hand completely identical, fully embedded fonts. It does not merge merely nearly identical objects, e.g. multiple subsets of the same font into one union subset.
In the course of the comments to the question it became clear that the duplicated fonts in the OP's PDF indeed are complete, identical copies of a source font file. To merge such duplicate objects, one has to collect the complex objects (arrays, dictionaries, streams) of a document, compare them with one another, and then merge the duplicates.
As an actual pairwise comparison of all complex objects of a document can take too much time for large documents, the following code calculates a hash for these objects and only compares objects with identical hash values.
To merge duplicates, the code picks one of them and replaces all references to any of the other duplicates with a reference to the chosen one, thereby removing the other duplicates from the document's object pool. To do this efficiently, the code initially collects not only all complex objects but also all references to each of them.
The optimization code
Here is the method to call to optimize a PDDocument:
public void optimize(PDDocument pdDocument) throws IOException {
    Map<COSBase, Collection<Reference>> complexObjects = findComplexObjects(pdDocument);
    for (int pass = 0; ; pass++) {
        int merges = mergeDuplicates(complexObjects);
        if (merges <= 0) {
            System.out.printf("Pass %d - No merged objects\n\n", pass);
            break;
        }
        System.out.printf("Pass %d - Merged objects: %d\n\n", pass, merges);
    }
}
(OptimizeAfterMerge method optimize)
The optimization takes multiple passes because the equality of some objects can only be recognized after the duplicates they reference have been merged; e.g. two font dictionaries only become identical once the two identical font file streams they point to have been merged into one.
The following helper methods and classes collect the complex objects of a PDF and the references to each of them:
Map<COSBase, Collection<Reference>> findComplexObjects(PDDocument pdDocument) {
    COSDictionary catalogDictionary = pdDocument.getDocumentCatalog().getCOSObject();
    Map<COSBase, Collection<Reference>> incomingReferences = new HashMap<>();
    incomingReferences.put(catalogDictionary, new ArrayList<>());

    Set<COSBase> lastPass = Collections.<COSBase>singleton(catalogDictionary);
    Set<COSBase> thisPass = new HashSet<>();
    while (!lastPass.isEmpty()) {
        for (COSBase object : lastPass) {
            if (object instanceof COSArray) {
                COSArray array = (COSArray) object;
                for (int i = 0; i < array.size(); i++) {
                    addTarget(new ArrayReference(array, i), incomingReferences, thisPass);
                }
            } else if (object instanceof COSDictionary) {
                COSDictionary dictionary = (COSDictionary) object;
                for (COSName key : dictionary.keySet()) {
                    addTarget(new DictionaryReference(dictionary, key), incomingReferences, thisPass);
                }
            }
        }
        lastPass = thisPass;
        thisPass = new HashSet<>();
    }

    return incomingReferences;
}

void addTarget(Reference reference, Map<COSBase, Collection<Reference>> incomingReferences, Set<COSBase> thisPass) {
    COSBase object = reference.getTo();
    if (object instanceof COSArray || object instanceof COSDictionary) {
        Collection<Reference> incoming = incomingReferences.get(object);
        if (incoming == null) {
            incoming = new ArrayList<>();
            incomingReferences.put(object, incoming);
            thisPass.add(object);
        }
        incoming.add(reference);
    }
}
(OptimizeAfterMerge helper methods findComplexObjects and addTarget)
interface Reference {
    public COSBase getFrom();

    public COSBase getTo();
    public void setTo(COSBase to);
}

static class ArrayReference implements Reference {
    public ArrayReference(COSArray array, int index) {
        this.from = array;
        this.index = index;
    }

    @Override
    public COSBase getFrom() {
        return from;
    }

    @Override
    public COSBase getTo() {
        return resolve(from.get(index));
    }

    @Override
    public void setTo(COSBase to) {
        from.set(index, to);
    }

    final COSArray from;
    final int index;
}

static class DictionaryReference implements Reference {
    public DictionaryReference(COSDictionary dictionary, COSName key) {
        this.from = dictionary;
        this.key = key;
    }

    @Override
    public COSBase getFrom() {
        return from;
    }

    @Override
    public COSBase getTo() {
        return resolve(from.getDictionaryObject(key));
    }

    @Override
    public void setTo(COSBase to) {
        from.setItem(key, to);
    }

    final COSDictionary from;
    final COSName key;
}
(OptimizeAfterMerge helper interface Reference with its implementations ArrayReference and DictionaryReference)
And the following helper methods and classes eventually identify and merge the duplicates:
int mergeDuplicates(Map<COSBase, Collection<Reference>> complexObjects) throws IOException {
    List<HashOfCOSBase> hashes = new ArrayList<>(complexObjects.size());
    for (COSBase object : complexObjects.keySet()) {
        hashes.add(new HashOfCOSBase(object));
    }
    Collections.sort(hashes);

    int removedDuplicates = 0;
    if (!hashes.isEmpty()) {
        int runStart = 0;
        int runHash = hashes.get(0).hash;
        for (int i = 1; i < hashes.size(); i++) {
            int hash = hashes.get(i).hash;
            if (hash != runHash) {
                int runSize = i - runStart;
                if (runSize != 1) {
                    System.out.printf("Equal hash %d for %d elements.\n", runHash, runSize);
                    removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, i));
                }
                runHash = hash;
                runStart = i;
            }
        }
        int runSize = hashes.size() - runStart;
        if (runSize != 1) {
            System.out.printf("Equal hash %d for %d elements.\n", runHash, runSize);
            removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, hashes.size()));
        }
    }
    return removedDuplicates;
}

int mergeRun(Map<COSBase, Collection<Reference>> complexObjects, List<HashOfCOSBase> run) {
    int removedDuplicates = 0;

    List<List<COSBase>> duplicateSets = new ArrayList<>();
    for (HashOfCOSBase entry : run) {
        COSBase element = entry.object;
        for (List<COSBase> duplicateSet : duplicateSets) {
            if (equals(element, duplicateSet.get(0))) {
                duplicateSet.add(element);
                element = null;
                break;
            }
        }
        if (element != null) {
            List<COSBase> duplicateSet = new ArrayList<>();
            duplicateSet.add(element);
            duplicateSets.add(duplicateSet);
        }
    }

    System.out.printf("Identified %d set(s) of identical objects in run.\n", duplicateSets.size());

    for (List<COSBase> duplicateSet : duplicateSets) {
        if (duplicateSet.size() > 1) {
            COSBase surviver = duplicateSet.remove(0);
            Collection<Reference> surviverReferences = complexObjects.get(surviver);
            for (COSBase object : duplicateSet) {
                Collection<Reference> references = complexObjects.get(object);
                for (Reference reference : references) {
                    reference.setTo(surviver);
                    surviverReferences.add(reference);
                }
                complexObjects.remove(object);
                removedDuplicates++;
            }
            surviver.setDirect(false);
        }
    }

    return removedDuplicates;
}

boolean equals(COSBase a, COSBase b) {
    if (a instanceof COSArray) {
        if (b instanceof COSArray) {
            COSArray aArray = (COSArray) a;
            COSArray bArray = (COSArray) b;
            if (aArray.size() == bArray.size()) {
                for (int i = 0; i < aArray.size(); i++) {
                    if (!resolve(aArray.get(i)).equals(resolve(bArray.get(i))))
                        return false;
                }
                return true;
            }
        }
    } else if (a instanceof COSDictionary) {
        if (b instanceof COSDictionary) {
            COSDictionary aDict = (COSDictionary) a;
            COSDictionary bDict = (COSDictionary) b;
            Set<COSName> keys = aDict.keySet();
            if (keys.equals(bDict.keySet())) {
                for (COSName key : keys) {
                    if (!resolve(aDict.getItem(key)).equals(bDict.getItem(key)))
                        return false;
                }
                // In case of COSStreams we strictly speaking should
                // also compare the stream contents here. But apparently
                // their hashes coincide well enough for the original
                // hashing equality, so let's just assume...
                return true;
            }
        }
    }
    return false;
}

static COSBase resolve(COSBase object) {
    while (object instanceof COSObject)
        object = ((COSObject) object).getObject();
    return object;
}
(OptimizeAfterMerge helper methods mergeDuplicates, mergeRun, equals, and resolve)
static class HashOfCOSBase implements Comparable<HashOfCOSBase> {
    public HashOfCOSBase(COSBase object) throws IOException {
        this.object = object;
        this.hash = calculateHash(object);
    }

    int calculateHash(COSBase object) throws IOException {
        if (object instanceof COSArray) {
            int result = 1;
            for (COSBase member : (COSArray) object)
                result = 31 * result + member.hashCode();
            return result;
        } else if (object instanceof COSDictionary) {
            int result = 3;
            for (Map.Entry<COSName, COSBase> entry : ((COSDictionary) object).entrySet())
                result += entry.hashCode();
            if (object instanceof COSStream) {
                try (InputStream data = ((COSStream) object).createRawInputStream()) {
                    MessageDigest md = MessageDigest.getInstance("MD5");
                    byte[] buffer = new byte[8192];
                    int bytesRead = 0;
                    while ((bytesRead = data.read(buffer)) >= 0)
                        md.update(buffer, 0, bytesRead);
                    result = 31 * result + Arrays.hashCode(md.digest());
                } catch (NoSuchAlgorithmException e) {
                    throw new IOException(e);
                }
            }
            return result;
        } else {
            throw new IllegalArgumentException(String.format("Unknown complex COSBase type %s", object.getClass().getName()));
        }
    }

    final COSBase object;
    final int hash;

    @Override
    public int compareTo(HashOfCOSBase o) {
        int result = Integer.compare(hash, o.hash);
        if (result == 0)
            result = Integer.compare(hashCode(), o.hashCode());
        return result;
    }
}
(OptimizeAfterMerge helper class HashOfCOSBase)
Applying the code to the OP's example document
The OP's example document is about 6.5 MB in size. Applying the code above like this
PDDocument pdDocument = PDDocument.load(SOURCE);
optimize(pdDocument);
pdDocument.save(RESULT);
results in a PDF of less than 700 KB which appears to be complete.
(If something is missing, please tell me and I will try to fix it.)
Words of warning
On the one hand, this optimizer will not recognize all identical duplicates. In particular, in the case of circular references, duplicate cycles of objects will not be recognized, because the code only identifies duplicates whose contents are identical, which is usually not the case inside duplicate object cycles.
On the other hand, this optimizer may already be too eager in some cases, because some copies might be needed as separate objects for PDF viewers to accept each instance as an individual entity.
Furthermore, this program touches all kinds of objects in the file, even those that define the internal structure of the PDF, but it does not attempt to update any of the PDFBox classes managing that structure (PDDocument, PDDocumentCatalog, PDAcroForm, ...). To keep pending changes from messing up the whole document, therefore, only apply this program to a freshly loaded, unmodified PDDocument instance, save it right away, and do nothing else with it.
While debugging into the file I noticed that the font file of the same font was referenced several times. So replacing the actual font file item in the dictionary with an already seen font file item removes the duplicate references and allows compression to take effect. With that, I was able to shrink a 30 MB file down to about 6 MB.
File file = new File("test.pdf");
PDDocument doc = PDDocument.load(file);

// Caches the first font file stream seen for each font name.
Map<String, COSBase> fontFileCache = new HashMap<>();
for (int pageNumber = 0; pageNumber < doc.getNumberOfPages(); pageNumber++) {
    final PDPage page = doc.getPage(pageNumber);
    // The /Font dictionary of the page's resources.
    COSDictionary pageDictionary = (COSDictionary) page.getResources().getCOSObject().getDictionaryObject(COSName.FONT);
    for (COSName currentFont : pageDictionary.keySet()) {
        COSDictionary fontDictionary = (COSDictionary) pageDictionary.getDictionaryObject(currentFont);
        for (COSName actualFont : fontDictionary.keySet()) {
            COSBase actualFontDictionaryObject = fontDictionary.getDictionaryObject(actualFont);
            if (actualFontDictionaryObject instanceof COSDictionary) {
                COSDictionary fontFile = (COSDictionary) actualFontDictionaryObject;
                if (fontFile.getItem(COSName.FONT_NAME) instanceof COSName) {
                    COSName fontName = (COSName) fontFile.getItem(COSName.FONT_NAME);
                    // Remember the first /FontFile2 stream seen for this font name and let
                    // every later descriptor point to that same stream. (/FontFile2 covers
                    // the TrueType fonts of the file at hand.)
                    fontFileCache.computeIfAbsent(fontName.getName(), key -> fontFile.getItem(COSName.FONT_FILE2));
                    fontFile.setItem(COSName.FONT_FILE2, fontFileCache.get(fontName.getName()));
                }
            }
        }
    }
}

final ByteArrayOutputStream baos = new ByteArrayOutputStream();
doc.save(baos);
final File compressed = new File("test_compressed.pdf");
baos.writeTo(new FileOutputStream(compressed));
Maybe this is not the most elegant way to do it, but it works and keeps PDF/A-1b compliance.
Another approach I found is to use iText 7 like this (pdfWriter.setSmartMode):
try (PdfWriter pdfWriter = new PdfWriter(out)) {
    pdfWriter.setSmartMode(true); // This is where the optimization happens, e.g. deduplicating redundantly embedded fonts
    pdfWriter.setCompressionLevel(Deflater.BEST_COMPRESSION);
    try (PdfDocument pdfDoc = new PdfADocument(pdfWriter, PdfAConformanceLevel.PDF_A_1B,
            new PdfOutputIntent("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", colorProfile))) {
        PdfMerger merger = new PdfMerger(pdfDoc);
        merger.setCloseSourceDocuments(true);
        try {
            for (InputStream pdf : pdfs) {
                try (PdfDocument doc = new PdfDocument(new PdfReader(pdf))) {
                    merger.merge(doc, createPageList(doc.getNumberOfPages()));
                }
            }
            merger.close();
        } catch (com.itextpdf.kernel.crypto.BadPasswordException e) {
            throw new BieneException("Konkatenierung eines passwortgeschützten PDF-Dokumentes nicht möglich: " + e.getMessage(),
                    e);
        } catch (com.itextpdf.io.IOException | PdfException e) {
            throw new BieneException(e.getMessage(), e);
        }
    }
}
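Note that out, pdfs, colorProfile and BieneException are application-specific and come from the surrounding code. The snippet also relies on a small helper createPageList that is not shown above; presumably it just enumerates all pages of the source document. A minimal sketch of such a helper (an assumption, not part of the original snippet) could be:

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the 1-based page number list that
// PdfMerger.merge(PdfDocument, List<Integer>) expects.
private static List<Integer> createPageList(int numberOfPages) {
    List<Integer> pages = new ArrayList<>(numberOfPages);
    for (int page = 1; page <= numberOfPages; page++) {
        pages.add(page);
    }
    return pages;
}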