Get PDF XMP Metadata without loading the complete document

Question

With libraries such as iTextSharp or iText, you can extract the metadata from a PDF document via a PdfReader:

using (var reader = new PdfReader(pdfBytes))
{
    return reader.Metadata == null ? null : Encoding.UTF8.GetString(reader.Metadata);
}

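Libraries like these parse the entire PDF document before they can retrieve the metadata. In my case this causes high system resource usage, because we receive many requests per second and the PDFs are large.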

Is there a way to extract the metadata from a PDF without first loading it completely into memory?

Answer 1

With PDF4NET you can extract the XMP metadata without loading the entire document into memory:

// This does a minimal parsing of the PDF file and loads 
// only a few objects from the file
PDFFile pdfFile = new PDFFile(new MemoryStream(pdfBytes));

string xmpMetadata = pdfFile.ExtractXmpMetadata();

Update 1: changed the code to load the file from a byte array.

Disclaimer: I work for the company that develops the PDF4NET library.

Answer 2

iText 5.x also supports partial reading of PDFs; it just looks a bit more complicated.

Instead of

using (var reader = new PdfReader(pdfBytes))

use

using (var reader = new PdfReader(new RandomAccessFileOrArray(pdfBytes), null, true))

The final true argument requests partial reading.
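
Combined with the extraction code from the question, a minimal sketch of the whole thing could look like the following. It assumes iTextSharp 5.x is referenced and that your version offers this constructor overload; the XmpExtractor class and ReadXmp method names are made up for illustration and are not part of the library.

```csharp
using System.Text;
using iTextSharp.text.pdf;

// Hypothetical helper class, not part of iTextSharp.
public static class XmpExtractor
{
    // Returns the XMP metadata as a string, or null if the PDF carries none.
    public static string ReadXmp(byte[] pdfBytes)
    {
        // The final 'true' requests partial reading, as described above.
        using (var reader = new PdfReader(new RandomAccessFileOrArray(pdfBytes), null, true))
        {
            byte[] xmp = reader.Metadata;
            return xmp == null ? null : Encoding.UTF8.GetString(xmp);
        }
    }
}
```

Note that pdfBytes itself is still held in memory here; partial reading only reduces how much of the document the reader parses.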

c#

pdf

xmp