How to find all internal links in a PDF, using Java Apache PDFBox
I am using the following code (Kotlin) to find hyperlinks in a PDF:
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink
import ... destination.PDPageXYZDestination
import java.io.File

fun findAnnotationsTest() {
    val pdfPath = "LinkedPDF.pdf"
    val doc = PDDocument.load(File(pdfPath))
    var pageNo = 0
    for (page in doc.pages) {
        pageNo++
        for (annotation in page.annotations) {
            val subtype = annotation.subtype
            println("Found Annotation ($subtype) on page $pageNo")
            if (annotation is PDAnnotationLink) {
                val aname = annotation.annotationName
                println("\t\tfound Link named $aname on page $pageNo")
                val link = annotation
                println("\t\tas string: " + link.toString())
                println("\t\tdestination: " + link.getDestination())
                val dest = link.destination
                val destClass = dest::class
                println("\t\tdest class is $destClass")
                if (dest is PDPageXYZDestination) {
                    val pageNumber = dest.pageNumber
                    println("\t\tdest page number is $pageNumber")
                }
                val action = link.action
                if (action == null) {
                    println("\t\tbut action is null")
                    continue
                }
                if (action is PDActionURI)
                    println("\t\tURI action is ${action.uri}")
                else
                    println("\t\tother action is ${action::class}")
            } else {
                println("\tNOT a link")
            }
        }
    }
}
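The input file has hundreds of (working) internal links.
This code finds the annotations and recognizes them as links, but the PDAction is null and the page number of the PDPageXYZDestination is -1. The output for each link looks like this: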
Found Annotation (Link) on page 216
found Link (Link) named null on page 216
as string: org.apache.pdfbox....annotation.PDAnnotationLink@3234e239
destination: org.apache.pdfbox.....destination.PDPageXYZDestination@3d921e20
dest class is class org.apache.pdfbox...destination.PDPageXYZDestination
dest page number is -1
but action is null
By the way, the PDF was created by saving an MS Word document (with internal links to Word bookmarks) as a PDF.
Any ideas on what I am doing wrong?
Here is the PDF (sample): NBSample.pdf
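The destination in a PDPageDestination is not a number (that is only the case for external page links); it is a page dictionary, so extra effort is needed to get the page number (the method's javadoc mentions this). Here is a slightly modified excerpt from the PrintBookmarks.java example: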
if (dest instanceof PDPageDestination)
{
    PDPageDestination pd = (PDPageDestination) dest;
    System.out.println("Destination page: " + (pd.retrievePageNumber() + 1));
}
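To put the excerpt in context, here is a minimal sketch of the whole loop from the question, rewritten in Java with retrievePageNumber(). It assumes PDFBox 2.x (where PDDocument.load exists). The class name InternalLinkFinder and the method findInternalLinks are made up for illustration. The sketch also covers two cases that the sample PDF may not need: a link that stores its target in a GoTo action instead of directly on the annotation, and a named destination, which is resolved through the document catalog.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionGoTo;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDDestination;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDNamedDestination;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDPageDestination;

// Hypothetical helper class, not part of PDFBox.
public class InternalLinkFinder {

    /**
     * Returns one {sourcePage, targetPage} pair (both 1-based) for every
     * link annotation whose target page can be resolved inside the document.
     */
    public static List<int[]> findInternalLinks(PDDocument doc) throws IOException {
        List<int[]> links = new ArrayList<>();
        int pageNo = 0;
        for (PDPage page : doc.getPages()) {
            pageNo++;
            for (PDAnnotation annotation : page.getAnnotations()) {
                if (!(annotation instanceof PDAnnotationLink)) {
                    continue;
                }
                PDAnnotationLink link = (PDAnnotationLink) annotation;

                // The target is either set directly on the annotation...
                PDDestination dest = link.getDestination();
                if (dest == null) {
                    // ...or wrapped in a GoTo action.
                    PDAction action = link.getAction();
                    if (action instanceof PDActionGoTo) {
                        dest = ((PDActionGoTo) action).getDestination();
                    }
                }
                // Named destinations have to be looked up in the catalog first.
                if (dest instanceof PDNamedDestination) {
                    dest = doc.getDocumentCatalog()
                              .findNamedDestinationPage((PDNamedDestination) dest);
                }
                if (dest instanceof PDPageDestination) {
                    // retrievePageNumber() is 0-based and returns -1 if the page is not found.
                    int target = ((PDPageDestination) dest).retrievePageNumber();
                    if (target >= 0) {
                        links.add(new int[] {pageNo, target + 1});
                    }
                }
            }
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            for (int[] link : findInternalLinks(doc)) {
                System.out.println("Link on page " + link[0] + " -> page " + link[1]);
            }
        }
    }
}
```

Run it with the path to the PDF as the only argument, for example LinkedPDF.pdf from the question. URI links and links whose target cannot be resolved are skipped silently here; whether that is the right behaviour depends on what you want to report.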