"Find Tag from Selection" is not working in tagged PDF?

Question

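I tagged a PDF using PDFBox.

How I tagged it: Instead of extracting the text and tagging it, I am adding MCIDs to the existing content stream (both opening and closing, e.g. /p<< MCID 0 >> BDC .. .. .. EMC), and then I am adding that marked content to the document's root structure.

What is working: Almost everything works fine, as in a fully tagged PDF. It also passes the PAC3 accessibility checker.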

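// Adding tags: a BDC operator takes two operands, a tag and a property dictionary
tokens.add(++ind, type_check(t_ype, page)); // presumably the tag operand (type_check is the asker's own helper)
currentMarkedContentDictionary = new COSDictionary();
currentMarkedContentDictionary.setInt(COSName.MCID, mcid);
if (altText != null && !altText.isEmpty()) {
    currentMarkedContentDictionary.setString(COSName.ALT, altText);
}
mcid++; // the next marked-content sequence gets the next MCID
tokens.add(++ind, currentMarkedContentDictionary); // property dictionary operand
tokens.add(++ind, Operator.getOperator("BDC"));    // begins the marked-content sequence

// Adding marked content to root structure
structureElement.appendKid(markedContent);

currentSection.appendKid(structureElement);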

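What is not working: One feature is missing in the tagged structure after tagging. There is an option called "Find Tag from Selection". It is not working. When I select some text and press "Find Tag from Selection" in the root structure, it always goes to the last tag. Please find the PDF at the link below.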

https://drive.google.com/file/d/11Lhuj50Bb9kChvD0kL_GOHQn4RNKZ0hR/view?usp=sharing

Parent tree:

https://drive.google.com/file/d/109xhUpqsQSFLPJB2nhXoU9ssMKnyht3G/view?usp=sharing

Additional document with tags and parent tree: https://drive.google.com/file/d/1yzZSsjkb5_dGfq1Wu3VxsH73vr3alRmC/view?usp=sharing

Please help me with this.

New issue: I observed the following.

When JAWS is reading my tagged document and I press Ctrl+Shift+5 on a Windows machine (in Adobe Acrobat DC), a dialog shows a dropdown with the options "Read based on tagged structure" and "Top left to bottom right", and below it two radio buttons: "Read current page" and "Read all pages".

I selected "Read based on tagged structure" and "Read current page". Now JAWS is not reading the tag structure. But if I use the same document with "Read entire document", it reads perfectly.

Link to the document:

https://drive.google.com/file/d/1CguMHa4DikFMP15VGERnPNWRq5vO3u6I/view?usp=sharing

Any help?

Answer 1

Nesting issue

How I was tagged: Instead of extract text and tagging I am adding mcid's to the existing content stream (both open and closing ex: /p<< MCID 0 >> BDC .. .. .. EMC)

You are doing that incorrectly. For example, look at the start of the page content stream in your document:

BT
0 i
/C0_0 18 Tf
41.91 740.175 Td
/H2 <</MCID  0  >> BDC
( \) F M M P  8 P S M E) Tj
ET
/TouchUp_TextEdit MP
BT
/C0_1 14 Tf
EMC

Focusing on the starts and ends of text objects and marked content, we see that you have BT ... BDC ... ET ... BT ... EMC

According to the specification, though:

When the marked-content operators BMC, BDC, and EMC are combined with the text object operators BT and ET (see 9.4, “Text Objects”), each pair of matching operators (BMC…EMC, BDC…EMC, or BT…ET) shall be properly (separately) nested. Therefore, the sequences
BMC             BT
  BT              BMC
    …    and         …
  ET              EMC
EMC             ET
are valid, but
BMC             BT
  BT              BMC
    …    and         …
  EMC             ET
ET              EMC
are not valid.

(ISO 32000-1, section 14.6 "Marked Content")
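The nesting rule quoted above can be checked mechanically with a stack. The following is only a minimal sketch in plain Java, not PDFBox code: it assumes the content stream has already been reduced to a list of operator names, and the class name NestingCheck is made up for illustration. It checks only how the BT/ET and BMC/BDC/EMC pairs nest, nothing else about the stream.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Checks that BT...ET and BMC/BDC...EMC pairs in an operator sequence are properly nested. */
final class NestingCheck {

    /** Returns true if every ET closes the innermost open BT and every EMC the innermost open BMC/BDC. */
    static boolean isProperlyNested(List<String> operators) {
        Deque<String> open = new ArrayDeque<>();
        for (String op : operators) {
            switch (op) {
                case "BT":
                case "BMC":
                case "BDC":
                    open.push(op);
                    break;
                case "ET":
                    // the innermost open pair must be a text object
                    if (open.isEmpty() || !open.pop().equals("BT")) return false;
                    break;
                case "EMC":
                    // the innermost open pair must be a marked-content sequence
                    if (open.isEmpty() || open.pop().equals("BT")) return false;
                    break;
                default:
                    break; // all other operators do not affect the nesting
            }
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        // Operator order from the stream excerpt above: BT ... BDC ... ET ... BT ... EMC
        System.out.println(isProperlyNested(
                List.of("BT", "Tf", "Td", "BDC", "Tj", "ET", "MP", "BT", "Tf", "EMC"))); // prints false
        // Marked content wrapped around a complete text object
        System.out.println(isProperlyNested(
                List.of("BDC", "BT", "Tf", "Td", "Tj", "ET", "EMC"))); // prints true
    }
}
```

When inserting BDC and EMC tokens into an existing token list, running such a check over the resulting operator order before writing the stream would flag the problem shown above.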

This issue is fixed in the second shared PDF, res1.pdf.

Missing ParentTree and StructParents

The issue your question focuses on is

There is an option called "Find Tag from Selection" . Is not working.

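"Find Tag from Selection" essentially means that you have the MCID of some content stream instructions and you search the structure tree for the structure element referencing that marked-content ID.

How a PDF processor is to do this is described in section 14.7.4.4 "Finding Structure Elements from Content Items" of the PDF specification ISO 32000-1 (or section 14.7.5.4 in ISO 32000-2):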

Because a stream cannot contain object references, there is no way for content items that are marked-content sequences to refer directly back to their parent structure elements (the ones to which they belong as content items). Instead, a different mechanism, the structural parent tree, shall be provided for this purpose. For consistency, content items that are entire PDF objects, such as XObjects, shall also use the parent tree to refer to their parent structure elements.

The parent tree is a number tree, accessed from the ParentTree entry in a document’s structure tree root. The tree shall contain an entry for each object that is a content item of at least one structure element and for each content stream containing at least one marked-content sequence that is a content item.

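Your PDF does not have that ParentTree at all, and your pages do not contain the StructParents entries that would be looked up in the parent tree. Thus, the specified path from marked content to the structure tree cannot be followed.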

A ParentTree was added in the third shared PDF, new.pdf.
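To make the lookup path concrete, here is a simplified in-memory model of what a viewer has to do for "Find Tag from Selection". It is a sketch under assumptions, not PDFBox code: the number tree is modelled as a plain Map, structure elements as strings, and the names ParentTreeLookup and findStructureElement are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Simplified model of the lookup described in ISO 32000-1, section 14.7.4.4. */
final class ParentTreeLookup {

    /**
     * @param structParents value of the page's StructParents entry, or null if the page has none
     * @param parentTree    the ParentTree number tree, modelled as key -> array of structure elements
     * @param mcid          marked-content identifier of the selected content
     * @return the owning structure element, or null if the path cannot be followed
     */
    static String findStructureElement(Integer structParents,
                                       Map<Integer, List<String>> parentTree,
                                       int mcid) {
        if (structParents == null || parentTree == null) {
            return null; // missing StructParents or missing ParentTree: nothing to look up
        }
        List<String> elements = parentTree.get(structParents);
        if (elements == null || mcid < 0 || mcid >= elements.size()) {
            return null; // no array for this page, or no entry for this MCID
        }
        return elements.get(mcid); // the MCID is a zero-based index into the array
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> tree = Map.of(0, List.of("H2", "P"));
        System.out.println(findStructureElement(0, tree, 1));    // prints P
        System.out.println(findStructureElement(null, tree, 1)); // prints null
    }
}
```

With either the ParentTree or the page's StructParents entry missing, the method has nothing to return, which corresponds to the viewer being unable to find a tag for the selection.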

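Incorrect ParentTree entries

While you do have a ParentTree in new.pdf, its contents are obviously incorrect:

The ParentTree is a number tree, i.e. integers are mapped to something here, so obviously there cannot be multiple entries for the same integer key.

Furthermore, looking at one of the values:

One sees that you claim the following StructElem to be the value for all marked-content IDs:

Inspecting this StructElem further, one sees that it represents the last paragraph on the last page.

Thus, your observation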

Now instead of "selection not found " it is highlighting the last <P> tag in parent tree. Irrespective of what what we selected.

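is to be expected, as far as any sensible behavior can be expected at all with a ParentTree structure that badly broken.

Actually, not only new.pdf but also res.pdf and "tagged without altext.pdf" have a ParentTree, but all these ParentTrees are broken in the same way as the tree of new.pdf.

You might want to start inspecting the structures you create when analyzing unwanted behavior.

Another issue with the parent tree entries

The issue in the parent tree described above has been fixed in the meantime: different pages now have different StructParents values, and the parent tree arrays now reference different structure elements for different MCIDs.

For some documents, though, a different error now occurs, e.g. for "res29_08_19.pdf". Here the parent tree starts like this:

In particular, the first entry in the array is for MCID 3, the second one for MCID 4, ...

This is invalid because, according to the specification,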

The array element corresponding to each sequence shall be found by using the sequence’s marked-content identifier as a zero-based index into the array.

(ISO 32000-1, section 14.7.4.4 "Finding Structure Elements from Content Items")

Thus, the first entry must be for MCID 0, the second one for MCID 1, ...

You objected in a comment

No I used 0 and 1 Mcid's for Artifacts.

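But as a corollary of the above: Don't give MCIDs to marked-content sequences for which you don't have a structure element! MCIDs are there to establish the connection between the structure hierarchy and the content streams. If you mark a piece of content without a structure element, don't give it an MCID.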
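One way to keep a page's parent tree array aligned with its MCIDs is to build the array from an explicit MCID-to-element mapping and to refuse any mapping that does not start at 0 or has gaps. The sketch below is plain Java with hypothetical names (ParentTreeArrayBuilder, buildArray) and strings standing in for structure elements. Rejecting gaps outright is a choice of this sketch, in line with the advice above not to hand out MCIDs to artifacts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Builds the per-page parent tree array so that the array index equals the MCID. */
final class ParentTreeArrayBuilder {

    /**
     * @param elementByMcid structure element for each MCID used on the page
     * @return array in which position i holds the structure element for MCID i
     * @throws IllegalArgumentException if the MCIDs are not exactly 0, 1, ..., n-1
     */
    static List<String> buildArray(Map<Integer, String> elementByMcid) {
        List<String> array = new ArrayList<>();
        // TreeMap iterates in ascending MCID order
        for (Map.Entry<Integer, String> entry : new TreeMap<>(elementByMcid).entrySet()) {
            if (entry.getKey() != array.size()) {
                throw new IllegalArgumentException("expected MCID " + array.size()
                        + " but found MCID " + entry.getKey());
            }
            array.add(entry.getValue());
        }
        return array;
    }

    public static void main(String[] args) {
        System.out.println(buildArray(Map.of(0, "H2", 1, "P"))); // prints [H2, P]
        try {
            buildArray(Map.of(3, "H2", 4, "P")); // MCIDs 0 to 2 were used up elsewhere
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints expected MCID 0 but found MCID 3
        }
    }
}
```

In the tagging code this would mean incrementing the MCID counter only for sequences that are actually attached to a structure element, and writing artifact sequences without an MCID.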

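Yet another issue with the parent tree entries

You again reported an issue, this time with your newest file, mathpdf.pdf. There indeed is an issue; Adobe Acrobat Preflight reports a list of inconsistent parent tree mappings for 5 pages, like this:

In contrast to the previous issues, the cause cannot be determined by looking at the parent tree alone; one also has to look at the structure hierarchy.

While doing so, one peculiarity immediately catches the eye: in your parent tree you do not reference the actual parent structure element of the MCID but a new structure tree node, which claims to have the actual parent node in the structure hierarchy as its own parent (without actually being one of that node's kids) and also claims to have the MCID in question as a kid.

For example, let's look at MCID 0 on the first page. In the structure hierarchy you have:

In the parent tree you have:

You should reference object 238 (the structure hierarchy parent of MCID 0) directly from the parent tree array of the first page, not the in-between object 62, which claims object 238 as its parent and MCID 0 as its kid.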

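The reported inconsistency is probably due to the node referenced from the parent tree (in object 62) claiming to be a P (paragraph) while its claimed parent node (in object 238) is a Span. This is not allowed: a paragraph may contain a span, but not the other way around.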
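The mismatch described here can be expressed as a small consistency check: a node referenced from the parent tree for some MCID should list that MCID among its kids and should itself be listed among the kids of the node it names as parent. The following is a rough sketch on a toy model, not on PDFBox objects; the StructNode class and its members are invented for illustration, and the labels only mirror the object numbers mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a structure tree node with a parent link and a kids list. */
final class StructNode {
    final String label;
    final StructNode parent;                      // models the node's parent entry
    final List<Object> kids = new ArrayList<>();  // models the kids: StructNode or Integer (MCID)

    StructNode(String label, StructNode parent) {
        this.label = label;
        this.parent = parent;
    }

    /** True if this node lists the MCID as a kid and is itself listed by its claimed parent. */
    boolean isValidParentTreeTarget(int mcid) {
        boolean ownsMcid = kids.contains(Integer.valueOf(mcid));
        boolean linkedFromParent = parent != null && parent.kids.contains(this);
        return ownsMcid && linkedFromParent;
    }

    public static void main(String[] args) {
        StructNode root = new StructNode("root", null);
        StructNode actual = new StructNode("object 238", root);
        root.kids.add(actual);
        actual.kids.add(Integer.valueOf(0));

        // An extra node that claims object 238 as parent but was never added to its kids
        StructNode inBetween = new StructNode("object 62", actual);
        inBetween.kids.add(Integer.valueOf(0));

        System.out.println(actual.isValidParentTreeTarget(0));    // prints true
        System.out.println(inBetween.isValidParentTreeTarget(0)); // prints false
    }
}
```

Running a check of this kind over every parent tree array entry before saving would point at the entries that reference such in-between nodes.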
Tags: java, pdf, itext, pdfbox
