How do I locate the same string of text across different revisions of the same text (an ebook)?

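I have highlighted a string of text in an ebook. Every few years a new revised edition of the ebook is released. I would like to programmatically relocate this highlight in each of those newer editions. How would I go about this? (Assume I can still read the original ebook in which the highlight was made.)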


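The data structures are shown below. A `loc` is simply a character index into the entire text of the book, laid out as a single string. `toc` is the table of contents.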

// a single highlight
{
  "start_loc": 5000,
  "end_loc": 5044,
  "end_loc_of_book": 10000,
  "highlighted_text": "The quick brown fox jumps over the lazy dog.",
  "toc_path": ["Chapter 5: Animal Relationships", "Foxes and dogs"],
}

// an ebook
{
  "toc": [
    {
      "heading_title": "Chapter 1: All work and no play makes Jack a dull boy",
      "heading_start_loc": 0,
      "heading_end_loc": 2000,
      // each heading can have nested subheadings within
      // the range of its start_loc and end_loc
      "subheadings": [
        {
          "heading_title": "Jack is still a dull boy",
          "heading_start_loc": 300,
          "heading_end_loc": 500,
          // each heading can have nested subheadings within
          // the range of its start_loc and end_loc
          "subheadings": []
        },
        // ...
      ]
    },
    // ...
    {
      "heading_title": "Chapter 5: Animal Relationships",
      "heading_start_loc": 4000,
      "heading_end_loc": 6000,
      "subheadings": [
        {
          "heading_title": "Foxes and dogs",
          "heading_start_loc": 4500,
          "heading_end_loc": 5500,
          "subheadings": []
        },
        // ...
      ]
    },
    // ...
  ],
  "full_book_text": "Lorem ipsum dolor sit amet, consectetur
adipiscing elit, sed do eiusmod tempor incididunt ut labore
et dolore magna aliqua. In fermentum et sollicitudin ac 
orci phasellus.

...

The quick brown fox jumps over the lazy dog.

...

Praesent semper feugiat nibh sed pulvinar proin. Augue 
eget arcu dictum varius duis at consectetur lorem donec.
Adipiscing elit duis tristique sollicitudin."
}

The solution to this problem is fuzzy anchoring, as detailed by Hypothesis (hypothes.is).

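The gist is to save several selectors that do not depend on the document's structure, then use an approximate matching strategy to make an educated guess at where the highlight sits in the new document.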

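The saved selectors include:

  1. An XPath selector pointing to the element in the original document
  2. The start and end offsets relative to the full text of the original document
  3. The 32 characters immediately before the original highlight (the prefix) and the 32 characters immediately after it (the suffix)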
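
As a rough illustration of how selectors 2 and 3 could be combined, here is a minimal Python sketch. It is not Hypothesis's implementation: the names `Anchor`, `make_anchor`, `reanchor`, `CONTEXT` and `min_score` are hypothetical, the XPath selector is left out because the data in the question is plain text rather than a DOM, and the standard-library `difflib` stands in for whatever approximate matcher a real system would use.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Tuple

CONTEXT = 32  # characters of context kept on each side of the highlight


@dataclass
class Anchor:
    start: int   # offset selector: start_loc in the original text
    end: int     # offset selector: end_loc in the original text
    exact: str   # the highlighted text itself
    prefix: str  # up to CONTEXT characters before the highlight
    suffix: str  # up to CONTEXT characters after the highlight


def make_anchor(text: str, start: int, end: int) -> Anchor:
    """Capture the selectors from the edition the highlight was made in."""
    return Anchor(start, end, text[start:end],
                  text[max(0, start - CONTEXT):start],
                  text[end:end + CONTEXT])


def _similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; two empty strings count as identical."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def reanchor(anchor: Anchor, new_text: str,
             min_score: float = 0.6) -> Optional[Tuple[int, int]]:
    """Guess (start, end) of the highlight in new_text, or None."""
    n = len(anchor.exact)

    # 1. Cheapest case: the old offsets still point at the same text.
    if new_text[anchor.start:anchor.end] == anchor.exact:
        return anchor.start, anchor.end

    # 2. The text survives verbatim but has moved. If it occurs more
    #    than once, pick the occurrence whose context matches best.
    candidates = []
    pos = new_text.find(anchor.exact)
    while pos != -1:
        candidates.append(pos)
        pos = new_text.find(anchor.exact, pos + 1)
    if candidates:
        def context_score(p: int) -> float:
            before = new_text[max(0, p - CONTEXT):p]
            after = new_text[p + n:p + n + CONTEXT]
            return (_similarity(before, anchor.prefix)
                    + _similarity(after, anchor.suffix))
        best = max(candidates, key=context_score)
        return best, best + n

    # 3. The text itself was edited: slide a window of the same length
    #    over the new text and keep the most similar one. This is slow
    #    (one comparison per character) and assumes the length barely
    #    changed, which is acceptable for a sketch only.
    matcher = SequenceMatcher(None, "", anchor.exact, autojunk=False)
    best_pos, best_score = -1, 0.0
    for p in range(max(1, len(new_text) - n + 1)):
        matcher.set_seq1(new_text[p:p + n])
        score = matcher.ratio()
        if score > best_score:
            best_pos, best_score = p, score
    if best_score >= min_score:
        return best_pos, best_pos + n
    return None  # too different: report the highlight as orphaned
```

With the structures from the question, usage would look something like `anchor = make_anchor(old_book["full_book_text"], 5000, 5044)` followed by `reanchor(anchor, new_book["full_book_text"])`. A `None` result means no candidate was similar enough, in which case it is safer to flag the highlight for manual review than to guess. The `toc_path` could also be used to restrict the search to the matching heading's range in the new edition before falling back to the whole book.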