Extracting text from damaged HTML?

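Even in the book industry, DRM is a plague. Last week I discovered that many of my Kindle annotations had gone missing because the publisher tried to restrict annotations to 10% of the book.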

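I found tools that convert Mobi book files to HTML. I also used the location data (which, fortunately, was not lost) to extract the corresponding chunks of raw HTML. My problem now is that I have a lot of incomplete markup to deal with.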

Example:

></h1><div height="3em"></div> <p height="0em" width="1em" align="justify"><em>A Pocket Mirror for Heroes</em> is a book of stratagems for reaching excellence in a competitive world ruled by appearances and, often, deceit.</p><div height="0em"></div> <p height="0em" width="1em" align="justify">It is a <em>mirror</em> because it reflects &#x201C;the person you are or the one you ought to be.&#x201D; A <em>pocket</em> mirror because its author took the time to be brief. A mirror for <em>heroes</em> because it provides a vivid image of ethical and moral perfection. For the author, a hero is &#x201C;the consummate person, ripe and perfect: accurate in judgment, mature in taste, attentive in listening, wise in sayings, shrewd in deeds, the cente

This happens because Kindle location data only maps to 150-byte chunks of the HTML data, so the chunk boundaries are quite imprecise.

I'd like to clean this up. Does anyone have any suggestions? I'd prefer to use Python if possible.

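Edit: It might also make sense to use a tool that can be given character offsets and works out how to extract clean content from them. Does something like that exist?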

BeautifulSoup can parse malformed HTML, and it is very robust.

>>> from bs4 import BeautifulSoup
>>> html = "<p>Para 1<p>Para 2<blockquote>Quote 1<blockquote>Quote 2"
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.prettify())
<p>
 Para 1
 <p>
  Para 2
  <blockquote>
   Quote 1
   <blockquote>
    Quote 2
   </blockquote>
  </blockquote>
 </p>
</p>
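
For the fragments in the question, which are cut at arbitrary byte offsets, something along these lines might work as a starting point. This is only a rough sketch: the function name `fragment_to_text` and the three regex heuristics are my own assumptions, not anything BeautifulSoup provides. It trims the half tag left at either end of the chunk and a half-finished entity at the end, then lets BeautifulSoup's `get_text()` do the rest.

```python
import re

from bs4 import BeautifulSoup


def fragment_to_text(fragment: str) -> str:
    """Return the readable text of an HTML fragment cut at arbitrary offsets.

    Hypothetical helper: the trimming rules below are heuristics, not a parser.
    """
    # A '>' that appears before any '<' is assumed to be the tail of a tag
    # that was severed at the start of the chunk, e.g. 'n="justify">Hello'.
    fragment = re.sub(r"^[^<>]*>", "", fragment)
    # A '<' with no closing '>' after it is assumed to be the head of a tag
    # that was severed at the end of the chunk, e.g. 'Hello<p heig'.
    fragment = re.sub(r"<[^<>]*$", "", fragment)
    # An entity with no terminating ';' at the very end, e.g. 'said &#x20',
    # is assumed to be cut off and is dropped as well.
    fragment = re.sub(r"&#?\w*$", "", fragment)

    # BeautifulSoup ignores stray closing tags and closes whatever is open.
    soup = BeautifulSoup(fragment, "html.parser")
    # Collapse the runs of whitespace left behind by the removed markup.
    return " ".join(soup.get_text().split())


if __name__ == "__main__":
    chunk = (
        'n="justify">tail of one.</p><div height="0em"></div> '
        '<p align="justify">It is a <em>mirror</em>, '
        "&#x201C;brief&#x201D; and the cente<p heig"
    )
    print(fragment_to_text(chunk))
```

Words that were cut in half at a chunk boundary (like "cente" in your example) stay cut, since nothing in the fragment says how they should end. Text from adjacent paragraphs is also joined without a separator unless the source already had whitespace between them.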