Extracting text from damaged HTML?
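DRM is a plague, even in the book industry. Last week I discovered that many of my Kindle annotations had disappeared, because the publisher tries to limit annotations to 10% of the book.
I found tools that convert Mobi book files to HTML. I also used the location data (which, thankfully, was not lost) to extract the corresponding raw HTML chunks. My problem now is that I have a lot of incomplete markup to deal with.
Example: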
></h1><div height="3em"></div> <p height="0em" width="1em" align="justify"><em>A Pocket Mirror for Heroes</em> is a book of stratagems for reaching excellence in a competitive world ruled by appearances and, often, deceit.</p><div height="0em"></div> <p height="0em" width="1em" align="justify">It is a <em>mirror</em> because it reflects “the person you are or the one you ought to be.” A <em>pocket</em> mirror because its author took the time to be brief. A mirror for <em>heroes</em> because it provides a vivid image of ethical and moral perfection. For the author, a hero is “the consummate person, ripe and perfect: accurate in judgment, mature in taste, attentive in listening, wise in sayings, shrewd in deeds, the cente
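This happens because the Kindle's location data maps only to 150-byte blocks of the HTML data, so the chunk boundaries are quite imprecise.
I'd like to clean this up. Does anyone have suggestions? I'd prefer to use Python if possible.
Edit: It might also make sense to use a tool that can be given a character offset and will work out how to extract clean content from there. Does such a thing exist?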
BeautifulSoup can parse malformed HTML, and it is very robust.
>>> from bs4 import BeautifulSoup
>>> html = "<p>Para 1<p>Para 2<blockquote>Quote 1<blockquote>Quote 2"
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.prettify())
<p>
 Para 1
 <p>
  Para 2
  <blockquote>
   Quote 1
   <blockquote>
    Quote 2
   </blockquote>
  </blockquote>
 </p>
</p>
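For chunks cut at arbitrary byte offsets, like the one in the question, the pieces of a tag left dangling at either end may be worth trimming before parsing. Below is a minimal sketch of that idea. It uses the standard library's html.parser rather than BeautifulSoup, so that it runs without extra installs. The names TextExtractor, trim_partial_tags, and chunk_to_text are made up for illustration, and the trimming assumes that a literal ">" or "<" never appears in the text itself (that is, they are escaped as entities), which may not hold for every book.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data and ignores every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def trim_partial_tags(chunk):
    """Drop tag fragments cut off at the chunk boundaries.

    Assumption: '<' and '>' occur only as tag delimiters, never as text.
    """
    first_gt = chunk.find(">")
    first_lt = chunk.find("<")
    # A '>' before any '<' means the chunk starts inside a tag.
    if first_gt != -1 and (first_lt == -1 or first_gt < first_lt):
        chunk = chunk[first_gt + 1:]
    # A final '<' with no '>' after it means the chunk ends inside a tag.
    last_lt = chunk.rfind("<")
    if last_lt != -1 and chunk.find(">", last_lt) == -1:
        chunk = chunk[:last_lt]
    return chunk


def chunk_to_text(chunk):
    """Return the plain text of a possibly truncated HTML chunk."""
    parser = TextExtractor()
    parser.feed(trim_partial_tags(chunk))
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace left behind by removed markup.
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


sample = ('></h1><div height="3em"></div> <p align="justify">'
          '<em>A Pocket Mirror</em> is a book.</p><p hei')
print(chunk_to_text(sample))  # A Pocket Mirror is a book.
```

The same trimming step could be applied before handing the chunk to BeautifulSoup and calling get_text() on the result. A word cut off mid-chunk (such as "the cente" in the example) stays truncated either way, since the missing bytes are simply not in the chunk.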