Remove Duplicate Substrings/Elements from Scraped HTML?

Question

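I scraped a large pile of HTML out of a Kindle e-book, and it is full of duplicate elements and duplicate substrings.

Long story short: Kindle DRM deleted 90% of my annotations, and I recovered all of them using the location data it left intact. But Amazon's location data is somewhat imprecise (each location corresponds to a 150-byte chunk), so I ended up with a lot of redundancy.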

Example:

<html>
 <body>
  <p>
   aesar”), at the Battle of Pavia (1525).
  </p>
  <div height="0em">
  </div>
  <mbp:pagebreak>
  </mbp:pagebreak>
  <a id="filepos97755">
  </a>
  <h1 align="center" height="2em">
   <font size="5">
    <b>
     KNOW WHEN
     <br/>
     TO RETIRE
    </b>
   </font>
  </h1>
  <div height="3em">
  </div>
  <p align="justify" height="0em" width="1em">
  </p>
 </body>
</html>

<html>
 <body>
  <h1 align="center" height="2em">
   <font size="5">
    <b>
     KNOW WHEN
     <br/>
     TO RETIRE
    </b>
   </font>
  </h1>
  <div height="3em">
  </div>
  <p align="justify" height="0em" width="1em">
   Anything in motion must wax and wane. Some speak of states of movement, but they are anything but static.
  </p>
  <div height="0em">
  </div>
  <p height="0em">
  </p>
 </body>
</html>



<html>
 <body>
  <p align="justify" height="0em" width="1em">
   Anything in motion must wax and wane. Some speak of states of movement, but they are anything but static.
  </p>
  <div height="0em">
  </div>
  <p align="justify" height="0em" width="1em">
   It takes great foresight to predict the decline of a restless, relentless wheel. The sharpest gamblers know when to quit
  </p>
 </body>
</html>

Does anyone know of anything that might help?

Answer 1

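Man, what a mess. From the small piece of output you show, the important content seems to be inside the paragraph tags. I would use the Python library Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/bs4/doc/) to extract all the text from the <p> tags and then remove the redundant parts. If you also want to preserve the other formatting, it becomes much more of a pain. I'll try it with Beautiful Soup when I get back, but I'm fairly sure I won't be able to export it in a better format.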
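
A minimal sketch of that idea follows. To keep it runnable without extra installs it uses the standard library's `html.parser` in place of Beautiful Soup; with Beautiful Soup, `soup.find_all("p")` would do the extraction step instead. The names `ParagraphCollector` and `unique_paragraphs` are made up for this illustration, and the sketch assumes, as above, that only the text inside `<p>` tags matters. It drops exact repeats and fragments that are fully contained in a longer paragraph; it does not stitch together two fragments that merely overlap.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the whitespace-normalised text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not currently inside a <p>"

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse runs of whitespace left over from the pretty-printing.
            text = " ".join("".join(self._buffer).split())
            if text:  # skip empty spacer paragraphs
                self.paragraphs.append(text)
            self._buffer = None


def unique_paragraphs(html):
    """Return paragraph texts in order, without repeats or contained fragments."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()

    kept = []
    for text in parser.paragraphs:
        # Already covered by an identical or longer paragraph: skip it.
        if any(text in other for other in kept):
            continue
        # This paragraph is longer than a fragment seen earlier: replace it.
        kept = [other for other in kept if other not in text]
        kept.append(text)
    return kept


# Two overlapping chunks, shortened from the question's example.
sample = """
<html><body>
  <p>Anything in motion must wax and wane.</p>
  <p height="0em"></p>
</body></html>
<html><body>
  <p>Anything in motion must wax and wane.</p>
  <p>The sharpest gamblers know when to quit</p>
</body></html>
"""

for paragraph in unique_paragraphs(sample):
    print(paragraph)
```

The substring test compares every paragraph against every kept one, which is fine for a single book but grows quadratically. Headings such as the repeated `<h1>` are ignored here and would need the same treatment if they should survive.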

html

python

redundancy

parsing

screen-scraping