Remove Duplicate Substrings/Elements from Scraped HTML?
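I scraped a big pile of HTML out of a Kindle ebook, and it has a lot of repeated elements and repeated substrings.
Long story short, Kindle DRM deleted 90% of my annotations, and I recovered them all using the location data it did not delete. But Amazon's location data is somewhat imprecise (it corresponds to 150-byte chunks), so I ended up with a lot of redundancy.
Example: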
<html>
<body>
<p>
aesar”), at the Battle of Pavia (1525).
</p>
<div height="0em">
</div>
<mbp:pagebreak>
</mbp:pagebreak>
<a id="filepos97755">
</a>
<h1 align="center" height="2em">
<font size="5">
<b>
KNOW WHEN
<br/>
TO RETIRE
</b>
</font>
</h1>
<div height="3em">
</div>
<p align="justify" height="0em" width="1em">
</p>
</body>
</html>
<html>
<body>
<h1 align="center" height="2em">
<font size="5">
<b>
KNOW WHEN
<br/>
TO RETIRE
</b>
</font>
</h1>
<div height="3em">
</div>
<p align="justify" height="0em" width="1em">
Anything in motion must wax and wane. Some speak of states of movement, but they are anything but static.
</p>
<div height="0em">
</div>
<p height="0em">
</p>
</body>
</html>
<html>
<body>
<p align="justify" height="0em" width="1em">
Anything in motion must wax and wane. Some speak of states of movement, but they are anything but static.
</p>
<div height="0em">
</div>
<p align="justify" height="0em" width="1em">
It takes great foresight to predict the decline of a restless, relentless wheel. The sharpest gamblers know when to quit
</p>
</body>
</html>
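Does anyone know what might help?
Holy cow, what a mess. From the small portion of output you have shown, the important content appears to be in the paragraph tags. I would use Beautiful Soup for Python (http://www.crummy.com/software/BeautifulSoup/bs4/doc/) to pull all the text out of the <p> tags and then remove the redundant text. If you also want to preserve the other formatting, that would be more of a hassle.
I'll try Beautiful Soup when I get back, but I'm sure I can't export it in a better format.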
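For what it's worth, here is a rough sketch of the "pull the paragraphs, then drop the repeats" idea. It is not Beautiful Soup: it uses the standard library's html.parser as a stand-in so it runs without installing anything, and the names ParagraphCollector and unique_paragraphs are made up for this example. It assumes the text you care about is only in <p> tags and that duplicates are exact matches after whitespace is collapsed, so it will not merge paragraphs that only partially overlap.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not currently inside a <p>"

    def _flush(self):
        # Collapse runs of whitespace and keep only non-empty paragraphs.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # tolerate a <p> that was never closed
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def unique_paragraphs(html):
    """Return paragraph texts with exact repeats removed, order preserved."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()
    seen = set()
    result = []
    for text in parser.paragraphs:
        if text not in seen:
            seen.add(text)
            result.append(text)
    return result


if __name__ == "__main__":
    scraped = """
    <html><body><p align="justify">Anything in motion must wax and wane.</p>
    <p height="0em"></p></body></html>
    <html><body><p align="justify">Anything in motion must wax and wane.</p>
    <p>It takes great foresight to predict the decline.</p></body></html>
    """
    for paragraph in unique_paragraphs(scraped):
        print(paragraph)
```

With Beautiful Soup installed, soup.find_all("p") plus get_text() would take the place of the parser class. Truncated fragments like the "aesar”), at the Battle of Pavia (1525)." paragraph would still need a separate substring or overlap check.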