Splitting up text with BeautifulSoup by certain HTML structures
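I am trying to split some HTML according to a specific pattern.
A particular part of the HTML has to be divided into one or more sections, or arrays of text. The way I can divide this HTML is by looking for a <strong> at the start and a double <br /> at the end. All the text between these two markers has to be put into a list and iterated over.
How can this be solved easily?
So I want the following HTML: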
<div class="clearfix">
<!--# of ppl associated with place-->
This is some kind of buzzword:<br />
<br />
<!--Persontype-->
<strong>Jimbo</strong> Jack <br />
Some filler text <br />
More weird stuff
<br />
Unstructured text <br />
<br />
<strong>Jacky</strong> Bradson <br />
This is just a test <br />
Nothing but a test
<br />
More unstructured stuff <br />
<br />
<strong>Junior</strong> Bossman <br />
This is fluffy
<br />
As I would expect <br />
<br />
</div>
to be split into the following sections.
First section:
<strong>Jimbo</strong> Jack <br />
Some filler text <br />
More weird stuff
<br />
Unstructured text <br />
<br />
Second section:
<strong>Jacky</strong> Bradson <br />
This is just a test <br />
Nothing but a test
<br />
More unstructured stuff <br />
<br />
Third section:
<strong>Junior</strong> Bossman <br />
This is fluffy
<br />
As I would expect <br />
<br />
</div>
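A basic solution is to use prettify and split: convert the parsed document back to text, then cut out the parts of interest.
from bs4 import BeautifulSoup

# Parse the HTML string; naming the parser explicitly avoids a bs4 warning
soup = BeautifulSoup(text, 'html.parser')
# Keep everything after the marker comment, then cut at each <strong>
chunks = soup.prettify().split('<!--Persontype-->')[1].split('<strong>')
for chunk in chunks[1:]:  # chunks[0] is only the whitespace before the first <strong>
    print('<strong>' + chunk)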
where text is:
text= '''
<div class="clearfix">
<!--# of ppl associated with place-->
This is some kind of buzzword:<br />
<br />
<!--Persontype-->
<strong>Jimbo</strong> Jack <br />
Some filler text <br />
More wierd stuff
<br />
Unstructured text <br />
<br />
<strong>Jacky</strong> Bradson <br />
This is just a test <br />
Nothing but a test
<br />
More unstructured stuff <br />
<br />
<strong>Junior</strong> Bossman <br />
This is fluffy
<br />
As i would expect <br />
<br />
</div>
'''
The output is:
Jimbo Jack
Some filler text
More wierd stuff
Unstructured text
Jacky Bradson
This is just a test
Nothing but a test
More unstructured stuff
Junior Bossman
This is fluffy
As i would expect
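As an alternative to splitting the prettified string, the same sections can be collected by walking the parse tree. The following is only a sketch under stated assumptions, not part of the answer above: `split_sections` and `sample_html` are hypothetical names, each section is assumed to start at a <strong> tag, and a section is assumed to end at the first <br /> that directly follows another <br /> (or at the next <strong>, whichever comes first).

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag


def split_sections(html):
    """Return one list of text lines per <strong>-led section (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for strong in soup.find_all("strong"):
        lines = []
        # Text pieces of the line currently being assembled
        current = [strong.get_text(strip=True)]
        for node in strong.next_siblings:
            if isinstance(node, Tag):
                if node.name == "strong":
                    break  # the next section starts here
                if node.name == "br":
                    if not current:
                        break  # second <br> in a row: the section is over
                    lines.append(" ".join(current))
                    current = []
                # any other tag is ignored in this sketch
            elif isinstance(node, Comment):
                continue  # skip HTML comments such as <!--Persontype-->
            elif isinstance(node, NavigableString):
                piece = node.strip()
                if piece:
                    current.append(piece)
        if current:
            lines.append(" ".join(current))
        sections.append(lines)
    return sections


# Shortened sample in the same shape as the HTML from the question
sample_html = """
<div class="clearfix">
<!--Persontype-->
<strong>Jimbo</strong> Jack <br />
Some filler text <br />
<br />
<strong>Jacky</strong> Bradson <br />
This is just a test <br />
<br />
</div>
"""

sections = split_sections(sample_html)
for section in sections:
    print(section)
```

Each section comes back as a list of lines, so it can be iterated directly, and the result does not depend on how prettify() lays out whitespace.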