Attempting a Nested Scrape Using BeautifulSoup
My code is as follows:
<h1><a name="hello">Hello</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>My Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
<div class="box box_2">
<h4><a>Your Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
</div>
<h1 name="goodbye"><a>Goodbye</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>Their Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
<div class="box box_2">
<h4><a>Our Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
</div>
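I am not looping over the code correctly and don't know how to iterate, because I keep getting all the values lumped together. Can someone point me in the right direction? I tried the findNext(), nextSibling(), and findAll() methods, but I failed.
The output I want is: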
Hello : My Favorite Number is : 1
Hello : My Favorite Number is : 2
Hello : My Favorite Number is : 3
Hello : My Favorite Number is : 4
Hello : Your Favorite Number is : 1
Hello : Your Favorite Number is : 2
Hello : Your Favorite Number is : 3
Hello : Your Favorite Number is : 4
Goodbye: Their Favorite Number is: 1
Goodbye: Their Favorite Number is: 2
Goodbye: Their Favorite Number is: 3
Goodbye: Their Favorite Number is: 4
Goodbye: Our Favorite Number is: 1
Goodbye: Our Favorite Number is: 2
Goodbye: Our Favorite Number is: 3
Goodbye: Our Favorite Number is: 4
If you are having trouble with nextSibling, it is because your html actually looks like this:
<h1><a name="hello">Hello</a></h1>\n #<---newline
<div class="colmask">
See the newline after </h1>? Even though a newline is invisible, it is still considered text, so it becomes a BeautifulSoup element (a NavigableString), and it is the nextSibling of the <h1> tag.
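A quick way to see this for yourself is to inspect the sibling directly. The sketch below assumes the newer bs4 package (imported as `from bs4 import BeautifulSoup`) and its built-in "html.parser"; the html string is a shortened, made-up version of the markup above.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened sample: note the "\n" between </h1> and <div>
html = '<h1><a name="hello">Hello</a></h1>\n<div class="colmask"></div>'
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1")
sibling = h1.next_sibling  # bs4 spelling of nextSibling

# The sibling is the newline text node, not the <div>
is_text_node = isinstance(sibling, NavigableString)
print(repr(str(sibling)), is_text_node)
```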
Newlines can also cause problems when you try to get, for example, the third child <div> below:
<div>
   <div>hello</div>
   <div>world</div>
   <div>goodbye</div>
</div>
Here is how the children are numbered:
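<div>\n                  #<---newline plus spaces at start of next line = child 0
   <div>hello</div>\n    #<---newline plus spaces at start of next line = child 2
   <div>world</div>\n    #<---newline plus spaces at start of next line = child 4
   <div>goodbye</div>\n  #<---newline = child 6
</div>
The divs are actually children number 1, 3, and 5. If you are having trouble parsing html, then very often it is because the newlines at the end of each line are tripping you up. You always have to account for newlines and factor them into your thinking about where things are located.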
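The numbering can be checked by listing the children of the outer tag. This is a minimal sketch assuming the bs4 package and "html.parser"; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup, Tag

html = "<div>\n  <div>hello</div>\n  <div>world</div>\n  <div>goodbye</div>\n</div>"
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div")

# Print every child, text nodes included, with its index
for index, child in enumerate(outer.contents):
    kind = "tag" if isinstance(child, Tag) else "text"
    print(index, kind, repr(str(child)))

# Indexes that hold real tags rather than whitespace
tag_positions = [i for i, c in enumerate(outer.contents) if isinstance(c, Tag)]

# Asking for direct child tags by name sidesteps the counting entirely
third_div = outer.find_all("div", recursive=False)[2]
```

Here `recursive=False` limits the search to direct children, so index 2 is the third <div> no matter how much whitespace sits between the tags.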
To get the <div> tag here:
<h1><a name="hello">Hello</a></h1>\n #<---newline
<div class="colmask">
...you can write:
h1.nextSibling.nextSibling
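But to skip over all the whitespace between tags, it is easier to use findNextSibling(), which lets you specify the tag name of the next sibling you want to locate: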
findNextSibling('div')
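To compare the two approaches side by side, here is a small sketch. It assumes the bs4 package, where the same methods are spelled `next_sibling` and `find_next_sibling()`; the html string is a shortened stand-in for the real page.

```python
from bs4 import BeautifulSoup

html = '<h1><a name="hello">Hello</a></h1>\n<div class="colmask"></div>'
soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

# Hop over the newline text node by hand...
by_hopping = h1.next_sibling.next_sibling

# ...or ask for the next sibling that is a <div>
by_name = h1.find_next_sibling("div")

same_tag = by_hopping is by_name
print(by_name["class"], same_tag)
```

The hand-counted version breaks as soon as the whitespace between the tags changes, while the name-based lookup does not depend on it.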
Here is an example:
# BeautifulSoup 3 import; with the newer bs4 package use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

with open('data2.txt') as f:
    html = f.read()

soup = BeautifulSoup(html)

for h1 in soup.findAll('h1'):
    # The <div class="colmask"> after each <h1>, skipping whitespace nodes
    colmask_div = h1.findNextSibling('div')

    for box_div in colmask_div.findAll('div'):
        h4 = box_div.find('h4')

        for ul in box_div.findAll('ul'):
            print('{} : {} : {}'.format(h1.text, h4.text, ul.li.a.text))
--output:--
Hello : My Favorite Number is : 1
Hello : My Favorite Number is : 2
Hello : My Favorite Number is : 3
Hello : My Favorite Number is : 4
Hello : Your Favorite Number is : 1
Hello : Your Favorite Number is : 2
Hello : Your Favorite Number is : 3
Hello : Your Favorite Number is : 4
Goodbye : Their Favorite Number is : 1
Goodbye : Their Favorite Number is : 2
Goodbye : Their Favorite Number is : 3
Goodbye : Their Favorite Number is : 4
Goodbye : Our Favorite Number is : 1
Goodbye : Our Favorite Number is : 2
Goodbye : Our Favorite Number is : 3
Goodbye : Our Favorite Number is : 4
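If you would rather collect the values than print them, the same loop can be wrapped in a function. This is only a sketch under assumptions: it uses the bs4 package with "html.parser", `collect_rows` is a made-up helper name, and `sample` is a cut-down copy of the html from the question.

```python
from bs4 import BeautifulSoup


def collect_rows(html):
    """Return (h1 text, h4 text, number) tuples, one per <ul>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for h1 in soup.find_all("h1"):
        colmask_div = h1.find_next_sibling("div")
        if colmask_div is None:
            continue  # an <h1> with no following <div>
        for box_div in colmask_div.find_all("div"):
            h4 = box_div.find("h4")
            for ul in box_div.find_all("ul"):
                rows.append((h1.get_text(), h4.get_text(), ul.li.a.get_text()))
    return rows


sample = """<h1><a name="hello">Hello</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>My Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
</div>
</div>
"""

rows = collect_rows(sample)
for row in rows:
    print(" : ".join(row))
```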