Parsing and matching a weird HTML structure using Python
First, I want to parse this HTML:
<body>
<div id="contents">
<div id="links">
<a href='url1'>link-1</a></div>
<div id="info">
<p>apple</p></div>
<div id="links">
<a href='url2'>link-2</a>
<a href='url3'>link-3</a></div>
<div id="info">
<p>bear</p></div>
<div id="links">
<a href='url4'>link-4</a></div>
<div id="info">
<p>cat</p></div>
<div id="links">
<a href='url5'>link-5</a>
<a href='url6'>link-6</a>
<a href='url7'>link-7</a></div>
<div id="info">
<p>duck</p></div>
<div id="links">
<a href='url8'>link-8</a></div>
<div id="info">
<p>egg</p></div>
#etc
</div>
</body>
My goal here is to grab all the links and info and match them up. However, there are 8 links but only 5 info entries, so the mapping is not obvious.
from urllib.parse import urlparse

def link_collect(soup):
    tempaddress = []
    link_list = [0] * 10
    d = 0
    linksearch = soup.findAll("a")
    for r in linksearch:
        try:
            if "url" in r['href']:
                tempaddress.append(r['href'])
        except KeyError:
            pass  # skip <a> tags without an href attribute
    for clearing in tempaddress:
        cleared = urlparse(str(clearing))
        clink = cleared.scheme + "://" + cleared.netloc + cleared.path
        link_list[d] = clink
        d = d + 1
    link_list = delete_zero(link_list)  # delete_zero() (defined elsewhere) strips the unused 0 placeholders
    return link_list
def info_collect(soup):
    tempinfo = soup.find_all(id="info")
    info_list = [0] * 10
    d = 0
    for r in tempinfo:
        infodata = r.get_text()
        info_list[d] = infodata
        d = d + 1
    info_list = delete_zero(info_list)
    return info_list
targetpage = "http://address"
opening = urlopen(targetpage)
soup = BeautifulSoup(opening.read())
link = link_collect(soup)
info = info_collect(soup)
for n in range(0, len(info)):
print(str(link[n]) + " = " + str(info[n]))
When I run this, the result is:
url1 = apple
url2 = bear
url3 = cat
url4 = duck
url5 = egg
(urls 6, 7, and 8 are never matched)
But I want a result like this:
url1 = apple
url2 = bear
url3 = bear
url4 = cat
url5 = duck
etc
How can I achieve this?
You need to find all the div elements using the find_all() method. Next, use the zip function to iterate over the (links, info) pairs, and call find_all() again on each links div to grab its a elements:
In [85]: from bs4 import BeautifulSoup
In [86]: soup = BeautifulSoup("""<body>
<div id="contents">
<div id="links">
<a href='url1'>link-1</a></div>
<div id="info">
<p>apple</p></div>
<div id="links">
<a href='url2'>link-2</a>
<a href='url3'>link-3</a></div>
<div id="info">
<p>bear</p></div>
<div id="links">
<a href='url4'>link-4</a></div>
<div id="info">
<p>cat</p></div>
<div id="links">
<a href='url5'>link-5</a>
<a href='url6'>link-6</a>
<a href='url7'>link-7</a></div>
<div id="info">
<p>duck</p></div>
<div id="links">
<a href='url8'>link-8</a></div>
<div id="info">
<p>egg</p></div>
#etc
</div>
</body>""")
In [87]: links = soup.find_all('div', attrs={'id': 'links'})
In [88]: infos = soup.find_all('div', attrs={'id': 'info'})
In [157]: for lk, inf in zip(links, infos):
.....: for tag in lk.find_all('a'):
.....: print(inf.text, tag.attrs['href'])
.....:
apple url1
bear url2
bear url3
cat url4
duck url5
duck url6
duck url7
egg url8
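If you want the output in the exact url = info form from the question, the same pairing loop can print it that way. This is a minimal sketch reusing the soup object built above; note that zip stops at the shorter of the two lists, so a trailing links div without a following info div would be dropped silently:

# Emit one "url = info" line per <a>, pairing each links div
# with the info div that follows it.
for lk, inf in zip(soup.find_all('div', attrs={'id': 'links'}),
                   soup.find_all('div', attrs={'id': 'info'})):
    label = inf.get_text(strip=True)  # e.g. "apple"
    for tag in lk.find_all('a'):
        print(tag.attrs['href'] + " = " + label)

which prints url1 = apple, url2 = bear, url3 = bear, url4 = cat, and so on.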