Parsing and matching a weird HTML structure using Python
First, I want to parse this HTML:
<body>
<div id="contents">
<div id="links">
<a href='url1'>link-1</a></div>
<div id="info">
<p>apple</p></div>
<div id="links">
<a href='url2'>link-2</a>
<a href='url3'>link-3</a></div>
<div id="info">
<p>bear</p></div>
<div id="links">
<a href='url4'>link-4</a></div>
<div id="info">
<p>cat</p></div>
<div id="links">
<a href='url5'>link-5</a>
<a href='url6'>link-6</a>
<a href='url7'>link-7</a></div>
<div id="info">
<p>duck</p></div>
<div id="links">
<a href='url8'>link-8</a></div>
<div id="info">
<p>egg</p></div>
#etc
</div>
</body>
My goal here is to grab all the links and info and match them up. However, there are 8 links but only 5 info entries, so the mapping is not obvious.
from urllib.parse import urlparse

def link_collect(soup):
    tempaddress = []
    link_list = [0] * 10
    d = 0
    linksearch = soup.findAll("a")
    for r in linksearch:
        try:
            if "url" in r['href']:
                tempaddress.append(r['href'])
        except KeyError:
            pass  # skip <a> tags without an href attribute
    for clearing in tempaddress:
        cleared = urlparse(str(clearing))
        clink = cleared.scheme + "://" + cleared.netloc + cleared.path
        link_list[d] = clink
        d = d + 1
    link_list = delete_zero(link_list)  # delete_zero() (defined elsewhere) strips the unused 0 placeholders
    return link_list
def info_collect(soup):
    tempinfo = soup.find_all(id="info")
    info_list = [0] * 10
    d = 0
    for r in tempinfo:
        infodata = r.get_text()
        info_list[d] = infodata
        d = d + 1
    info_list = delete_zero(info_list)
    return info_list
targetpage = "http://address"
opening = urlopen(targetpage)
soup = BeautifulSoup(opening.read())
link = link_collect(soup)
info = info_collect(soup)
for n in range(0, len(info)):
print(str(link[n]) + " = " + str(info[n]))
When I run this, the result is:
url1 = apple
url2 = bear
url3 = cat
url4 = duck
url5 = egg
(urls 6, 7, and 8 are never matched)
But I want a result like this:
url1 = apple
url2 = bear
url3 = bear
url4 = cat
url5 = duck
etc
How can I achieve this?
You need to find all the div elements using the find_all() method. Next, use the zip function to iterate over the (links, info) pairs, and call find_all() again on each links div to grab its a elements:
In [85]: from bs4 import BeautifulSoup
In [86]: soup = BeautifulSoup("""<body>
<div id="contents">
<div id="links">
<a href='url1'>link-1</a></div>
<div id="info">
<p>apple</p></div>
<div id="links">
<a href='url2'>link-2</a>
<a href='url3'>link-3</a></div>
<div id="info">
<p>bear</p></div>
<div id="links">
<a href='url4'>link-4</a></div>
<div id="info">
<p>cat</p></div>
<div id="links">
<a href='url5'>link-5</a>
<a href='url6'>link-6</a>
<a href='url7'>link-7</a></div>
<div id="info">
<p>duck</p></div>
<div id="links">
<a href='url8'>link-8</a></div>
<div id="info">
<p>egg</p></div>
#etc
</div>
</body>""")
In [87]: links = soup.find_all('div', attrs={'id': 'links'})
In [88]: infos = soup.find_all('div', attrs={'id': 'info'})
In [157]: for lk, inf in zip(links, infos):
.....: for tag in lk.find_all('a'):
.....: print(inf.text, tag.attrs['href'])
.....:
apple url1
bear url2
bear url3
cat url4
duck url5
duck url6
duck url7
egg url8
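If you want the output in the exact url = info form from the question, the same pairing loop can print it that way. This is a minimal sketch reusing the soup object built above; note that zip stops at the shorter of the two lists, so a trailing links div without a following info div would be dropped silently:

# Emit one "url = info" line per <a>, pairing each links div
# with the info div that follows it.
for lk, inf in zip(soup.find_all('div', attrs={'id': 'links'}),
                   soup.find_all('div', attrs={'id': 'info'})):
    label = inf.get_text(strip=True)  # e.g. "apple"
    for tag in lk.find_all('a'):
        print(tag.attrs['href'] + " = " + label)

which prints url1 = apple, url2 = bear, url3 = bear, url4 = cat, and so on.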