Can Beautiful Soup give a chunk of a file based on span class=?

I am trying to extract some simple fields from an HTML page. It is a table with some repeating data.

Every record has a FIRST_NAME (and a bunch of other stuff), but not every record has a website. So my xpath solution returns 10 names but only 9 website urls.

fname= tree.xpath('//span[@class="given-name"]/text()')
fweb = tree.xpath('//a[@class="url"]/text()')

With that approach I cannot tell which record is missing its url.
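Just to illustrate why the two flat lists above cannot simply be zipped together: once a single url is missing, every later name gets paired with the wrong url (a sketch using fname and fweb from above).

for name, url in zip(fname, fweb):
    print name, url   # after the record with no url, the pairing is shifted by one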

So now I want to break the file into chunks; each chunk would start at a span with class GIVEN-NAME and end just before the next GIVEN-NAME.

How do I do that? In my code I have an infinite loop that keeps returning the first instance of the span with class FIRST-NAME and never makes any progress through the HTML file.

with open('sample A.htm') as f:
    soup = bs4.BeautifulSoup(f.read())

    many_names= soup.find_all('span',class_='given-name')
    print len(many_names)

    for i in range(len(many_names)):
        first_name = soup.find('span', class_='given-name').text 
        website = soup.find('a', class_='url').text
        myprint (i, first_name, last_name, aco, city, qm, website)
        soup.find_next('span', class_='given-name')

The last statement (find_next) does not seem to do anything.

With or without it, the loop just reads from the beginning over and over again. What is the right way to do this?

EDIT: here is a sample from the HTML file (I trimmed it down, because there is a lot more). Physically, the layout is: a span given-name, blah blah blah, the URL buried somewhere, then another span given-name.

</div>

<div class="connections-list cn-list-body cn-clear" id="cn-list-body">
<div class="cn-list-section-head" id="cn-char-A"></div><div class="cn-list-row-alternate vcard individual art-literary-agents celebrity-nonfiction-literary-agents chick-lit-fiction-literary-agents commercial-fiction-literary-agents fiction-literary-agents film-entertainment-literary-agents history-nonfiction-literary-agents literary-fiction-literary-agents military-war-literary-agents multicultural-nonfiction-literary-agents multicultural-fiction-literary-agents music-literary-agents new-york-literary-agents-ny nonfiction-literary-agents photography-literary-agents pop-culture-literary-agents religion-nonfiction-literary-agents short-story-collection-literary-agents spirituality-literary-agents sports-nonfiction-literary-agents usa-literary-agents womens-issues-literary-agents" id="richard-abate" data-entry-type="individual" data-entry-id="19337" data-entry-slug="richard-abate"><div id="entry-id-193375501ffd6551a6" class="cn-entry">
    <table border="0px" bordercolor="#E3E3E3" cellspacing="0px" cellpadding="0px">
        <tr>
            <td align="left" width="55%" valign="top">

<span class="cn-image-style"><span style="display: block; max-width: 100%; width: 125px"><img height="125" width="125" sizes="100vw" class="cn-image logo" alt="Logo for Richard Abate" title="Logo for Richard Abate" srcset="http://literaryagencies.com/wp-content/uploads/connections-images/richard-abate/richard-abate-literary-agent_logo_1-7bbdb1a0dbafe8417e994150608c55e4.jpg 1x" /></span></span>
            </td>
            <td align="right" valign="top" style="text-align: right;">

                <div style="clear:both; margin: 5px 5px;">
                    <div style="margin-bottom: 5px;">

<span class="fn n"> <span class="given-name">Richard</span> <span class="family-name">Abate</span> </span>

<span class="title">3 Arts Entertainment</span>

<span class="org"><span class="organization-unit">Query method(s): Postal Mail *</span></span>
                                            </div>


<span class="address-block">
<span class="adr"><span class="address-name">Work</span> <span class="street-address">16 West 22th St</span> <span class="locality">New York</span> <span class="region">NY</span> <span class="postal-code">10010</span> <span class="country-name">USA</span><span class="type" style="display: none;">work</span></span>
</span>

                </div>
            </td>
        </tr>

        <tr>
            <td valign="bottom" style="text-align: left;">

                <a class="cn-note-anchor toggle-div" id="note-anchor-193375501ffd6551a6" href="#" data-uuid="193375501ffd6551a6" data-div-id="note-block-193375501ffd6551a6" data-str-show="Show Notes" data-str-hide="Close Notes">Show Notes</a> | <a class="cn-bio-anchor toggle-div" id="bio-anchor-193375501ffd6551a6" href="#" data-uuid="193375501ffd6551a6" data-div-id="bio-block-193375501ffd6551a6" data-str-show="Show Bio" data-str-hide="Close Bio">Show Bio</a>              
            </td>

            <td align="right" valign="bottom"  style="text-align: right;">

                <a class="url" href="http://www.3arts.com" target="new" rel="nofollow">http://www.3arts.com</a>


<span class="cn-image-style"><span style="display: block; max-width: 100%; width: 125px"><img height="125" width="125" sizes="100vw" class="cn-image logo" alt="Logo for Andree Abecassis" title="Logo for Andree Abecassis" srcset="http://literaryagencies.com/wp-content/uploads/connections-images/andree-abecassis/andree-abecassis-literary-agent_logo_1-b531cbac02864497b301e74bc6b37aa9.jpg 1x" /></span></span>
            </td>
            <td align="right" valign="top" style="text-align: right;">

                <div style="clear:both; margin: 5px 5px;">
                    <div style="margin-bottom: 5px;">

<span class="fn n"> <span class="given-name">Andree</span> <span class="family-name">Abecassis</span> </span>


I am pretty sure it is not the case that (assuming you have copied and pasted your code correctly) the last statement gives you a SyntaxError, as you say; rather, it will give you an AttributeError, because for some mysterious reason you misspelled the method name, calling it findNext rather than find_next. Generally speaking, copy and paste your traceback rather than trying to "paraphrase" it.

However, since you already have a list of all the spans with the relevant class, the simplest approach is to change your second loop to search within each such span:

for i, a_span in enumerate(many_names):
    first_name = a_span.text 
    website = a_span.find('a', class_='url')
    if website is None:
        website = '*MISSING*'
    else:
        website = website.text
    last_name = aco = city = qm = 'YOU NEVER EXTRACT THESE!!!'
    myprint(i, first_name, last_name, aco, city, qm, website)

assuming, of course, that you did define a myprint function taking all of those arguments.

You will notice that I have set four variables to remind you that you never extract those values, something I suspect you will want to fix, right?-)
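If you have not actually written myprint yet, a minimal stand-in might look like this (the name and argument order are just taken from the call above; the output format is entirely up to you):

def myprint(i, first_name, last_name, aco, city, qm, website):
    # purely illustrative: print the record number followed by every field on one line
    print(' '.join(str(x) for x in (i, first_name, last_name, aco, city, qm, website)))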

EDIT: now that it appears the relationship between the tags being sought lies not in the structure of the HTML, but only in a fragile dependence on the sequence in which the tags happen to appear in the HTML text, a very different approach is needed. Here is one possibility:

from bs4 import BeautifulSoup

with open('ha.txt') as f:
    soup = BeautifulSoup(f)

def tag_of_interest(t):
    if t.name=='a': return t.attrs.get('class')==['url']
    if t.name=='span': return t.attrs.get('class')==['given-name']
    return False

for t in soup.find_all(tag_of_interest):
    print(t)

For example, when I save into ha.txt the HTML fragment now given in the edited Q, this script emits:

<span class="given-name">Richard</span>
<a class="url" href="http://www.3arts.com" rel="nofollow" target="new">http://www.3arts.com</a>
<span class="given-name">Andree</span>

So now all that is left is to group the relevant sequences of tags appropriately (which I imagine would also include other tags, e.g. class last-name &c). A class seems appropriate (functions such as myprint could neatly be recast as methods of that class, but I will skip that part):

class Entity(object):
    def __init__(self):
        # every field starts out unknown
        self.first_name = self.last_name = self.website = None  # &c

entities = []

for t in soup.find_all(tag_of_interest):
    if t.name == 'span' and t.attrs.get('class') == ['given-name']:
        # a given-name span starts a new record
        ent = Entity()
        ent.first_name = t.text
        entities.append(ent)
    else:
        if not entities:
            print('tag', t, 'out of context')
            continue
        # any other tag of interest belongs to the most recent record
        ent = entities[-1]
        if t.name == 'a' and t.attrs.get('class') == ['url']:
            ent.website = t.text
        # etc for other tags of interest

Finally, the entities list can be checked for entities that are missing mandatory bits of data, and so forth.
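For instance, a check for records that have a name but no url might look like this (a sketch that reuses the '*MISSING*' marker from the earlier loop and assumes the Entity attributes defined above):

for i, ent in enumerate(entities):
    # website is still None whenever no a class="url" tag followed that given-name span
    print(i, ent.first_name, ent.website if ent.website is not None else '*MISSING*')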