Can Beautiful Soup give a chunk of a file based on span class=?
I am trying to extract some simple fields from an HTML page. It is a table with some repeating data.
Every record has a FIRST_NAME (and a bunch of other stuff), but not every record has a website. So my xpath solution returns 10 names but only 9 website urls.
fname= tree.xpath('//span[@class="given-name"]/text()')
fweb = tree.xpath('//a[@class="url"]/text()')
With that approach I cannot tell which record is missing its url.
So now I want to break the file into chunks; each chunk would start at a span of class GIVEN-NAME and end just before the next GIVEN-NAME.
How do I do that? In my code I have an infinite loop that keeps returning the first instance of the span with class FIRST-NAME and never makes any progress through the HTML file.
with open('sample A.htm') as f:
    soup = bs4.BeautifulSoup(f.read())

many_names = soup.find_all('span', class_='given-name')
print len(many_names)
for i in range(len(many_names)):
    first_name = soup.find('span', class_='given-name').text
    website = soup.find('a', class_='url').text
    myprint(i, first_name, last_name, aco, city, qm, website)
    soup.find_next('span', class_='given-name')
The last statement (find_next) does not seem to do anything.
With or without it, the loop just reads from the beginning over and over again. What is the right way to do this?
EDIT: a sample from the HTML file (I trimmed it, because there is a lot more).
Physically, the layout is: a span given-name, blah blah blah, the URL buried somewhere, and then another span given-name.
</div>
<div class="connections-list cn-list-body cn-clear" id="cn-list-body">
<div class="cn-list-section-head" id="cn-char-A"></div><div class="cn-list-row-alternate vcard individual art-literary-agents celebrity-nonfiction-literary-agents chick-lit-fiction-literary-agents commercial-fiction-literary-agents fiction-literary-agents film-entertainment-literary-agents history-nonfiction-literary-agents literary-fiction-literary-agents military-war-literary-agents multicultural-nonfiction-literary-agents multicultural-fiction-literary-agents music-literary-agents new-york-literary-agents-ny nonfiction-literary-agents photography-literary-agents pop-culture-literary-agents religion-nonfiction-literary-agents short-story-collection-literary-agents spirituality-literary-agents sports-nonfiction-literary-agents usa-literary-agents womens-issues-literary-agents" id="richard-abate" data-entry-type="individual" data-entry-id="19337" data-entry-slug="richard-abate"><div id="entry-id-193375501ffd6551a6" class="cn-entry">
<table border="0px" bordercolor="#E3E3E3" cellspacing="0px" cellpadding="0px">
<tr>
<td align="left" width="55%" valign="top">
<span class="cn-image-style"><span style="display: block; max-width: 100%; width: 125px"><img height="125" width="125" sizes="100vw" class="cn-image logo" alt="Logo for Richard Abate" title="Logo for Richard Abate" srcset="http://literaryagencies.com/wp-content/uploads/connections-images/richard-abate/richard-abate-literary-agent_logo_1-7bbdb1a0dbafe8417e994150608c55e4.jpg 1x" /></span></span>
</td>
<td align="right" valign="top" style="text-align: right;">
<div style="clear:both; margin: 5px 5px;">
<div style="margin-bottom: 5px;">
<span class="fn n"> <span class="given-name">Richard</span> <span class="family-name">Abate</span> </span>
<span class="title">3 Arts Entertainment</span>
<span class="org"><span class="organization-unit">Query method(s): Postal Mail *</span></span>
</div>
<span class="address-block">
<span class="adr"><span class="address-name">Work</span> <span class="street-address">16 West 22th St</span> <span class="locality">New York</span> <span class="region">NY</span> <span class="postal-code">10010</span> <span class="country-name">USA</span><span class="type" style="display: none;">work</span></span>
</span>
</div>
</td>
</tr>
<tr>
<td valign="bottom" style="text-align: left;">
<a class="cn-note-anchor toggle-div" id="note-anchor-193375501ffd6551a6" href="#" data-uuid="193375501ffd6551a6" data-div-id="note-block-193375501ffd6551a6" data-str-show="Show Notes" data-str-hide="Close Notes">Show Notes</a> | <a class="cn-bio-anchor toggle-div" id="bio-anchor-193375501ffd6551a6" href="#" data-uuid="193375501ffd6551a6" data-div-id="bio-block-193375501ffd6551a6" data-str-show="Show Bio" data-str-hide="Close Bio">Show Bio</a>
</td>
<td align="right" valign="bottom" style="text-align: right;">
<a class="url" href="http://www.3arts.com" target="new" rel="nofollow">http://www.3arts.com</a>
<span class="cn-image-style"><span style="display: block; max-width: 100%; width: 125px"><img height="125" width="125" sizes="100vw" class="cn-image logo" alt="Logo for Andree Abecassis" title="Logo for Andree Abecassis" srcset="http://literaryagencies.com/wp-content/uploads/connections-images/andree-abecassis/andree-abecassis-literary-agent_logo_1-b531cbac02864497b301e74bc6b37aa9.jpg 1x" /></span></span>
</td>
<td align="right" valign="top" style="text-align: right;">
<div style="clear:both; margin: 5px 5px;">
<div style="margin-bottom: 5px;">
<span class="fn n"> <span class="given-name">Andree</span> <span class="family-name">Abecassis</span> </span>
I'm pretty sure it is not the case that, assuming you've copied and pasted your code correctly, the last statement gives you a SyntaxError as you say; rather, it would give you an AttributeError, because you misspelled the method name findNext, calling it find_next for some mysterious reason. In general, copy and paste your traceback rather than trying to "paraphrase" it.
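Note, too, that find_next returns the tag it finds rather than advancing any kind of cursor, so a bare soup.find_next(...) whose result is never assigned cannot make the loop move forward. If you did want to walk the document that way, you would have to keep reusing the returned tag, along these lines (just a sketch, not necessarily what you want here):
# Sketch: step from one given-name span to the next by always calling
# find_next on the tag returned by the previous call.
tag = soup.find('span', class_='given-name')
while tag is not None:
    print(tag.text)
    tag = tag.find_next('span', class_='given-name')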
However, since you already have the list of all the spans with the relevant class, the simplest approach is to change your second loop to search within each of those spans:
for i, a_span in enumerate(many_names):
    first_name = a_span.text
    website = a_span.find('a', class_='url')
    if website is None:
        website = '*MISSING*'
    else:
        website = website.text
    last_name = aco = city = qm = 'YOU NEVER EXTRACT THESE!!!'
    myprint(i, first_name, last_name, aco, city, qm, website)
This is assuming you did define a function myprint taking all of those parameters, of course.
You'll notice I've set four variables to remind you that you never extract those values, which I suspect you'll want to fix, right? :-)
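For what it's worth, the HTML sample in the question does carry most of those fields in nearby spans (family-name, title, locality, organization-unit), so one possible sketch for filling them in is below. It leans on the order in which those tags appear in that sample (the same fragile assumption discussed in the edit that follows), and the myprint stub plus the guesses for aco and qm are mine, not anything established by the question:
# Sketch only: pull the remaining fields from nearby spans, based on the
# class names visible in the question's HTML sample.  The field-to-span
# mapping for aco and qm is a guess.
def myprint(*fields):                     # stand-in for the OP's myprint
    print('\t'.join(str(f) for f in fields))

def text_or_missing(tag):
    return tag.text if tag is not None else '*MISSING*'

for i, a_span in enumerate(many_names):
    first_name = a_span.text
    # family-name sits next to given-name inside the same "fn n" span
    last_name = text_or_missing(a_span.find_next_sibling('span', class_='family-name'))
    aco = text_or_missing(a_span.find_next('span', class_='title'))             # agency? (guess)
    qm = text_or_missing(a_span.find_next('span', class_='organization-unit'))  # query method (guess)
    city = text_or_missing(a_span.find_next('span', class_='locality'))
    website = text_or_missing(a_span.find_next('a', class_='url'))
    myprint(i, first_name, last_name, aco, city, qm, website)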
Edit: it now appears that the relationship between the tags being sought lies not in the structure of the HTML but in a fragile reliance on the sequence in which the tags happen to appear in the HTML text, which requires a very different approach. Here's one possibility:
from bs4 import BeautifulSoup

with open('ha.txt') as f:
    soup = BeautifulSoup(f)

def tag_of_interest(t):
    if t.name=='a': return t.attrs.get('class')==['url']
    if t.name=='span': return t.attrs.get('class')==['given-name']
    return False

for t in soup.find_all(tag_of_interest):
    print(t)
For example, when I save the HTML fragment now given in the Q (after your edit) as ha.txt, this script emits:
<span class="given-name">Richard</span>
<a class="url" href="http://www.3arts.com" rel="nofollow" target="new">http://www.3arts.com</a>
<span class="given-name">Andree</span>
So all that is left now is to appropriately group the relevant sequences of tags (which I imagine would also include other tags, e.g. class last-name &c). A class seems appropriate (functions such as myprint could neatly be recast as methods of that class, but I'll skip that part).
class Entity(object):
    def __init__(self):
        self.first_name = self.last_name = self.website = None  # &c

entities = []
for t in soup.find_all(tag_of_interest):
    if t.name == 'span' and t.attrs.get('class') == ['given-name']:
        ent = Entity()
        ent.first_name = t.text
        entities.append(ent)
    else:
        if not entities:
            print('tag', t, 'out of context')
            continue
        ent = entities[-1]
        if t.name == 'a' and t.attrs.get('class') == ['url']:
            ent.website = t.text
        # etc for other tags of interest
Finally, the entities list can be checked for entities that are missing mandatory bits of data, and so on.
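As a minimal sketch of that last check, assuming the website url is the one piece of data that may be missing (as in the question):
# Sketch: report which of the Entity objects built above lack a website url.
for i, ent in enumerate(entities):
    if ent.website is None:
        print('record %d (%s) is missing its website url' % (i, ent.first_name))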