Extracting anchor text from span class with BeautifulSoup
This is the HTML I want to scrape:
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>,
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>
I want to get the anchor text of each href: cinematic, dissolve, epic, etc.
This is my code:
import urllib2
from bs4 import BeautifulSoup

url = urllib2.urlopen("http://example.com")
content = url.read()
soup = BeautifulSoup(content)
links = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for link in links:
    # This is the line that raises the TypeError described below
    print link.find_all('a')['href']
If I do this with "link.find_all", I get an error: TypeError: list indices must be integers, not str.
But if I print link.find('a')['href'], I only get the first one.
How can I get all of them?
You can do the following:
from bs4 import BeautifulSoup

content = '''
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>,
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>
'''

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(content, "html.parser")
spans = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for span in spans:
    # find_all returns a list of Tag objects, so index each tag, not the list
    links = span.find_all('a')
    for link in links:
        print(link['href'])
Output:
/tags/cinematic
/tags/dissolve
/tags/epic
/tags/fly
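The question asks for the anchor text rather than the href values, so here is a minimal sketch of the same nested loop that collects the text instead. It assumes the HTML shown in the question; the variable name anchor_texts is only illustrative, and html.parser is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

html = '''
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>,
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>
'''

soup = BeautifulSoup(html, "html.parser")

# Collect the visible text of every <a> inside the matching span(s)
anchor_texts = []
for span in soup.find_all("span", class_="meta-attributes__attr-tags"):
    for link in span.find_all("a"):
        anchor_texts.append(link.get_text(strip=True))

print(anchor_texts)  # ['cinematic', 'dissolve', 'epic', 'fly']
```

link.text would work here as well; get_text(strip=True) only adds trimming of surrounding whitespace.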
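link.find_all('a') returns a list of bs4 tags, and a list cannot be indexed with a string such as 'href'. You probably want to index by href for each of the links. So maybe this is closer to what you need: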
span = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for links in span:
    for link in links.find_all('a'):
        print(link['href'])
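To make the difference between find and find_all concrete, the following sketch reproduces the TypeError from the question on a shortened copy of the HTML and then shows the per-tag indexing that avoids it. The shortened markup and the variable names (raised_type_error, hrefs, first_href) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<span class="meta-attributes__attr-tags">'
    '<a href="/tags/cinematic">cinematic</a>, '
    '<a href="/tags/epic">epic</a>'
    '</span>'
)
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="meta-attributes__attr-tags")

# find_all returns a list-like collection of Tag objects
links = span.find_all("a")

# Indexing the collection itself with a string fails, as in the question
raised_type_error = False
try:
    links["href"]
except TypeError:
    raised_type_error = True

# Indexing each Tag works, because a Tag exposes its attributes by key
hrefs = [link["href"] for link in links]

# find returns only the first matching Tag, hence only one href
first_href = span.find("a")["href"]

print(raised_type_error, hrefs, first_href)
```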
from bs4 import BeautifulSoup

html = """
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>,
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>
"""

# The "lxml" parser requires the lxml package to be installed
soup = BeautifulSoup(html, "lxml")
spans = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for span in spans:
    for link in span.find_all('a'):
        # Print the anchor text together with its href
        print(link.text, link['href'])
Another, more expensive way could be:
from bs4 import BeautifulSoup

html = """
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>,
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>
"""

soup = BeautifulSoup(html, "lxml")
# Visit every <a> in the document, then filter by the parent's class
links = soup.find_all("a")
for link in links:
    if 'meta-attributes__attr-tags' not in link.parent.get('class', []):
        continue
    print(link.text, link['href'])
You can avoid the nested loop, or any additional if check inside the loop, by using a CSS selector:
for link in soup.select(".meta-attributes__attr-tags a[href]"):
    print(link["href"], link.get_text())
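The selector snippet relies on a soup object that was built earlier. A self-contained sketch, assuming the HTML from the question and using an illustrative variable name (tags), might look like this:

```python
from bs4 import BeautifulSoup

html = '''
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>,
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>
'''

soup = BeautifulSoup(html, "html.parser")

# One selector: every <a> with an href that descends from the tagged element
tags = [
    (link.get_text(), link["href"])
    for link in soup.select(".meta-attributes__attr-tags a[href]")
]

for text, href in tags:
    print(text, href)
```

The a[href] part of the selector skips anchors without an href attribute, so link["href"] cannot raise a KeyError inside the loop.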