Using BeautifulSoup to extract the title of a link
I am trying to extract the title of a link using BeautifulSoup. The code I am working with is as follows:
import requests
from bs4 import BeautifulSoup
url = "http://www.example.com"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "lxml")
for link in soup.findAll('a', {'class': 'a-link-normal s-access-detail-page  a-text-normal'}):
    title = link.get('title')
    print(title)
Now, a sample link element contains the following:
<a class="a-link-normal s-access-detail-page a-text-normal" href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" title="Introduction To Computation And Programming Using Python"><h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Introduction To Computation And Programming Using <strong>Python</strong></h2></a>
However, nothing is displayed after I run the code above. How can I extract the value stored in the title attribute of the anchor tag?
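Well, it seems you have put two spaces between s-access-detail-page and a-text-normal, so no matching links are found. Try with the correct number of spaces, then print the number of links found. You can also print the tag itself with print(link).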
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.in/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=python"
source_code = requests.get(url)
plain_text = source_code.content
soup = BeautifulSoup(plain_text, "lxml")
# Single spaces between the class names, exactly as in the class attribute
links = soup.findAll('a', {'class': 'a-link-normal s-access-detail-page a-text-normal'})
print(len(links))
for link in links:
    title = link.get('title')
    print(title)
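The effect of the extra space can be reproduced without a network request. The following is a minimal sketch, assuming a shortened stand-in for the anchor from the question (the title value "Sample title" is a placeholder, not data from the page) and the built-in html.parser instead of lxml:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the anchor in the question; the title is a placeholder.
html = ('<a class="a-link-normal s-access-detail-page a-text-normal" '
        'title="Sample title">text</a>')
soup = BeautifulSoup(html, "html.parser")

# Two spaces before a-text-normal: the string differs from the class value.
two_spaces = soup.find_all(
    'a', {'class': 'a-link-normal s-access-detail-page  a-text-normal'})
# Single spaces throughout: the string equals the class value.
one_space = soup.find_all(
    'a', {'class': 'a-link-normal s-access-detail-page a-text-normal'})

print(len(two_spaces))            # 0
print(len(one_space))             # 1
print(one_space[0].get('title'))  # Sample title
```

Printing the match count first, as suggested above, makes it obvious whether the problem lies in the search or in reading the attribute.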
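You are searching for an exact string here, by using multiple classes. In that case the class string has to match exactly, with single spaces.
See the Searching by CSS class section in the documentation: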
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
But searching for variants of the string value won’t work:
css_soup.find_all("p", class_="strikeout body")
# []
You'd be better off searching for individual classes:
soup.find_all('a', class_='a-link-normal')
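To compare the exact-string search with a single-class search, here is a small sketch. It assumes css_soup holds one paragraph with the two classes used in the documentation excerpt quoted above; that markup is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Assumed markup: a single paragraph carrying two classes.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Exact string, same order as in the attribute: matches.
exact = css_soup.find_all("p", class_="body strikeout")
# Same classes in a different order: no match.
reordered = css_soup.find_all("p", class_="strikeout body")
# One class on its own: matches whatever other classes the tag has.
single = css_soup.find_all("p", class_="strikeout")

print(len(exact), len(reordered), len(single))  # 1 0 1
```

A single-class search is less brittle, but it may also pick up tags that share that one class, so choose the most specific class available.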
If you must match multiple classes, use a CSS selector:
soup.select('a.a-link-normal.s-access-detail-page.a-text-normal')
and it won't matter in what order the classes are listed.
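As a rough illustration of that order independence, the sketch below runs the selector with the classes in two different orders against a shortened stand-in for the anchor (the title value is a placeholder) and then reads the title attribute, which is what the question asks for:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the anchor in the question; the title is a placeholder.
html = ('<a class="a-link-normal s-access-detail-page a-text-normal" '
        'title="Sample title"><h2>Sample <strong>title</strong></h2></a>')
soup = BeautifulSoup(html, "html.parser")

# Classes listed in the same order as in the attribute.
in_order = soup.select('a.a-link-normal.s-access-detail-page.a-text-normal')
# Same classes, different order: the selector still matches.
reordered = soup.select('a.a-text-normal.a-link-normal.s-access-detail-page')

titles = [link.get('title') for link in reordered]
print(len(in_order), len(reordered))  # 1 1
print(titles)                         # ['Sample title']
```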
Demo:
>>> from bs4 import BeautifulSoup
>>> plain_text = u'<a class="a-link-normal s-access-detail-page a-text-normal" href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" title="Introduction To Computation And Programming Using Python"><h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Introduction To Computation And Programming Using <strong>Python</strong></h2></a>'
>>> soup = BeautifulSoup(plain_text, "html.parser")
>>> for link in soup.find_all('a', class_='a-link-normal'):
...     print(link.text)
...
Introduction To Computation And Programming Using Python
>>> for link in soup.select('a.a-link-normal.s-access-detail-page.a-text-normal'):
...     print(link.text)
...
Introduction To Computation And Programming Using Python
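The demo prints link.text, which is the text inside the anchor. The question asks for the title attribute, which is read with link.get('title'). A minimal sketch of the difference, assuming two made-up anchors of which only the first carries a title:

```python
from bs4 import BeautifulSoup

# Two made-up anchors for illustration; only the first has a title attribute.
html = (
    '<a class="a-link-normal" title="First title">'
    '<h2>First <strong>text</strong></h2></a>'
    '<a class="a-link-normal"><h2>No title attribute</h2></a>'
)
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a', class_='a-link-normal')

texts = [link.text for link in links]           # text content of each anchor
titles = [link.get('title') for link in links]  # None where title is missing

# Indexing raises KeyError when the attribute is absent; .get() does not.
try:
    links[1]['title']
    missing_raises = False
except KeyError:
    missing_raises = True

# title=True keeps only the anchors that have a title attribute at all.
with_title = soup.find_all('a', class_='a-link-normal', title=True)

print(texts)            # ['First text', 'No title attribute']
print(titles)           # ['First title', None]
print(len(with_title))  # 1
```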