How to extract the link from an <a> inside <h2 class=section-heading> with BeautifulSoup

Question

I am trying to extract a link that is written like this:

<h2 class="section-heading">
    <a href="http://www.nytimes.com/pages/arts/index.html">Arts »</a>
</h2>

My code is:

from bs4 import BeautifulSoup
import requests, re

def get_data():
    url='http://www.nytimes.com/'
    s_code=requests.get(url)
    plain_text = s_code.text
    soup = BeautifulSoup(plain_text)
    head_links=soup.findAll('h2', {'class':'section-heading'})

    for n in head_links :
       a = n.find('a')
       print a
       print n.get['href'] 
       #print a['href']
       #print n.get('href')
       #headings=n.text
       #links = n.get('href')
       #print headings, links

get_data()

A statement like "print a" just prints the whole <a> line inside the <h2 class=section-heading>, i.e.

<a href="http://www.nytimes.com/pages/world/index.html">World »</a>

But when I execute "print n.get['href']", it throws an error:

print n.get['href'] 
TypeError: 'instancemethod' object has no attribute '__getitem__'

Am I doing something wrong? Please help.

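I could not find a similar question here. My case is a little unusual in that I am trying to extract the links inside elements with a specific class name, section-heading.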

Answer 1

First, you want the href of the a element, so on that line you should access a, not n. Second, it should be either

a.get('href')

or

a['href']

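If no such attribute is found, the latter form raises an exception (a KeyError), whereas the former returns None, just like the usual dictionary/mapping interface. Since .get is a method, it has to be called (.get(...)); indexing/element access does not work on it (.get[...]), and that is the problem in this question.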
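To make the difference concrete, here is a minimal sketch that runs against the HTML snippet from the question instead of the live page. It assumes Python 3 and the 'html.parser' backend, and the variable names are only illustrative. In Python 3, subscripting the method still raises a TypeError, although the message differs from the Python 2 one quoted in the question.

```python
from bs4 import BeautifulSoup

# Inline copy of the markup from the question, so no network access is needed.
html = """
<h2 class="section-heading">
    <a href="http://www.nytimes.com/pages/arts/index.html">Arts »</a>
</h2>
"""

soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2", class_="section-heading")
a = h2.find("a")

href_via_get = a.get("href")    # method call; returns None if the attribute is missing
href_via_index = a["href"]      # item access; raises KeyError if the attribute is missing
missing = a.get("title")        # this <a> has no title attribute, so the result is None
h2_href = h2.get("href")        # the <h2> itself has no href, so this is None as well

try:
    h2.get["href"]              # subscripting the method object, as in the question
    subscript_error = None
except TypeError as exc:
    subscript_error = exc

print(href_via_get)
print(missing, h2_href)
print(type(subscript_error).__name__)
```

Note that even the corrected call n.get('href') would not give the link here, because the href lives on the <a>, not on the <h2>.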

Note that find can also fail there, returning None, so perhaps you want to iterate over n.find_all('a', href=True) instead:

for n in head_links:
   for a in n.find_all('a', href=True):
       print(a['href'])
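The following sketch shows why the nested loop is the safer choice. The sample HTML is made up for illustration: one heading has no <a> at all, and another has an <a> without an href. It assumes Python 3 and the 'html.parser' backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first heading contains a usable link.
html = """
<h2 class="section-heading"><a href="/pages/world/index.html">World</a></h2>
<h2 class="section-heading">No link here</h2>
<h2 class="section-heading"><a name="anchor-only">Anchor without href</a></h2>
"""

soup = BeautifulSoup(html, "html.parser")
head_links = soup.find_all("h2", class_="section-heading")

# find() returns None when a heading contains no <a>, so a["href"] on the
# result would fail for the second heading.
first_matches = [n.find("a") for n in head_links]

# find_all(..., href=True) only yields <a> tags that carry an href attribute,
# so headings without a usable link are skipped.
hrefs = [a["href"] for n in head_links for a in n.find_all("a", href=True)]

print(first_matches[1])
print(hrefs)
```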

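Even easier than using find_all is the select method, which takes a CSS selector. Here, in a single operation, we get only those <a> elements that have an href attribute and sit inside an <h2 class="section-heading">, as easily as with jQuery.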

soup = BeautifulSoup(plain_text, 'html.parser')  # name the parser explicitly
for a in soup.select('h2.section-heading a[href]'):
    print(a['href'])
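As a quick offline check of what that selector matches, here is a small sketch on made-up markup, assuming Python 3 and the 'html.parser' backend. Links under a heading with a different class, links outside any heading, and an <a> without an href are all left out.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mixing matching and non-matching links.
html = """
<h2 class="section-heading"><a href="http://www.nytimes.com/pages/world/index.html">World »</a></h2>
<h2 class="other"><a href="/ignored.html">Ignored</a></h2>
<h2 class="section-heading"><a>No href</a></h2>
<p><a href="/also-ignored.html">Also ignored</a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# "h2.section-heading a[href]" reads: any <a> that has an href attribute and
# is a descendant of an <h2> whose class list contains "section-heading".
selected = [a["href"] for a in soup.select("h2.section-heading a[href]")]

print(selected)
```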

(Also, please use lower-case method names in any new code that you write, e.g. find_all rather than findAll.)

python

beautifulsoup

python-requests

bs4