Python: fetch title and PDF link from a URL

Question

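I am trying to fetch the book titles and the book links embedded in the page at a URL. The HTML source of the page looks like the sample below; I have taken only a small part of it for illustration.

The link is here. A small portion of the HTML source follows.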

<section>
  <div class="book row" isbn-data="1601982941">
    <div class="col-lg-3">
      <div class="book-cats">Artificial Intelligence</div>
      <div style="width:100%;">
        <img alt="Learning Deep Architectures for AI" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/Learning-Deep-Architectures-for-AI_2015_12_30_.width-200.png" width="200"/>
      </div>
    </div>
    <div class="col-lg-6">
      <div class="star-ratings"></div>
      <h2>Learning Deep Architectures for AI</h2>
      <span class="meta-auth"><b>Yoshua Bengio, 2009</b></span>
      <div class="meta-auth-ttl"></div>
      <p>Foundations and Trends(r) in Machine Learning.</p>
      <div>
        <a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
        <a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a>
      </div>
    </div>
  </div>
</section>
<section>
  <div class="book row" isbn-data="1496034023">
    <div class="col-lg-3">
      <div class="book-cats">Artificial Intelligence</div>
      <div style="width:100%;">
        <img alt="The LION Way: Machine Learning plus Intelligent Optimization" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/The-LION-Way-Learning-plus-Intelligent-Optimiz.width-200.png" width="200"/>
      </div>
    </div>
    <div class="col-lg-6">
      <div class="star-ratings"></div>
      <h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
      <span class="meta-auth"><b>Roberto Battiti &amp; Mauro Brunato, 2013</b></span>
      <div class="meta-auth-ttl"></div>
      <p>Learning and Intelligent Optimization (LION) is the combination of learning from data and optimization applied to solve complex and dynamic problems. Learn about increasing the automation level and connecting data directly to decisions and actions.</p>
      <div>
        <a class="btn" href="http://www.e-booksdirectory.com/details.php?ebook=9575" rel="nofollow">View Free Book</a>
        <a class="btn" href="http://amzn.to/1FcalRp" rel="nofollow">See Reviews</a>
      </div>
    </div>
  </div>
</section>

I tried the code below.

It fetches only the book titles, and they are still printed with the <h2> tags. I would like to print both the book name and the book's PDF link.

#!/usr/bin/python3
from bs4 import BeautifulSoup as bs
import urllib
import urllib.request as ureq


web_res = urllib.request.urlopen("https://www.learndatasci.com/free-data-science-books/").read()

soup = bs(web_res, 'html.parser')

headers = soup.find_all(['h2'])
print(*headers, sep='\n')

#divs = soup.find_all('div')
#print(*divs, sep="\n\n")

header_1 = soup.find_all('h2', class_='book-container')
print(header_1)

Output:

<h2>Artificial Intelligence A Modern Approach, 1st Edition</h2>
<h2>Learning Deep Architectures for AI</h2>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<h2>Big Data Now: 2012 Edition</h2>
<h2>Disruptive Possibilities: How Big Data Changes Everything</h2>
<h2>Real-Time Big Data Analytics: Emerging Architecture</h2>
<h2>Computer Vision</h2>
<h2>Natural Language Processing with Python</h2>
<h2>Programming Computer Vision with Python</h2>
<h2>The Elements of Data Analytic Style</h2>
<h2>A Course in Machine Learning</h2>
<h2>A First Encounter with Machine Learning</h2>
<h2>Algorithms for Reinforcement Learning</h2>
<h2>A Programmer's Guide to Data Mining</h2>
<h2>Bayesian Reasoning and Machine Learning</h2>
<h2>Data Mining Algorithms In R</h2>
<h2>Data Mining and Analysis: Fundamental Concepts and Algorithms</h2>
<h2>Data Mining: Practical Machine Learning Tools and Techniques</h2>
<h2>Data Mining with Rattle and R</h2>
<h2>Deep Learning</h2>

Expected output:

Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf

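Please help me understand how to achieve this. I have googled it, but for lack of knowledge I could not work it out. When I look at the HTML source there are many divs and classes, so I am unsure which class to select to get the href and the h2.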

Answer 1

You can get the main idea from this code:

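# Each book has two a.btn links, so keep only the "View Free Book" ones;
# otherwise zip() pairs titles with the wrong links after the first book.
titles = soup.find_all('h2')
links = soup.find_all('a', class_='btn', string='View Free Book')
for h2, a in zip(titles, links):
    print('Title:', h2.text)
    print('Link:', a.get('href'))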
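
The <h2> tags in the question's output appear because find_all() returns Tag objects, and printing a Tag renders its full markup; .text or get_text() returns only the enclosed text. A minimal sketch of the difference, using an inline snippet as a stand-in for the downloaded page:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded page (one heading from the question).
html = "<h2>Learning Deep Architectures for AI</h2>"
soup = BeautifulSoup(html, "html.parser")

h2 = soup.find("h2")
as_markup = str(h2)                # what print(h2) shows: the tag is included
as_text = h2.get_text(strip=True)  # only the text inside the tag

print(as_markup)
print(as_text)
```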

Answer 2

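The HTML is nicely structured, and you can make use of that here. The site evidently uses Bootstrap as its styling scaffolding (the row and col-[size]-[gridcount] classes), which you can mostly ignore.

You essentially have: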

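One <div class="book"> per book, containing
- a first column with
  - a <div class="book-cats"> category and
  - an image
- a second column with
  - a <div class="star-ratings"> ratings block
  - an <h2> book title
  - a <span class="meta-auth"> author line
  - a <p> book description
  - two links, each an <a class="btn" ...>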

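Most of this can be ignored. The title and the link you want are both the first element of their type inside the book div, so you could simply use element.nested_element to get either one.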
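
As a rough illustration of the element.nested_element access described above, here is a sketch on a trimmed copy of one book block from the question (the trimming is mine; the tags and URLs are from the sample):

```python
from bs4 import BeautifulSoup

# Trimmed copy of one book block from the question's sample HTML.
html = """
<div class="book row">
  <h2>Learning Deep Architectures for AI</h2>
  <div>
    <a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf">View Free Book</a>
    <a class="btn" href="http://amzn.to/1WePh0N">See Reviews</a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
book = soup.find("div", class_="book")

title = book.h2.get_text(strip=True)  # book.h2 is the first <h2> inside book
link = book.a["href"]                 # book.a is the first <a> inside book

print("Title:", title)
print("Link:", link)
```

Attribute access such as book.h2 is shorthand for book.find("h2") and evaluates to None when nothing matches, so it is only safe when the element is known to exist.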

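So all you have to do is:

Loop over all the book divs.
For each such div, take the h2 and the first a element.
For the title, take the text contained in the h2.
For the link, take the href attribute of the a anchor.

Like this: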

for book in soup.select("div.book:has(h2):has(a.btn[href])"):
    title = book.h2.get_text(strip=True)
    link = book.select_one("a.btn[href]")["href"]
    # store or process title and link
    print("Title:", title)
    print("Link:", link)

I used .select_one() with a CSS selector to be more specific about which link element to accept; .btn specifies the class, and [href] requires that an href attribute be present.
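
To see why the stricter selector can matter, consider a hypothetical block (not taken from the real page) in which the first anchors lack an href. In this sketch plain attribute access picks the wrong element, while select_one("a.btn[href]") skips to the first usable link:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first two anchors are invented for illustration.
html = """
<div class="book">
  <a name="top">anchor without href</a>
  <a class="btn">button without href</a>
  <a class="btn" href="http://www.e-booksdirectory.com/details.php?ebook=9575">View Free Book</a>
</div>
"""
book = BeautifulSoup(html, "html.parser").div

first_a = book.a                         # first <a> of any kind
chosen = book.select_one("a.btn[href]")  # first <a class="btn"> that has an href

print(first_a.get("href"))
print(chosen["href"])
```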

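I also tightened the book search by limiting it to divs that have both a title and at least one link; the :has(...) selector restricts matches to elements that contain the given descendant elements.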
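
A small sketch of the :has(...) filtering, on invented markup with example.com placeholder URLs (none of this is from the real page). It assumes a BeautifulSoup version whose select() supports :has(), which it does when the soupsieve package is installed alongside it:

```python
from bs4 import BeautifulSoup

# Invented entries: only the first has both a title and a link with href.
html = """
<div class="book"><h2>Complete entry</h2><a class="btn" href="http://example.com/a.pdf">View Free Book</a></div>
<div class="book"><h2>Title only, no link</h2></div>
<div class="book"><a class="btn" href="http://example.com/b.pdf">Link only, no title</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

all_books = soup.select("div.book")
complete = soup.select("div.book:has(h2):has(a.btn[href])")
titles = [book.h2.get_text(strip=True) for book in complete]

print(len(all_books), "book divs,", len(complete), "complete")
print(titles)
```

Filtering in the selector means the loop body never sees a div where book.h2 or the link lookup would come back as None.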

The above produces:

Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
Title: Learning Deep Architectures for AI
Link: http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
Title: The LION Way: Machine Learning plus Intelligent Optimization
Link: http://www.e-booksdirectory.com/details.php?ebook=9575
... etc ...
