find_all() not finding any results when using Beautiful Soup + Requests

Question

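This is my first time using BeautifulSoup and Requests, and I'm trying to learn by scraping some information from a news site. The goal of the project is to read news highlights from the terminal, so I need to scrape and parse article titles and article body text effectively.

I'm still at the stage of getting the titles, but when I call find_all(), I get an empty result and nothing is stored. Here is my code: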

from bs4 import BeautifulSoup
from time import strftime
import requests

date = strftime("%Y/%m/%d")

url = "http://www.thedailybeast.com/cheat-sheets/" + date + "/cheat-sheet.html"

result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, "lxml")

titles = soup.find_all('h1 class="title multiline"')

print titles

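Any ideas? If anyone has suggestions or tips for improving what I have so far, or the approach I'm taking, please let me know. I'm always looking to get better!

Cheers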

Answer 1

You have put everything inside a single quoted string here:

titles = soup.find_all('h1 class="title multiline"')

This makes BeautifulSoup search for elements whose tag name is literally h1 class="title multiline", and no element has that name.

Use this instead:

titles = soup.find_all("h1", class_="title multiline")

Or, with a CSS selector:

titles = soup.select("h1.title.multiline")
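
To see the difference between the three calls, here is a minimal sketch run against a small, made-up HTML snippet (the markup and headline text are assumptions for illustration, not the real page). It uses the built-in "html.parser" so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page will differ.
html = """
<h1 class="title multiline">First headline</h1>
<h1 class="title">Second headline</h1>
<p class="title multiline">Not a heading</p>
"""

soup = BeautifulSoup(html, "html.parser")

# The whole string is treated as a tag name, so nothing matches.
wrong = soup.find_all('h1 class="title multiline"')

# Tag name and class filter passed as separate arguments.
by_keyword = soup.find_all("h1", class_="title multiline")

# CSS selector: h1 elements that carry both classes.
by_selector = soup.select("h1.title.multiline")

print(len(wrong))
print([h.get_text() for h in by_keyword])
print([h.get_text() for h in by_selector])
```

Only the first h1 should come back from the two corrected calls, since the second h1 lacks the multiline class and the p element is not an h1.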

Actually, because of the dynamic nature of the page, you need a different approach to get all of the titles:

import json

# The headline list is stored as a JSON string in the div's data-pageraillist attribute
results = json.loads(soup.find('div', {'data-pageraillist': True})['data-pageraillist'])
for result in results:
    print(result["title"])

This prints:

Hillary Email ‘Born Classified’
North Korean Internet Goes Down
Kid-Porn Cops Go to Gene Simmons’s Home
Baylor Player Convicted of Rape After Coverup
U.S. Calls In Aussie Wildfire Experts
Markets’ 2015 Gains Wiped Out
Black Lives Matters Unveils Platform
Sheriff Won’t Push Jenner Crash Charge 
Tear Gas Used on Migrants Near Macedonia
Franzen Considered Adopting Iraqi Orphan
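
The same idea can be tried offline. The sketch below assumes a div whose data-pageraillist attribute holds a JSON array of objects with a "title" key, as the snippet above implies; the HTML itself is invented for illustration. It also guards against the div being missing, which would otherwise raise a TypeError when indexing None.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical markup: a JSON array embedded in a data attribute.
html = """
<div data-pageraillist='[{"title": "First headline"}, {"title": "Second headline"}]'></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Passing True as the attribute value matches any element that has the attribute.
container = soup.find("div", {"data-pageraillist": True})

# Fall back to an empty list if the page layout has changed and the div is gone.
items = json.loads(container["data-pageraillist"]) if container is not None else []

titles = [item["title"] for item in items]
for title in titles:
    print(title)
```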

Answer 2

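You're close, but find_all only searches for tags; it doesn't work like a general-purpose search function.

So, if you want to filter by a tag together with attributes such as class, do this: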

soup.find_all('h1', {'class' : 'multiline'})
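
One practical difference is worth a quick check: filtering on a single class name matches any element that carries that class, whereas a string containing a space is compared against the exact class attribute value, so the order of the classes matters. The snippet below is a small sketch on made-up markup (an assumption, not the real page).

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<h1 class="title multiline">A</h1>
<h1 class="multiline title">B</h1>
<h1 class="title">C</h1>
"""

soup = BeautifulSoup(html, "html.parser")

# A single class name matches every h1 that has that class, in any position.
single = [h.get_text() for h in soup.find_all("h1", {"class": "multiline"})]

# A space-separated string must equal the attribute value exactly.
exact = [h.get_text() for h in soup.find_all("h1", {"class": "title multiline"})]

print(single)
print(exact)
```

If the class order on the page is not guaranteed, filtering on one class name or using a CSS selector such as h1.title.multiline is the safer choice.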
