Article web scraping using Beautiful Soup
I am trying to get the article body, header, and publication date from the URL below:
https://www.argusmedia.com/en/news/2214037-us-hrc-prices-rise-as-supply-remains-tight
When I try to scrape the 'article' container with class "news-container cf", it returns 0 rows.
# Reprex code
import requests
from bs4 import BeautifulSoup

url = "https://www.argusmedia.com/en/news/2214037-us-hrc-prices-rise-as-supply-remains-tight"
# Request the page (verify=False skips TLS certificate verification)
r1 = requests.get(url, verify=False)
print(r1.status_code)
# Save the raw page content in coverpage
coverpage = r1.content
# Soup creation
soup1 = BeautifulSoup(coverpage, "html5lib")
# News identification
coverpage_news = soup1.find_all('article', class_='news-container cf')
print(len(coverpage_news))
Because the content is loaded dynamically, you need to call the API directly.
import requests
data = requests.get('https://www.argusmedia.com/api/news/2214037/us-hrc-prices-rise-as-supply-remains-tight').json()
body = data['AmpBody']
title = data['Title']
date = data['PublishedDate']
year = data['PublishedYear']
print(body, title, date, year, sep='\n')
# <article><p class="lead">US hot-roll...
# US HRC: Prices rise as supply remains tight
# 11 May
# 2021
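Since AmpBody comes back as an HTML fragment rather than plain text, it may still need to be parsed. The following is a minimal sketch of that step, not a definitive implementation. It assumes the response has the four keys used above; the `sample` dictionary, the `extract_article` helper, and the placeholder second paragraph are hypothetical and stand in for the real `data` so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the API response above; the second
# paragraph is a placeholder, not real article text.
sample = {
    "AmpBody": (
        '<article><p class="lead">US hot-rolled coil (HRC) prices continued '
        "to trend upward as supplies remain tight and demand stays elevated."
        "</p><p>Placeholder second paragraph.</p></article>"
    ),
    "Title": "US HRC: Prices rise as supply remains tight",
    "PublishedDate": "11 May",
    "PublishedYear": "2021",
}


def extract_article(data):
    """Return title, date and plain-text body from an API-style dict."""
    # Parse the HTML fragment stored in AmpBody.
    soup = BeautifulSoup(data["AmpBody"], "html.parser")
    # Collect the text of each <p>, one paragraph per line.
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return {
        "title": data["Title"],
        "date": f'{data["PublishedDate"]} {data["PublishedYear"]}',
        "body": "\n".join(paragraphs),
    }


article = extract_article(sample)
print(article["title"])
print(article["date"])
print(article["body"])
```

With a live response, passing `data` in place of `sample` should work the same way, provided the keys are present.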
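That page runs JavaScript.
Requests is an HTTP library; it cannot run JavaScript.
To 'see' the HTML of a JavaScript-driven page, you need to process all the code on the page and actually render the content.
One way to do this is with the requests_html module.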
from requests_html import HTMLSession
session = HTMLSession()
resp = session.get('your_url')
# this call executes the page's JavaScript (the first run downloads Chromium)
resp.html.render()
Output:
resp.text
{"AmpBody":"<article><p class=\"lead\">US hot-rolled coil (HRC) prices continued to trend upward as supplies remain tight and demand stays elevated...}
From the docs:
If you want to search for tags that match two or more CSS classes, you should use a CSS selector:
soup1.select("article.news-container.cf")
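To illustrate the difference the docs describe, here is a small offline sketch. The HTML snippet is made up for the example and is not taken from the Argus page. Passing a space-separated string to `class_` matches only tags whose class attribute is exactly that string, while the CSS selector matches any tag that carries both classes.

```python
from bs4 import BeautifulSoup

# Made-up markup covering the class-attribute variations.
html = """
<article class="news-container cf">exact order</article>
<article class="cf news-container">reversed order</article>
<article class="news-container cf extra">extra class</article>
<article class="news-container">one class only</article>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the whole class attribute.
exact = soup.find_all("article", class_="news-container cf")

# CSS selector: both classes required, any order, extras allowed.
both = soup.select("article.news-container.cf")

print(len(exact), len(both))
print([tag.get_text() for tag in both])
```

Neither call helps if the article element is missing from the HTML the server returns, which is the case the other answers address.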