I need help finding the correct HTML tag for the headline link URLs in my web scraper. The purpose of my scraper is to scrape headlines, stories, and links.
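When I run the scraper, the output on my Django home page looks fine, but the URL shows a 404 error message, and other articles show that I am using the wrong tag https://www.coindesk.com/news/tag/crypto-lending the correct link url is https://www.coindesk.com/news/tag/crypto-lending. The correct tag with the link is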
from bs4 import BeautifulSoup
import requests

crypto_headlines = []

def crypto_news():
    """User agent that facilitates end-user interaction with web content."""
    # The header must be a dict keyed by 'User-Agent', and it has to be
    # passed to requests.get() to take effect.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'
    }
    base_url = 'https://www.coindesk.com/news'
    source = requests.get(base_url, headers=headers).text
    soup = BeautifulSoup(source, "html.parser")
    articles = soup.find_all(class_='text-content')
    #print(len(articles))
    #print(articles)
    for article in articles:
        try:
            headline = article.h4.text.strip()
            text = article.find(class_="card-text").text.strip()
            # article.a is the first <a> inside the article block
            link = base_url + article.a['href']
            #img_url = base_url + article.image_src['src']
            crypto_dict = {}
            crypto_dict['Headline'] = headline
            crypto_dict['Text'] = text
            crypto_dict['Link'] = link
            crypto_headlines.append(crypto_dict)
        except AttributeError as ex:
            print('Error:', ex)
    print(crypto_headlines)

crypto_news()
You are scraping from the wrong <a>. You take the first <a>, but the link you need is in the second <a>.
Here is the code:
link = base_url + article.find_all("a")[1]["href"]
Just change this line and it will solve your problem!
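If the link still comes out wrong after that change, one thing worth checking is how the URL is built. This is only a sketch under assumptions: I have not verified what the href values on the page look like, and `pick_link` is a hypothetical helper name, not something from the question. If an href is root-relative (for example `/news/tag/crypto-lending`), then `base_url + href` repeats the `/news` part, while `urljoin` from the standard library resolves it against the site root. The helper also returns `None` instead of raising `IndexError` when an article block has fewer anchors than expected.

```python
from urllib.parse import urljoin

def pick_link(hrefs, base_url, index=1):
    """Return an absolute URL built from hrefs[index].

    hrefs is a list of href strings taken from one article block.
    index=1 means "the second <a>", as suggested in the answer above.
    Returns None when the block has too few anchors.
    """
    if len(hrefs) <= index:
        return None
    # urljoin resolves relative and root-relative hrefs against base_url
    return urljoin(base_url, hrefs[index])

base_url = 'https://www.coindesk.com/news'
# Hypothetical href values, used only to illustrate the difference
hrefs = ['/tag/markets', '/news/tag/crypto-lending']

print(base_url + hrefs[1])         # plain concatenation repeats '/news'
print(pick_link(hrefs, base_url))  # resolved against the site root
```

Inside the scraper loop, the list could be built with `hrefs = [a['href'] for a in article.find_all('a', href=True)]`, which skips any `<a>` that has no href attribute.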