I need help finding the correct HTML tag for headline link URLs in my web scraper. The purpose of my scraper is to scrape headlines, stories, and links

Question

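When I run the scraper, the output on my Django home page looks fine, but the scraped URLs return a 404 error, and for other articles it appears I am using the wrong tag. The scraped link is https://www.coindesk.com/news/tag/crypto-lending; the correct link URL is https://www.coindesk.com/news/tag/crypto-lending. The correct tag with the link is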

from bs4 import BeautifulSoup
import requests

crypto_headlines = []


def crypto_news():
    """Scrape headlines, story text, and links from the CoinDesk news page."""

    # Send a browser-like User-Agent; requests expects headers as a dict.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'
    }

    base_url = 'https://www.coindesk.com/news'

    source = requests.get(base_url, headers=headers).text

    soup = BeautifulSoup(source, "html.parser")       
    
    
    articles = soup.find_all(class_ = 'text-content')
    
    #print(len(articles))
    #print(articles) 

    
    for article in articles:
        
        try:
    
            headline = article.h4.text.strip()
            text = article.find(class_="card-text").text.strip()
            link = base_url + article.a['href']
            #img_url = base_url + article.image_src['src']

            crypto_dict = {}

            crypto_dict['Headline']= headline
            crypto_dict['Text'] = text
            crypto_dict['Link']= link


            crypto_headlines.append(crypto_dict)
        except AttributeError as ex:
            print('Error:', ex)

    print(crypto_headlines)

crypto_news()

Answer 1

You are scraping from the wrong <a>: you take the first <a>, but the link you need is in the second <a>.

Here is the code:

link = base_url + article.find_all("a")[1]["href"]

Changing just this line should solve your problem!
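
If the links still return 404 after picking the second <a>, one possible cause is the string concatenation itself. The question does not show the raw HTML, so this is only a sketch under the assumption that some href values are root-relative (starting with "/news/...") or already absolute. In that case base_url + href duplicates part of the path, while urljoin from the standard library resolves the href the way a browser would. The helper name absolute_link and the sample href values are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.coindesk.com/news"


def absolute_link(href, base_url=BASE_URL):
    """Resolve an href against the page URL, as a browser would."""
    return urljoin(base_url, href)


# Hypothetical href values; the real page may use a different form.
root_relative = "/news/tag/crypto-lending"
already_absolute = "https://www.coindesk.com/news/tag/crypto-lending"

# Plain concatenation repeats the "/news" segment for a root-relative href.
concatenated = BASE_URL + root_relative

# urljoin handles both forms without duplicating any path segment.
resolved_relative = absolute_link(root_relative)
resolved_absolute = absolute_link(already_absolute)

print(concatenated)
print(resolved_relative)
print(resolved_absolute)
```

In the scraper this would replace the concatenation with something like link = absolute_link(article.find_all("a")[1]["href"]), keeping the second-<a> selection from the answer.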

html

python

beautifulsoup

hyperlink

web-scraping