How to open a URL and get its content using a web crawler
I'm trying to use a web crawler to get news content from the Sport, Home, World, Business and Tech sections.
I have this code, which gets the page's headlines and URLs. How can I take the URL of a page, open it, and get its content from the body?
# python code
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.aaa.com"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

headlines = soup.find('body').find_all('h3')
for title in soup.find_all('a', href=True):
    if re.search(r"\d+$", title['href']):  # hrefs that end in digits
        print(title['href'])
You have to join the base URL to the href you extracted, then simply make the request again.
for title in soup.find_all('a', href=True):
    if re.search(r"\d+$", title['href']):
        page = requests.get('https://www.bbc.com' + title['href'])
        soup = BeautifulSoup(page.content, 'html.parser')
        print(soup.h1.text)
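As a side note, rather than concatenating strings you can build the absolute URL with urljoin from the standard library, which also handles hrefs that are already absolute. A minimal sketch (the example path in the comment is hypothetical):

from urllib.parse import urljoin

base = 'https://www.bbc.com'
# handles both relative and absolute hrefs,
# e.g. '/news/some-article-12345' -> 'https://www.bbc.com/news/some-article-12345'
full_url = urljoin(base, title['href'])
page = requests.get(full_url)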
Notes
Your regex doesn't work properly, so be careful when scraping (a stricter pattern is sketched below).
Use the time module to add some delay between requests, for example.
Some URLs are duplicated.
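For instance, if you only want links that look like article pages whose path ends in a numeric id, a stricter pattern combined with a set to drop duplicates might look like this; the exact pattern is an assumption about the site's URL scheme:

import re

article_pattern = re.compile(r"^/news/.*-\d+$")  # assumed BBC-style article path

hrefs = {a['href'] for a in soup.find_all('a', href=True)}  # set removes duplicate hrefs
for href in hrefs:
    if article_pattern.search(href):
        print(href)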
Example (with some adjustments) that prints the first 150 characters of each article:
import requests
import time
from bs4 import BeautifulSoup

baseurl = 'https://www.bbc.com'

def get_soup(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    return soup

def get_urls(url):
    # article links are anchors that wrap an <h3> headline
    urls = []
    for link in get_soup(url).select('a:has(h3)'):
        if url.split('/')[-1] in link['href']:  # keep links from the same section, e.g. /news
            urls.append(baseurl + link['href'])
    urls = list(set(urls))  # drop duplicates
    return urls

def get_news(url):
    for url in get_urls(url):
        item = get_soup(url)
        print(item.article.text[:150] + '...')
        time.sleep(2)  # small delay between requests

get_news('https://www.bbc.com/news')
Output
New Omicron variant: Does southern Africa have enough vaccines?By Rachel Schraer & Jake HortonBBC Reality CheckPublished1 day agoSharecloseShare pageC...
Ghislaine Maxwell: Epstein pilot testifies he flew Prince AndrewPublished9 minutes agoSharecloseShare pageCopy linkAbout sharingRelated TopicsJeffrey ...
New mothers who died of herpes could have been infected by one surgeonBy James Melley & Michael BuchananBBC NewsPublished22 NovemberSharecloseShare pa...
Parag Agrawal: India celebrates new Twitter CEOPublished9 hours agoSharecloseShare pageCopy linkAbout sharingImage source, TwitterImage caption, Parag...
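If article.text pulls in too much page furniture (bylines, share widgets, captions), one option is to join only the paragraph tags inside the article. A minimal sketch building on get_soup and get_urls above; get_article_text is a hypothetical helper:

def get_article_text(url, limit=150):
    # return the first `limit` characters of the article's paragraph text
    soup = get_soup(url)
    article = soup.article
    if article is None:  # not every page has an <article> tag
        return ''
    paragraphs = [p.get_text(strip=True) for p in article.find_all('p')]
    return ' '.join(paragraphs)[:limit]

for url in get_urls('https://www.bbc.com/news'):
    print(get_article_text(url) + '...')
    time.sleep(2)  # keep the polite delay between requests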