How to open a URL and get its content using a web crawler

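I am trying to use a web crawler to get news content from the Sport, Home, World, Business and Technology sections. I have the code below, which gets each page's headline and URL. How can I take each URL, open it, and get the content from the page body?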

# python code
import re

import requests
from bs4 import BeautifulSoup

url = "https://www.aaa.com"
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
headlines = soup.find('body').find_all('h3')

# Print every href that ends in digits
for title in soup.find_all('a', href=True):
    if re.search(r"\d+$", title['href']):
        print(title['href'])

You have to prepend the base URL to each href you extract, then simply make a new request for that page.

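for title in soup.find_all('a', href=True):
    if re.search(r"\d+$", title['href']):
        # Prepend the base URL to the relative href, then request the article page
        article_page = requests.get('https://www.bbc.com' + title['href'])
        article_soup = BeautifulSoup(article_page.content, 'html.parser')
        if article_soup.h1:  # skip pages without an <h1>
            print(article_soup.h1.text)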
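
Plain string concatenation only works when every href is relative. As a minimal sketch, the standard library's `urljoin` can resolve both relative and absolute hrefs. The helper name `to_absolute` and the sample paths are hypothetical, chosen only for illustration:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bbc.com"  # base URL taken from the snippet above


def to_absolute(href, base=BASE_URL):
    """Resolve href against base; an already absolute href is returned unchanged."""
    return urljoin(base, href)


# Hypothetical hrefs, for illustration only
print(to_absolute("/news/world-12345678"))
print(to_absolute("https://example.com/story"))
```

The result of `to_absolute(title['href'])` could then be passed to `requests.get` in place of the concatenated string.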
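
Notes
  • Your regex is not working properly, so be careful.

  • Scrape gently and add some delay between requests, for example with the time module.

  • Some of the URLs are duplicates.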
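
The three notes above can be sketched with the standard library alone. This is only one possible approach: the pattern `-\d+$` assumes article paths end in a hyphen followed by digits, which may not match the real site, and the names `unique_article_urls`, `visit_politely`, and the `fetch` callback are hypothetical.

```python
import re
import time
from urllib.parse import urljoin

# Assumption: article paths end in "-<digits>"; adjust to the site's real URL scheme
ARTICLE_ID = re.compile(r"-\d+$")


def unique_article_urls(hrefs, base="https://www.bbc.com"):
    """Keep hrefs matching ARTICLE_ID, make them absolute, drop duplicates in order."""
    seen = set()
    urls = []
    for href in hrefs:
        if not ARTICLE_ID.search(href):
            continue
        full = urljoin(base, href)
        if full not in seen:
            seen.add(full)
            urls.append(full)
    return urls


def visit_politely(urls, fetch, delay=2.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index:  # no need to wait before the first request
            time.sleep(delay)
        results.append(fetch(url))
    return results
```

Here `fetch` would be whatever function downloads and parses one page, for example a wrapper around `requests.get`. Unlike `list(set(urls))`, the `seen` set keeps the links in the order they appear on the page.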

Example (with some adjustments)

This prints the first 150 characters of each article:

import time

import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.bbc.com'

def get_soup(url):
    # Download the page and parse it
    page = requests.get(url)
    return BeautifulSoup(page.text, 'html.parser')

def get_urls(url):
    # Collect links that wrap an <h3> headline and contain the section name (e.g. "news")
    section = url.split('/')[-1]
    urls = set()  # a set drops duplicate links
    for link in get_soup(url).select('a:has(h3)'):
        href = link.get('href', '')
        if section in href:
            urls.add(baseurl + href)
    return list(urls)

def get_news(url):
    for article_url in get_urls(url):
        item = get_soup(article_url)
        if item.article:  # some pages have no <article> element
            print(item.article.text[:150] + '...')
        time.sleep(2)  # pause between requests

get_news('https://www.bbc.com/news')

Output

New Omicron variant: Does southern Africa have enough vaccines?By Rachel Schraer & Jake HortonBBC Reality CheckPublished1 day agoSharecloseShare pageC...
Ghislaine Maxwell: Epstein pilot testifies he flew Prince AndrewPublished9 minutes agoSharecloseShare pageCopy linkAbout sharingRelated TopicsJeffrey ...
New mothers who died of herpes could have been infected by one surgeonBy James Melley & Michael BuchananBBC NewsPublished22 NovemberSharecloseShare pa...
Parag Agrawal: India celebrates new Twitter CEOPublished9 hours agoSharecloseShare pageCopy linkAbout sharingImage source, TwitterImage caption, Parag...