How to open a URL and get its content using a web crawler
I'm trying to use a web crawler to get news content from the Sport, Home, World, Business and Tech sections.
I have this code, which gets the page's headlines and URLs. How can I take the URL of a page, open it, and get its content from the body?
# python code
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.aaa.com"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

headlines = soup.find('body').find_all('h3')
for title in soup.find_all('a', href=True):
    if re.search(r"\d+$", title['href']):  # hrefs that end in digits
        print(title['href'])
You have to join the base URL to the href you extracted, then simply make the request again.
for title in soup.find_all('a', href=True):
    if re.search(r"\d+$", title['href']):
        page = requests.get('https://www.bbc.com' + title['href'])
        soup = BeautifulSoup(page.content, 'html.parser')
        print(soup.h1.text)
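As a side note, rather than concatenating strings you can build the absolute URL with urljoin from the standard library, which also handles hrefs that are already absolute. A minimal sketch (the example path in the comment is hypothetical):

from urllib.parse import urljoin

base = 'https://www.bbc.com'
# handles both relative and absolute hrefs,
# e.g. '/news/some-article-12345' -> 'https://www.bbc.com/news/some-article-12345'
full_url = urljoin(base, title['href'])
page = requests.get(full_url)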
Notes
Your regex doesn't work properly, so be careful when scraping (a stricter pattern is sketched below).
Use the time module to add some delay between requests, for example.
Some URLs are duplicated.
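For instance, if you only want links that look like article pages whose path ends in a numeric id, a stricter pattern combined with a set to drop duplicates might look like this; the exact pattern is an assumption about the site's URL scheme:

import re

article_pattern = re.compile(r"^/news/.*-\d+$")  # assumed BBC-style article path

hrefs = {a['href'] for a in soup.find_all('a', href=True)}  # set removes duplicate hrefs
for href in hrefs:
    if article_pattern.search(href):
        print(href)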
Example (with some adjustments) that prints the first 150 characters of each article:
import requests
import time
from bs4 import BeautifulSoup

baseurl = 'https://www.bbc.com'

def get_soup(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    return soup

def get_urls(url):
    # article links are anchors that wrap an <h3> headline
    urls = []
    for link in get_soup(url).select('a:has(h3)'):
        if url.split('/')[-1] in link['href']:  # keep links from the same section, e.g. /news
            urls.append(baseurl + link['href'])
    urls = list(set(urls))  # drop duplicates
    return urls

def get_news(url):
    for url in get_urls(url):
        item = get_soup(url)
        print(item.article.text[:150] + '...')
        time.sleep(2)  # small delay between requests

get_news('https://www.bbc.com/news')
Output
New Omicron variant: Does southern Africa have enough vaccines?By Rachel Schraer & Jake HortonBBC Reality CheckPublished1 day agoSharecloseShare pageC...
Ghislaine Maxwell: Epstein pilot testifies he flew Prince AndrewPublished9 minutes agoSharecloseShare pageCopy linkAbout sharingRelated TopicsJeffrey ...
New mothers who died of herpes could have been infected by one surgeonBy James Melley & Michael BuchananBBC NewsPublished22 NovemberSharecloseShare pa...
Parag Agrawal: India celebrates new Twitter CEOPublished9 hours agoSharecloseShare pageCopy linkAbout sharingImage source, TwitterImage caption, Parag...
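If article.text pulls in too much page furniture (bylines, share widgets, captions), one option is to join only the paragraph tags inside the article. A minimal sketch building on get_soup and get_urls above; get_article_text is a hypothetical helper:

def get_article_text(url, limit=150):
    # return the first `limit` characters of the article's paragraph text
    soup = get_soup(url)
    article = soup.article
    if article is None:  # not every page has an <article> tag
        return ''
    paragraphs = [p.get_text(strip=True) for p in article.find_all('p')]
    return ' '.join(paragraphs)[:limit]

for url in get_urls('https://www.bbc.com/news'):
    print(get_article_text(url) + '...')
    time.sleep(2)  # keep the polite delay between requests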