Scraping the links from an isolated part of a Wikipedia article
So I'm trying to create a scraper that isolates the references section of a page and then scrapes the title and the first paragraph (or similar) from each referenced webpage.
At the moment I have it so that it can isolate the references section, but I'm not sure how to go about 'entering' the other links.
Here is my code so far:
import requests
from bs4 import BeautifulSoup

def customScrape(e1, master):
    session = requests.Session()
    # selectWikiPage = input("Please enter the Wikipedia page you wish to scrape from")
    selectWikiPage = e1.get()
    if "wikipedia" in selectWikiPage:  # turn this into a re
        html = session.get(selectWikiPage)  # GET rather than POST: we are only fetching the page
        bsObj = BeautifulSoup(html.text, "html.parser")
        findReferences = bsObj.find('ol', {'class': 'references'})  # isolate references section of page
        links = [a["href"] for a in findReferences.find_all("a", href=True)]
        for link in links:
            print("Link: " + link)
    else:
        print("Error: Please enter a valid Wikipedia URL")
In your customScrape function, you can do this for each link:

ref_html = requests.get(link).text

to get the full text from link (you don't need a Session unless you want to keep cookies and other state across the follow-up requests). You can then parse ref_html to find the title, the first heading, or whatever else you like.
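As an aside, here is a minimal sketch of what the Session variant would look like if you did want cookies (and the pooled connection) reused across requests; the URLs are hypothetical placeholders:

import requests

session = requests.Session()
# Cookies set by earlier responses are sent automatically on later requests,
# and the underlying TCP connection is pooled and reused.
for url in ['https://example.com/a', 'https://example.com/b']:  # hypothetical URLs
    html = session.get(url).text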
Otherwise, your function might look like this:
import requests, requests.exceptions
from bs4 import BeautifulSoup

def custom_scrape(wikipedia_url):
    wikipedia_html = requests.get(wikipedia_url).text
    refs = BeautifulSoup(wikipedia_html, 'html.parser').find('ol', {
        'class': 'references'
    })
    refs = refs.select('a[href^="http"]')  # keep only external links; relative hrefs can't be fetched
    for ref in refs:
        try:
            ref_html = requests.get(ref['href']).text
            soup = BeautifulSoup(ref_html, 'html.parser')
            title = soup.select('title')
            title = title[0].text if title else ''
            heading = soup.select('h1')
            heading = heading[0].text if heading else ''
        except requests.exceptions.RequestException as e:
            print(ref['href'], e)  # some refs may contain invalid urls
            title = heading = ''
        yield title.strip(), heading.strip()  # strip whitespace
Then you can look at the results:
for title, heading in custom_scrape('https://en.wikipedia.org/wiki/Stack_Overflow'):
    print(title, heading)
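Since the original goal was the title and the first paragraph, you could extend the loop body in custom_scrape along the same lines. A rough sketch, with the caveat that treating the first <p> tag as "the first paragraph" is a naive heuristic (many sites put cookie banners or other boilerplate in early <p> tags):

import requests
from bs4 import BeautifulSoup

def first_paragraph(url):
    # Naive heuristic: take the first <p> on the page as the first paragraph.
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    p = soup.find('p')
    return p.get_text().strip() if p else ''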