Problem in fetching long URLs using BeautifulSoup
I'm trying to fetch URLs from a web page. Here is what a URL looks like in the browser's "Inspect" panel:
Here is what the same URL looks like in my Python code:
How can I get the actual URL, without the ../../ part, using BeautifulSoup?
Here is my code, just in case:
import re
import requests
from bs4 import BeautifulSoup

source = requests.get('https://books.toscrape.com/catalogue/category/books_1/index.html').text
soup = BeautifulSoup(source, 'lxml')

# article = soup.find('article')
# title = article.div.a.img['alt']
# print(title['alt'])

titles, topics, urls, sources = [], [], [], []
article_productPod = soup.find_all('article', {"class": "product_pod"})

for i in article_productPod:
    titles.append(i.div.a.img['alt'])
# print(titles)

for q in article_productPod:
    urls.append(q.h3.a['href'])

print(urls[0])

# for z in range(len(urls)):
#     source2 = requests.get("https://" + urls[z])
Use urllib:

import urllib.parse
Store the target URL in a separate variable:
src_url = r'https://books.toscrape.com/catalogue/category/books_1/index.html'
source = requests.get(src_url).text
Join the site's URL with each relative URL:

for q in article_productPod:
    urls.append(urllib.parse.urljoin(src_url, q.h3.a['href']))
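For a sense of what `urljoin` does with the `../../` segments, here is a minimal, self-contained sketch. The `rel` value is a hypothetical example of the kind of `href` the page serves, not output copied from the site:

```python
from urllib.parse import urljoin

base = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
# Hypothetical relative href of the kind found in the product listing:
rel = '../../a-light-in-the-attic_1000/index.html'

# urljoin drops 'index.html' from the base, then resolves each '..'
# against the remaining path before appending the rest of the href.
full = urljoin(base, rel)
print(full)
# → https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```

This is also why the commented-out `"https://" + urls[z]` approach fails: simple string concatenation cannot collapse the `../../` segments, while `urljoin` resolves them per the standard URL rules.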