How to handle elements missing from certain pages when scraping with BeautifulSoup
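I need to scrape the markup below from a series of product pages, then split it so that the author and the illustrator are shown separately.
The problem is:
Some pages have both an author <li> and an illustrator <li>, as on page 1.
Some pages have only an author <li>, as on page 2.
Some pages have neither an author nor an illustrator, so there is no <ul> at all, as on page 3.
The only way to know whether an <li> is for the illustrator is whether the <li> contains the text "(Illustreerder)".
How do I assign default values to author and illustrator when they are empty?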
<ul class="product-brands">
<li class="brand-item">
<a href="https://lapa.co.za/Skrywer/zinelda-mcdonald-illustreerder.html" title="Zinelda McDonald (Illustreerder)">Zinelda McDonald (Illustreerder)</a>
</li>
<li class="brand-item">
<a href="https://lapa.co.za/Skrywer/jose-reinette-palmer.html" title="Jose Palmer & Reinette Lombard">Jose Palmer & Reinette Lombard</a>
</li>
</ul>
from bs4 import BeautifulSoup
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
}
# AUTHOR & ILLUSTRATOR
page1 = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-r-grootboek-10-tippie-help-vir-frikkie'
# AUTHOR ONLY
page2 = 'https://lapa.co.za/catalog/product/view/id/1649/s/hoendervleis-grillerige-stories-en-rympies/category/84/'
# NO AUTHOR and NO ILLUSTRATOR
page3 = 'https://lapa.co.za/catalog/product/view/id/1633/s/sanri-steyn-7-vampiere-van-vlermuishoogte/category/84/'
# PAGE WITH NO STOCK
page4 = 'https://lapa.co.za/kinder-en-tienerboeke/my-groot-lofkleuterbybel-2-oudiomusiek'
illustrator = '(Illustreerder)'
productlist = []
r = requests.get(page2, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
isbn = soup.find('div', class_='value', itemprop='sku').text.replace(" ", "")
stocks = soup.find('div', class_='stock available')
if stocks is not None:
    stock = stocks.text.strip()
if stocks is None:
    stock = 'n/a'
for ultag in soup.find_all('ul', {'class': 'product-brands'}):
    for litag in ultag.find_all('li'):
        author = litag.text.strip() or 'None'
        if illustrator not in author:
            author = author
for ultag in soup.find_all('ul', {'class': 'product-brands'}):
    for litag in ultag.find_all('li'):
        author = litag.text.strip()
        if illustrator in author:
            illustrator = author
bookdata = [isbn, stock, author, illustrator]
print(bookdata)
Expected output:
r = requests.get(page1, headers=headers)
['9781776356515', 'In voorraad', 'Jose Palmer & Reinette Lombard', 'Zinelda McDonald']
Expected output:
r = requests.get(page2, headers=headers)
['9780799383874', 'In voorraad', 'Jaco Jacobs', 'None']
Expected output:
r = requests.get(page3, headers=headers)
['9780799383690', 'In voorraad', 'None', 'None']
You can do it this way.
First, select the <ul> you need using find():
ul = soup.find('ul', class_='product-brands')
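Now check whether the <ul> exists. If it does, the page has at least an author, an illustrator, or both. In that case, collect the strings inside the <li> tags of the <ul> and return them as a list. You can use .stripped_strings to get every string inside a tag with the surrounding whitespace removed. If the <ul> does not exist, just return None.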
if ul:
    return list(ul.stripped_strings)
return None
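The check above can be tried offline before pointing it at the real site. The following is a minimal sketch, assuming two hand-written HTML fragments (WITH_BRANDS and WITHOUT_BRANDS are hypothetical stand-ins for the product pages) and the built-in 'html.parser' instead of 'lxml' so that nothing extra needs installing. The helper name brand_names is also an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments standing in for the real product pages.
WITH_BRANDS = """
<ul class="product-brands">
  <li class="brand-item"><a href="#">Zinelda McDonald (Illustreerder)</a></li>
  <li class="brand-item"><a href="#">Jose Palmer &amp; Reinette Lombard</a></li>
</ul>
"""
WITHOUT_BRANDS = "<div class='product-info'>No brand list on this page</div>"


def brand_names(html):
    """Return the stripped strings inside ul.product-brands, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find("ul", class_="product-brands")
    # find() returns None when nothing matches, so test for that explicitly.
    if ul is None:
        return None
    return list(ul.stripped_strings)


print(brand_names(WITH_BRANDS))
print(brand_names(WITHOUT_BRANDS))
```

The first call should give a two-item list and the second should give None, which is the signal to fall back to default values.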
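From the number of items in the returned list, I think it is easy to work out which of the cases from your question applies:
The only way to know whether the <li> is for illustrator, is if the <li> contains the text "(Illustreerder)".
Here is the code that returns the author and the illustrator when they exist, and None otherwise.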
import requests
from bs4 import BeautifulSoup
# AUTHOR & ILLUSTRATOR
page1 = 'https://lapa.co.za/kinder-en-tienerboeke/leer-my-lees-vlak-r-grootboek-10-tippie-help-vir-frikkie'
# AUTHOR ONLY
page2 = 'https://lapa.co.za/catalog/product/view/id/1649/s/hoendervleis-grillerige-stories-en-rympies/category/84/'
# NO AUTHOR and NO ILLUSTRATOR
page3 = 'https://lapa.co.za/catalog/product/view/id/1633/s/sanri-steyn-7-vampiere-van-vlermuishoogte/category/84/'
# PAGE WITH NO STOCK
page4 = 'https://lapa.co.za/kinder-en-tienerboeke/my-groot-lofkleuterbybel-2-oudiomusiek'
def test(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
    }
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    ul = soup.find('ul', class_='product-brands')
    # Default values for author and illustrator
    author, illustrator = None, None
    # Collect the names only if the <ul> exists
    if ul:
        details = list(ul.stripped_strings)
        # Assign each name to "author" or "illustrator"
        for name in details:
            if name.endswith('(Illustreerder)'):
                illustrator = name
            else:
                author = name
    return (author, illustrator)

# Iterate over the pages and call test() to get the author and illustrator names
for page in [page1, page2, page3, page4]:
    author, illustrator = test(page)
    print(f'Authors: {author}\nIllustrators: {illustrator}\n')
Now the author and illustrator names are separated and stored in their own variables for each page.
Authors: Jose Palmer & Reinette Lombard
Illustrators: Zinelda McDonald (Illustreerder)
Authors: Jaco Jacobs
Illustrators: None
Authors: None
Illustrators: None
Authors: Jan de Wet
Illustrators: None
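The question's expected output shows the illustrator without the "(Illustreerder)" suffix and uses the string 'None' as the default. A possible variation that does both is sketched below. It is only a sketch under stated assumptions: split_credits, MARKER and the three PAGE_* fragments are hypothetical names and hand-written HTML standing in for the real pages, and 'html.parser' is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

MARKER = "(Illustreerder)"

# Hypothetical fragments mimicking the three cases from the question.
PAGE_BOTH = """
<ul class="product-brands">
  <li class="brand-item"><a href="#">Zinelda McDonald (Illustreerder)</a></li>
  <li class="brand-item"><a href="#">Jose Palmer &amp; Reinette Lombard</a></li>
</ul>
"""
PAGE_AUTHOR_ONLY = """
<ul class="product-brands">
  <li class="brand-item"><a href="#">Jaco Jacobs</a></li>
</ul>
"""
PAGE_NEITHER = "<div class='product-info'></div>"


def split_credits(html, default="None"):
    """Return (author, illustrator), using `default` for whichever is missing."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find("ul", class_="product-brands")
    author, illustrator = default, default
    if ul is not None:
        for li in ul.find_all("li"):
            name = li.get_text(strip=True)
            if not name:
                continue  # skip empty <li> elements
            if name.endswith(MARKER):
                # Drop the suffix so only the person's name remains.
                illustrator = name[: -len(MARKER)].strip()
            else:
                author = name
    return author, illustrator


for html in (PAGE_BOTH, PAGE_AUTHOR_ONLY, PAGE_NEITHER):
    print(split_credits(html))
```

Passing default=None instead keeps the behaviour of the original answer, where missing values are the None object and not a string.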