Only items from the first Beautiful Soup object are being added to my lists
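I suspect this isn't very complicated, but I can't see it. I'm using Selenium and Beautiful Soup to parse Petango.com. The data will be used to help a local shelter understand how it compares with other area shelters on different metrics, so the next step is to take these data frames and do some analysis.
I grab the detail URLs from a different module and import the list here.
My issue is that my lists only show the values from the HTML of the first dog. I stepped through and noticed that the len of the soup differs on each iteration, so I know my error is somewhere after that point, but I can't figure out how to fix it.
Here is my code so far (it can run the whole process or use a cached page):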
```python
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
from Petango import pet_links
headings = []
values = []
ShelterInfo = []
ShelterInfoWebsite = []
ShelterInfoEmail = []
ShelterInfoPhone = []
ShelterInfoAddress = []
Breed = []
Age = []
Color = []
SpayedNeutered = []
Size = []
Declawed = []
AdoptionDate = []
# to access sites, change url list to pet_links (break out as needed) and change if false to true. false looks to the html file
url_list = (pet_links[4], pet_links[6], pet_links[8])
#url_list = ("Petango.html", "Petango.html", "Petango.html")
for link in url_list:
    page_source = None
    if True:
        #pet page = link should populate links from above, hard code link was for 1 detail page, =to hemtl was for cached site
        PetPage = link
        #PetPage = 'https://www.petango.com/Adopt/Dog-Terrier-American-Pit-Bull-45569732'
        #PetPage = Petango.html
        PetDriver = webdriver.Chrome(executable_path='/Users/paulcarson/Downloads/chromedriver')
        PetDriver.implicitly_wait(30)
        PetDriver.get(link)
        page_source = PetDriver.page_source
        PetDriver.close()
    else:
        with open("Petango.html",'r') as f:
            page_source = f.read()
    PetSoup = BeautifulSoup(page_source, 'html.parser')
    print(len(PetSoup.text))
    #get the details about the shelter and add to lists
    ShelterInfo.append(PetSoup.find('div', class_ = "DNNModuleContent ModPethealthPetangoDnnModulesShelterShortInfoC").find('h4').text)
    ShelterInfoParagraphs = PetSoup.find('div', class_ = "DNNModuleContent ModPethealthPetangoDnnModulesShelterShortInfoC").find_all('p')
    First_Paragraph = ShelterInfoParagraphs[0]
    if "Website" not in First_Paragraph.text:
        raise AssertionError("first paragraph is not about site")
    ShelterInfoWebsite.append(First_Paragraph.find('a').text)
    Second_Paragraph = ShelterInfoParagraphs[1]
    ShelterInfoEmail.append(Second_Paragraph.find('a')['href'])
    Third_Paragraph = ShelterInfoParagraphs[2]
    ShelterInfoPhone.append(Third_Paragraph.find('span').text)
    Fourth_Paragraph = ShelterInfoParagraphs[3]
    ShelterInfoAddress.append(Fourth_Paragraph.find('span').text)
    #get the details about the pet
    ul = PetSoup.find('div', class_='group details-list').ul # Gets the ul tag
    li_items = ul.find_all('li') # Finds all the li tags within the ul tag
    for li in li_items:
        heading = li.strong.text
        headings.append(heading)
        value = li.span.text
        if value:
            values.append(value)
        else:
            values.append(None)
    Breed.append(values[0])
    Age.append(values[1])
    print(Age)
    Color.append(values[2])
    SpayedNeutered.append(values[3])
    Size.append(values[4])
    Declawed.append(values[5])
    AdoptionDate.append(values[6])
ShelterDF = pd.DataFrame(
    {
        'Shelter': ShelterInfo,
        'Shelter Website': ShelterInfoWebsite,
        'Shelter Email': ShelterInfoEmail,
        'Shelter Phone Number': ShelterInfoPhone,
        'Shelter Address': ShelterInfoAddress
    })
PetDF = pd.DataFrame(
    {'Breed': Breed,
     'Age': Age,
     'Color': Color,
     'Spayed/Neutered': SpayedNeutered,
     'Size': Size,
     'Declawed': Declawed,
     'Adoption Date': AdoptionDate
    })
print(PetDF)
print(ShelterDF)
```
Output from printing the len and the Age list as the loop progresses:
12783
['6y 7m']
10687
['6y 7m', '6y 7m']
10705
['6y 7m', '6y 7m', '6y 7m']
Can anyone point me in the right direction?
Thanks for your help!
Paul
You need to change the find method to find_all() in BeautifulSoup so that it locates all matching elements rather than only the first one.
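For reference, here is a minimal sketch of how find and find_all differ. The HTML string is made up for illustration and is not the real Petango markup. Note that the code in the question already calls find_all on the li elements, so this difference alone may not explain the repeated values.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a details list (an assumption,
# not copied from Petango).
html = """
<ul>
  <li><strong>Breed</strong><span>Terrier</span></li>
  <li><strong>Age</strong><span>6y 7m</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")      # a single Tag, or None if nothing matches
all_li = soup.find_all("li")    # a list-like ResultSet with every match

first_text = first_li.span.text
all_texts = [li.span.text for li in all_li]

print(first_text)
print(all_texts)
```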
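values is global and is never reset between pages, so a fixed index into it always points at the first dog's data. That is why you keep appending the same value to Age:
Age.append(values[1])
Your other global lists have the same problem (whether the static index is 1, 2, and so on).
You need a way to track the appropriate index, perhaps with a counter, or some other logic that ensures the current value is added. For example, for the current age: is it the second li in the loop? Or simply append PetSoup.select_one("[data-bind='text: age']").text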
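The effect of the shared list can be reproduced without any scraping. The sketch below uses made-up per-page values (an assumption standing in for the li spans of three detail pages) and compares a list that is created once with one that is recreated for each page. Resetting the list per page is one possible fix, alongside the counter idea mentioned above.

```python
# Hypothetical per-page values standing in for the <li> spans of three pages.
pages = [
    ["Terrier", "6y 7m", "Brown"],
    ["Beagle", "2y 1m", "White"],
    ["Boxer", "4y 0m", "Black"],
]


def collect_ages_shared(pages):
    """Reproduce the problem: one list shared across all pages."""
    values = []  # created once and never cleared
    ages = []
    for page in pages:
        values.extend(page)     # new entries are added at the end
        ages.append(values[1])  # index 1 is still the first page's age
    return ages


def collect_ages_per_page(pages):
    """One possible fix: start with a fresh list for every page."""
    ages = []
    for page in pages:
        values = list(page)     # fresh list on each iteration
        ages.append(values[1])
    return ages


shared = collect_ages_shared(pages)
per_page = collect_ages_per_page(pages)
print(shared)
print(per_page)
```

In the question's code, the equivalent change would be to assign `values = []` inside the `for link in url_list:` loop, before the loop over `li_items`.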
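It looks like each item of interest (e.g. colour, spayed) carries a data-bind attribute, so you can use an attribute selector with the appropriate attribute value to select each value directly and avoid looping over the li elements.
For example: current_colour = PetSoup.select_one("[data-bind='text: color']").text
It is better to assign the result to a variable and test that it is not None before accessing .text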
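A minimal sketch of that approach follows. The helper name text_or_none and the HTML string are assumptions made for illustration. Only the 'text: age' and 'text: color' attribute values come from the selectors above, and 'text: size' is used here purely to show the missing-element case.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, data_bind):
    """Return the stripped text of the first element whose data-bind
    attribute equals data_bind, or None if there is no such element."""
    node = soup.select_one(f'[data-bind="{data_bind}"]')
    return node.get_text(strip=True) if node is not None else None


# Hypothetical markup (an assumption, not copied from the real page).
html = """
<ul>
  <li><strong>Age</strong><span data-bind="text: age">6y 7m</span></li>
  <li><strong>Color</strong><span data-bind="text: color">Brown</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

age = text_or_none(soup, "text: age")
color = text_or_none(soup, "text: color")
size = text_or_none(soup, "text: size")  # not present in this markup

print(age, color, size)
```

Calling the helper once per field inside the page loop yields one value per pet, and a missing field becomes None instead of raising AttributeError.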