requests-html: find td by content
So I'm trying to use requests-html to scrape this table:
<table class="pet-listing__list rescue-details">
<tbody>
<tr>
<td>Rescue group:</td>
<td><a href="/groups/10282/Dog-Rescue-Newcastle">Dog Rescue Newcastle</a></td>
</tr>
<tr>
<td>PetRescue ID:</td>
<td>802283</td>
</tr>
<tr>
<td>Location:</td>
<td>Toronto, NSW</td>
</tr>
<tr>
<td class="first age">Age:</td>
<td class="first age">1 year 2 months</td>
</tr>
<tr>
<td class="adoption_fee">Adoption fee:</td>
<td class="adoption_fee">0.00</td>
</tr>
<tr>
<td class="desexed">Desexed:</td>
<td class="desexed"><span class="boolean-image-true boolean-image-yes">Yes</span></td>
</tr>
<tr>
<td class="vaccinated">Vaccinated:</td>
<td class="vaccinated"><span class="boolean-image-true boolean-image-yes">Yes</span></td>
</tr>
<tr>
<td class="wormed">Wormed:</td>
<td class="wormed"><span class="boolean-image-true boolean-image-yes">Yes</span></td>
</tr>
<tr>
<td class="microchip_number">Microchip number:</td>
<td class="microchip_number">OnFile</td>
</tr>
<tr>
<td class="rehoming_organisation_id">Rehoming organisation:</td>
<td class="rehoming_organisation_id">R251000026</td>
</tr>
</tbody>
</table>
The documentation doesn't seem to mention a way to find the next td — for example, if I want to scrape the dog's rescue group or location. Is there a way to grab those table cells using requests-html alone, or does this additionally require parsing with, e.g., bs4/lxml?
Code so far (it returns an error, because requests-html's HTML.find has no text attribute like bs4's does):
class PetBarnCrawler(DogCrawler):
    """Looks for dogs on Petbarn"""

    def __init__(self, url="https://www.petrescue.com.au/listings/search/dogs"):
        super(PetBarnCrawler, self).__init__(url)

    def _get_dogs(self, **kwargs):
        """Get listing of all dogs"""
        for html in self.current_page.html:
            # grab all the dogs on the page
            dog_previews = html.find("article.cards-listings-preview")
            for preview in dog_previews:
                new_session = HTMLSession()
                page_link = preview.find("a.cards-listings-preview__content")[0].attrs["href"]
                dog_page = new_session.get(page_link)
                # populate the dictionary with all the parameters of interest
                this_dog = {
                    "id": os.path.split(urllib.parse.urlparse(dog_page.url).path)[1],
                    "url": page_link,
                    "name": dog_page.html.find(".pet-listing__content__name"),
                    "breed": dog_page.html.find(".pet-listing__content__breed"),
                    "age": dog_page.html.find("td.age")[1],
                    "price": dog_page.html.find("td.adoption_fee")[1],
                    "desexed": dog_page.html.find("td.desexed")[1],
                    "vaccinated": dog_page.html.find("td.vaccinated")[1],
                    "wormed": dog_page.html.find("td.wormed")[1],
                    "feature": dog_page.html.find(".pet-listing__content__feature"),
                    # these two lines raise the error: requests-html's find()
                    # does not accept a text= keyword the way bs4's find() does
                    "rescue_group": dog_page.html.find("td", text="Rescue group:").find_next("td"),
                    "rehoming_organisation_id": dog_page.html.find("td.rehoming_organisation_id")[1],
                    "location": dog_page.html.find("td", text="Location:").find_next("td"),
                    "description": dog_page.html.find(".personality"),
                    "medical_notes": dog_page.html.find("."),
                    "adoption_process": dog_page.html.find(".adoption_process"),
                }
                self.dogs.append(this_dog)
                new_session.close()
Something like this should solve your problem:
tr = table.findAll(['tr'])[3]
The [3] specifies the position.
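For reference, the positional idea can be sketched with the standard library alone. The snippet below is a trimmed copy of the question's table, and the row index is an assumption tied to that fixed row order:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's table (well-formed, so the stdlib parser handles it)
TABLE = """<table>
<tbody>
<tr><td>Rescue group:</td><td>Dog Rescue Newcastle</td></tr>
<tr><td>PetRescue ID:</td><td>802283</td></tr>
<tr><td>Location:</td><td>Toronto, NSW</td></tr>
</tbody>
</table>"""

rows = ET.fromstring(TABLE).findall(".//tr")
# Index 2 picks the "Location:" row in this sample -- brittle if rows move around
label_td, value_td = rows[2].findall("td")
print(label_td.text, value_td.text)  # Location: Toronto, NSW
```

Indexing by position works only while the page keeps the same row order, which is why matching on the cell's text (as in the accepted answer) is the more robust choice.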
## Update: 09/25
After looking at the site further and inspecting the tags, the location details you are after are stored in the 'cards-listings-preview__content__section__location' tag.
This code let me scrape the location details from the site:
location = soup.find_all('strong', attrs={'class':'cards-listings-preview__content__section__location'})
It turns out I didn't read the documentation carefully enough.
Using the XPath query feature in requests-html is sufficient; there is no need to use a library like bs4 or lxml to traverse the document tree:
{
...
"location": dog_page.html.xpath("//tr[td='Location:']/td[2]")[0].text,
...
}
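To check the query offline: requests-html delegates XPath to lxml, but the same content-based predicate also works in the standard library's ElementTree on this well-formed markup. A minimal sketch, using a trimmed copy of the question's table:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's table
TABLE = """<table>
<tbody>
<tr><td>Rescue group:</td><td>Dog Rescue Newcastle</td></tr>
<tr><td>Location:</td><td>Toronto, NSW</td></tr>
</tbody>
</table>"""

root = ET.fromstring(TABLE)
# Select the <tr> whose <td> reads "Location:", then take its second <td> --
# the same shape as the //tr[td='Location:']/td[2] query above
location = root.find(".//tr[td='Location:']/td[2]").text
print(location)  # Toronto, NSW
```

Matching on the label text rather than a row index keeps the query working even if the site reorders the table rows.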
Cf. this post: XPath:: Get following Sibling