How to find nested HTML elements using lxml in Python?
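I am trying to scrape the HTML below.
There are multiple divs with class="review-card".
Each div always contains a script element with data-initial-state="data-always-exist" and sometimes contains a script element with data-initial-state="data-may-not-exist".
I want to retrieve the data from both script elements. When the second one does not exist, I want to return a specific value, e.g. 0.
As you can see in the code below, I have managed to find the "review-card" div elements. However, I cannot retrieve the script elements inside each individual div. My code always returns a list rather than a single element. What am I doing wrong?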
<html>
<body>
<main>
<div class="review-list">
<div class="review-card">
<article class="review">
<script type="application.json" data-initial-state="data-always-exist">
{"reviewBody":"Brilliant value","stars":5}
</script>
<section class="review__content">
<div class="content">
<script type="application.json" data-initial-state="data-may-not-exist">
{"isVerified":true,"verificationSource":"invitation"}
</script>
</div>
</section>
</article>
</div>
<div class="review-card">
<article class="review">
<script type="application.json" data-initial-state="data-always-exist">
{"reviewBody":"Brilliant value","stars":5}
</script>
</article>
</div>
<div class="review-card">
<article class="review">
<script type="application.json" data-initial-state="data-always-exist">
{"reviewBody":"Great","stars":4}
</script>
<section class="review__content">
<div class="content">
<script type="application.json" data-initial-state="data-may-not-exist">
{"isVerified":false,"verificationSource":"invitation"}
</script>
</div>
</section>
</article>
</div>
</div>
</main>
</body>
</html>
I tried the following:
from lxml import html
import requests
page = requests.get('http://somewebsite.com')
tree = html.fromstring(page.content)
# find the review list
review_list = tree.xpath('//div[@class="review-list"]')
# find all the review cards
review_cards = review_list[0].xpath('//div[contains(@class,"review-card")]')
for card in review_cards:
    # this part of the code does not work as intended - returns a list vs a single item.
    data_always_exist = card.xpath("//script[starts-with(@data-initial-state, 'data-always-exist')]")
    data_not_always_exist = card.xpath("//script[starts-with(@data-initial-state, 'data-may-not-exist')]")
A solution using BeautifulSoup:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("http://somewebsite.com").content, "lxml")
for card in soup.select(".review-card"):
    print("data-always-exist:")
    d = card.select_one('[data-initial-state="data-always-exist"]')
    if d:
        print(d.contents[0].strip())
    print("data-may-not-exist:")
    d = card.select_one('[data-initial-state="data-may-not-exist"]')
    if d:
        print(d.contents[0].strip())
    print("-" * 80)
Prints:
data-always-exist:
{"reviewBody":"Brilliant value","stars":5}
data-may-not-exist:
{"isVerified":true,"verificationSource":"invitation"}
--------------------------------------------------------------------------------
data-always-exist:
{"reviewBody":"Brilliant value","stars":5}
data-may-not-exist:
--------------------------------------------------------------------------------
data-always-exist:
{"reviewBody":"Great","stars":4}
data-may-not-exist:
{"isVerified":false,"verificationSource":"invitation"}
--------------------------------------------------------------------------------
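The question also asks for a fallback value when the optional script is missing. The following is only a minimal sketch of that idea, assuming the script bodies are valid JSON and using a short inline HTML string in place of the real page. The helper name script_json, the SAMPLE string, and the default of 0 are illustrative choices, not part of the original answer.

```python
import json
from bs4 import BeautifulSoup

# Inline stand-in for the real page (assumption: same structure as the question's HTML).
SAMPLE = """
<div class="review-list">
  <div class="review-card">
    <script type="application.json" data-initial-state="data-always-exist">
      {"reviewBody":"Brilliant value","stars":5}
    </script>
    <div class="content">
      <script type="application.json" data-initial-state="data-may-not-exist">
        {"isVerified":true,"verificationSource":"invitation"}
      </script>
    </div>
  </div>
  <div class="review-card">
    <script type="application.json" data-initial-state="data-always-exist">
      {"reviewBody":"Great","stars":4}
    </script>
  </div>
</div>
"""


def script_json(card, state, default=0):
    """Parse the JSON inside the card's script with this data-initial-state.

    Returns `default` when the script is absent or has no text.
    """
    tag = card.select_one(f'script[data-initial-state="{state}"]')
    if tag is None or tag.string is None:
        return default
    return json.loads(tag.string)


soup = BeautifulSoup(SAMPLE, "html.parser")
results = [
    (script_json(card, "data-always-exist"), script_json(card, "data-may-not-exist"))
    for card in soup.select(".review-card")
]
for always, maybe in results:
    print(always, maybe)
```

Because select_one returns None when nothing matches, the missing case is a single check. The second card in this sample yields 0 for the optional value.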
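A version with lxml. An XPath expression that starts with // searches the whole document even when it is called on an element, so start it with a dot (.//) to search only inside the current card: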
# ...
tree = html.fromstring(page.content)
cards = tree.xpath('//div[contains(@class,"review-card")]')
for card in cards:
    # the leading dot makes each query relative to the current card
    data_always_exist = card.xpath(
        ".//script[starts-with(@data-initial-state, 'data-always-exist')]"
    )
    data_not_always_exist = card.xpath(
        ".//script[starts-with(@data-initial-state, 'data-may-not-exist')]"
    )
    print(data_always_exist)
    print(data_not_always_exist)
    print("-" * 80)
Prints:
[<Element script at 0x7fc202aadd10>]
[<Element script at 0x7fc202aade50>]
--------------------------------------------------------------------------------
[<Element script at 0x7fc202aadea0>]
[]
--------------------------------------------------------------------------------
[<Element script at 0x7fc202aade50>]
[<Element script at 0x7fc202aadea0>]
--------------------------------------------------------------------------------
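xpath() always returns a list, so getting the data itself still means taking the first match and parsing its text. Here is a rough sketch of one way to do that with a default of 0 for the optional script, assuming the script bodies are valid JSON. The SAMPLE string stands in for the downloaded page, and the helper name first_json is hypothetical. The sketch also compares an absolute query (//) with a relative one (.//) on the same card.

```python
import json
from lxml import html

# Inline stand-in for the real page (assumption: same structure as the question's HTML).
SAMPLE = """
<html><body>
<div class="review-list">
  <div class="review-card">
    <script type="application.json" data-initial-state="data-always-exist">
      {"reviewBody":"Brilliant value","stars":5}
    </script>
    <div class="content">
      <script type="application.json" data-initial-state="data-may-not-exist">
        {"isVerified":true,"verificationSource":"invitation"}
      </script>
    </div>
  </div>
  <div class="review-card">
    <script type="application.json" data-initial-state="data-always-exist">
      {"reviewBody":"Great","stars":4}
    </script>
  </div>
</div>
</body></html>
"""


def first_json(card, state, default=0):
    """Return the parsed JSON of the first matching script inside `card`, or `default`."""
    # $state is an XPath variable filled in from the keyword argument.
    matches = card.xpath(".//script[@data-initial-state=$state]", state=state)
    if not matches:
        return default
    return json.loads(matches[0].text_content())


tree = html.fromstring(SAMPLE)
cards = tree.xpath('//div[contains(@class,"review-card")]')

# "//" ignores the context element; ".//" stays inside it.
absolute = cards[0].xpath("//script")
relative = cards[0].xpath(".//script")

results = [
    (first_json(card, "data-always-exist"), first_json(card, "data-may-not-exist"))
    for card in cards
]
for always, maybe in results:
    print(always, maybe)
```

In this sample the absolute query on the first card matches all three scripts in the document, while the relative one matches only the two inside that card. This is the same effect that made the original loop return identical lists for every card.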