How do I scrape nested data using selenium and Python?
I basically want to scrape Feb 2016 - Present, which sits next to a <span class="visually-hidden">, but I can't get at it. Here is the HTML:
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
<div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 – Present</span>
</h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
<span class="visually-hidden">Location</span>
<span class="pv-entity__bullet-item">London, United Kingdom</span>
</h4></div>
</div>
Here is what I have tried so far with selenium:
date= browser.find_element_by_xpath('.//div[@class = "pv-entity__duration de Sans-15px-black-55% ml0"]').text
print date
But this returns nothing. How would I go about scraping the date?
You can rewrite your XPath like this:
# -*- coding: utf-8 -*-
from lxml import html
html_str = """
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
<div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 – Present</span>
</h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
<span class="visually-hidden">Location</span>
<span class="pv-entity__bullet-item">London, United Kingdom</span>
</h4></div>
</div>
"""
root = html.fromstring(html_str)
# Fetch "Feb 2016 – Present": text() yields one string per span,
# so index 1 skips the visually-hidden label at index 0.
txt = root.xpath('//h4[@class="pv-entity__date-range Sans-15px-black-55%"]/span/text()')[1]
# Fetch "1 yr 2 mos" the same way.
txt1 = root.xpath('//h4[@class="pv-entity__duration de Sans-15px-black-55% ml0"]/span/text()')[1]
print(txt)
print(txt1)
This will output:
Feb 2016 – Present
1 yr 2 mos
There is no div with class="pv-entity__duration de Sans-15px-black-55% ml0"; that class is on an h4. If you want the text of the enclosing div, try:
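# find_element_by_xpath was removed in Selenium 4.3; find_element(By.XPATH, ...) works in Selenium 3 and 4.
from selenium.webdriver.common.by import By
date = browser.find_element(By.XPATH, './/div[@class = "pv-entity__position-info detail-facet m0"]').text
print(date)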
If you want to get "Feb 2016 - Present", try:
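# span[2] is the second span inside the h4, i.e. the one after the visually-hidden label.
from selenium.webdriver.common.by import By
date = browser.find_element(By.XPATH, '//h4[@class="pv-entity__date-range Sans-15px-black-55%"]/span[2]').text
print(date)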
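The two points above (the class is on an h4, not a div, and the wanted text is in the second span) can be checked without a browser. The following is only a rough sketch: it swaps in Python's standard-library xml.etree.ElementTree, which supports a limited XPath subset, as a stand-in for Selenium or lxml. The markup is a trimmed copy of the snippet from the question, and the names SNIPPET, DATE_CLASS and DURATION_CLASS are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the markup from the question. It is well-formed,
# so ElementTree can parse it (real pages usually need an HTML parser).
SNIPPET = """
<div class="pv-entity__position-info detail-facet m0">
<h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 \u2013 Present</span>
</h4>
<h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4>
</div>
"""

root = ET.fromstring(SNIPPET.strip())

# Hypothetical constant names; the values are the class strings from the question.
DATE_CLASS = "pv-entity__date-range Sans-15px-black-55%"
DURATION_CLASS = "pv-entity__duration de Sans-15px-black-55% ml0"

# The question's XPath looks for a div with this class: nothing matches.
as_div = root.findall(".//div[@class='%s']" % DURATION_CLASS)
# The same class on an h4 matches exactly one element.
as_h4 = root.findall(".//h4[@class='%s']" % DURATION_CLASS)

# span[2] picks the second span child, skipping the visually-hidden label.
date_range = root.find(".//h4[@class='%s']/span[2]" % DATE_CLASS).text
duration = root.find(".//h4[@class='%s']/span[2]" % DURATION_CLASS).text

print(len(as_div), len(as_h4))
print(date_range)
print(duration)
```

Note that the exact @class comparison only matches while the page keeps that exact class string, spaces included, so this kind of selector will stop matching if the site changes its styling classes.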