Python：使用 lxml xpath 从所有 HTML 个子元素文本中获取文本

Question

我正在使用 python 的 lxml xpath。如果我提供 HTML 标签的完整路径，我就可以提取文本。但是我无法将标签中的所有文本及其子元素提取到列表中。因此，例如给定此 html 我想获取“示例”的所有文本 class:

<div class="example">
    "Some text"
    <div>
        "Some text 2"
        <p>"Some text 3"</p>
        <p>"Some text 4"</p>
        <span>"Some text 5"</span>
    </div>
    <p>"Some text 6"</p> 
</div>

我想得到：

["Some text", "Some text 2", "Some text 3", "Some text 4", "Some text 5", "Some text 6"]

Answer 1

mzjn-s 回答正确。经过反复试验，我设法让它工作了。这就是最终代码的样子。您需要将 //text() 放在 xpath 的末尾。暂时没有重构，所以肯定会有一些错误和不好的做法，但它是有效的。

    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    page = session.get("The url you are webscraping")
    content = page.content

    htmlsite = urllib.request.urlopen("The url you are webscraping")
    soup = BeautifulSoup(htmlsite, 'lxml')
    htmlsite.close()

    tree = html.fromstring(content)
    scraped = tree.xpath('//html[contains(@class, "no-js")]/body/div[contains(@class, "container")]/div[contains(@class, "content")]/div[contains(@class, "row")]/div[contains(@class, "col-md-6")]/div[contains(@class, "clearfix")]//text()')

我在keeleyteton.com的团队介绍页面上试过了。它返回了以下正确的列表（尽管需要大量修改！），因为它们位于不同的标签中，有些是子标签。感谢您的帮助！

['\r\n        ', '\r\n        ', 'Nicholas F. Galluccio', '\r\n        ', '\r\n        ', 'Managing Director and Portfolio Manager', '\r\n        ', 'Teton Small Cap Select Value', '\r\n        ', 'Keeley Teton Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Scott R. Butler', '\r\n        ', '\r\n        ', 'Senior Vice President and Portfolio Manager ', '\r\n        ', 'Teton Small Cap Select Value', '\r\n        ', 'Keeley Teton Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Thomas E. Browne, Jr., CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Brian P. Leonard, CFA', '\r\n        ', '\r\n
  ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Robert M. Goldsborough', '\r\n        ', '\r\n        ', 'Research Analyst', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Brian R. Keeley, CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Edward S. Borland', '\r\n        ', '\r\n
  ', 'Research Analyst', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Kevin M. Keeley', '\r\n        ', '\r\n        ', 'President', '\r\n
 ', '\r\n        ', '\r\n        ', 'Deanna B. Marotz', '\r\n        ', '\r\n        ', 'Chief Compliance Officer', '\r\n      ']

Python：使用 lxml xpath 从所有 HTML 个子元素文本中获取文本

Python: Get text from all HTML child elements texts with lxml xpath

python

xpath

lxml