How to enter a specific node in XPath with Python and lxml

Question

Imagine the following HTML:

<div class="group">
    <ul class="smallList">
        <li><strong>Date</strong>
            some Date
       </li>
       <li>
            <strong>Author</strong>
            some Name
       </li>
       <li>
            <strong>Keywords</strong>
            <a href="linka"
            rel="nofollow">keyworda</a>,
            <a href="linkb"
             rel="nofollow">Keywordb</a>,
       </li>
       <li>
            <strong>Print</strong>
            <a class="icon print" rel="nofollow" href="javascript:window.print()">print page</a>
       </li>
    </ul>
</div>
<div class="group">
    <ul class="smallList">
        <li><a href="linkc">Linktext</a></li>
    </ul>
</div>

I am looking for keyworda and Keywordb, that is, only the words in the list element that contains Keywords.

I can get all nodes using

.//div[@class='group']/ul[@class='smallList']/li/a/node()

But how do I select only that specific one?

Answer 1

I assume you want to get the keyword entries using XPath. The contains() function can help here. I'll use the parsel library, simply because it is easy to use, IMO. The same can be done with lxml or other Python libraries.

from parsel import Selector

data = "[your html above here]"
sel = Selector(data)

# The path looks for hyperlinks and checks two conditions:
# 1. href contains "link" AND
# 2. rel contains "nofollow".
# After that, access the text for this path.
path = ".//a[contains(@href,'link') and contains(@rel,'nofollow')]/text()"

# Extract the text using getall():
print(sel.xpath(path).getall())

['keyworda', 'Keywordb']
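
Since the question asks about lxml specifically, here is a minimal sketch of the same idea with lxml only. It is an illustration, not the only way to do it. Two things in it are my own assumptions: the markup is a trimmed copy of the HTML from the question with the quotes and the closing tag repaired, and the second expression (by_label) is an alternative I am adding that picks the li by the text of its strong label instead of relying on the href and rel attributes. The variable names are just for illustration.

```python
from lxml import html

# Trimmed copy of the question's markup (assumption: quotes and closing tag fixed).
HTML = """
<div class="group">
    <ul class="smallList">
        <li><strong>Date</strong> some Date</li>
        <li><strong>Author</strong> some Name</li>
        <li>
            <strong>Keywords</strong>
            <a href="linka" rel="nofollow">keyworda</a>,
            <a href="linkb" rel="nofollow">Keywordb</a>,
        </li>
        <li>
            <strong>Print</strong>
            <a class="icon print" rel="nofollow" href="javascript:window.print()">print page</a>
        </li>
    </ul>
</div>
<div class="group">
    <ul class="smallList">
        <li><a href="linkc">Linktext</a></li>
    </ul>
</div>
"""

# document_fromstring wraps the fragment in <html><body>, so "//" paths work.
root = html.document_fromstring(HTML)

# Same expression as in the parsel example: filter the links by their attributes.
by_attrs = root.xpath(
    ".//a[contains(@href,'link') and contains(@rel,'nofollow')]/text()"
)

# Alternative (my addition): keep only the <li> whose <strong> label is
# "Keywords", then take the text of the links inside it.
by_label = root.xpath(
    "//div[@class='group']/ul[@class='smallList']"
    "/li[strong[normalize-space()='Keywords']]/a/text()"
)

print(by_attrs)  # ['keyworda', 'Keywordb']
print(by_label)  # ['keyworda', 'Keywordb']
```

The label-based path does not depend on what the href values look like, so it should keep working if the keyword links change, as long as the strong label still reads Keywords.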

python

xpath

lxml