Lxml: get all items but test the next one as well - Python

Question

I'm having trouble parsing this XML with lxml. I'm using Python 3.6.9.

It looks like this:

<download date="22/05/2020 08:34">
    <link url="http://xpto" document="y"/>
    <link url="http://xpto" document="y"/>
    <subjects number="2"><subject>Text explaining the previous link</subject><subject>Another text explaining the previous link</subject></subjects>
    <link url="http://xpto" document="z"/>
    <subjects number="1"><subject>Text explaining the previous link</subject></subjects>
    <link url="http://xpto" document="y"/>
    <link url="http://xpto" document="z"/>
</download>

Currently, I can get all the links with this code (that part is easy):

import requests
from lxml import html 
response = html.fromstring(requests.post(url_post, data=data).content)
links = response.xpath('//link')

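As I pointed out in the XML, the subjects element (when present) is meant to explain the preceding link. Sometimes it holds more than one subject (as in the example above, where one subjects element has number="2", meaning it contains two 'subject' items, while the other 'subjects' has only one). It is a large XML file, so this pattern (several links in a row before one that is followed by an explanation) occurs often.

How can I build a query that gets all these links and, when a subjects element sits right next to one (more precisely, directly after the link), keeps them together or inserts it into the link as well?

My dream would be something like this:

<link url="http://xpto" document="y" subjects="Text explaining the previous link| Another text explaining the thing"/>

A list containing both the links and the subjects would also help a lot.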

[
[<link url="http://xpto" document="y"/>], 
[<link url="http://xpto" document="y"/>, <subjects number="2"><subject>Text explaining the previous link</subject><subject>Another text explaining the previous link</subject></subjects>],
[<link url="http://xpto" document="y"/>], 
]

Of course, feel free to suggest something different.

Thanks, everyone!

Answer 1

I think this is exactly what you need:

from lxml import html

example = """
<link url="some_url" document="a"/>
<link url="some_url" document="b"/>
<subjects><subject>some text</subject></subjects>
<link url="some_url" document="c"/>
<link url="some_url" document="d"/>
<subjects><subject>some text</subject><subject>some more</subject></subjects>
"""

response = html.fromstring(example)
links = response.xpath('//link')
result = []
for link in links:
    result.append([link])
    next_element = link.getnext()
    if next_element is not None and next_element.tag == 'subjects':
        result[-1].append(next_element)

print(result)

Result:

[[<Element link at 0x1a0891e0d60>], [<Element link at 0x1a0891e0db0>, <Element subjects at 0x1a089096360>], [<Element link at 0x1a0891e0e00>], [<Element link at 0x1a0891e0e50>, <Element subjects at 0x1a0891e0d10>]]

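Note that the list still contains lxml Element objects; you can of course convert them to strings if needed.

The key step is the line next_element = link.getnext(). For an lxml Element, the .getnext() method returns the next sibling in the document. So even though you are looping over the link elements matched by .xpath(), link.getnext() still gives you the subjects element whenever that is the next sibling in the document. If there is no next element (that is, for the last link, when no subjects follows it), .getnext() returns None, which is why the following line checks is not None.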
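To get from the [link, subjects] pairs to the attribute form the question asks for, the same "look at the next sibling" test can be combined with Element.set(). The following is only a sketch under a few assumptions: the data is parsed as XML rather than HTML, the link and subjects elements are direct children of one parent (as in the question's sample), and the names annotate_links and separator are made up for illustration. It pairs each child with its next sibling by position, which mirrors what link.getnext() returns, so that it also runs on the standard library's ElementTree when lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: fall back to the standard library; only the shared API is used.
    import xml.etree.ElementTree as etree

XML = """<download date="22/05/2020 08:34">
    <link url="http://xpto" document="y"/>
    <link url="http://xpto" document="y"/>
    <subjects number="2"><subject>Text explaining the previous link</subject><subject>Another text explaining the previous link</subject></subjects>
    <link url="http://xpto" document="z"/>
    <subjects number="1"><subject>Text explaining the previous link</subject></subjects>
    <link url="http://xpto" document="y"/>
    <link url="http://xpto" document="z"/>
</download>"""


def annotate_links(root, separator=" | "):
    """Hypothetical helper: copy the text of a directly following
    <subjects> element into a 'subjects' attribute on each <link>."""
    children = list(root)
    # Pair every child with its next sibling (None for the last child).
    for element, following in zip(children, children[1:] + [None]):
        if element.tag != "link":
            continue
        if following is not None and following.tag == "subjects":
            texts = [s.text.strip() for s in following.findall("subject") if s.text]
            element.set("subjects", separator.join(texts))
    return [child for child in children if child.tag == "link"]


root = etree.fromstring(XML)
links = annotate_links(root)
for link in links:
    # tostring() also emits the trailing whitespace (tail), hence strip().
    print(etree.tostring(link, encoding="unicode").strip())
```

Links that are not directly followed by a subjects element are returned unchanged, so the caller can tell them apart with link.get("subjects") is None.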

Answer 2

This isn't the most elegant way of doing things, but it gets the job done...

subjects= """
<download date="22/05/2020 08:34">
    <link url="http://xpto" document="y"/>
    <link url="http://xpto" document="y"/>
    <subjects number="2">
      <subject>First Text explaining the previous link</subject>
      <subject>Another text explaining the previous link</subject>
     </subjects>
    <link url="http://xpto2" document="z"/>
    <subjects number="1"><subject>Second Text explaining the previous link</subject></subjects>
    <link url="http://xpto3" document="y"/>
    <link url="http://xpto4" document="z"/>
</download>

"""
#Note that I changed your html a bit to emphasize the differences between nodes

import lxml.html as lh
import elementpath
doc = lh.fromstring(subjects)

elements = elementpath.select(doc, "//link[following-sibling::*[1][name()='subjects']]/concat('<link url=',./@url, ' document=xxx',@document,'xxx subjects=xxx',string-join(./following-sibling::subjects[1]//subject,' | '),'xxx/>')")
# I needed to use the xxx placeholder because I couldn't find a way to escape the double quote marks inside the expression, and this way is simple to implement    

for element in elements:
    print(element.replace('xxx','"'))

Output:

<link url=http://xpto document="y" subjects="First Text explaining the previous link | Another text explaining the previous link"/>
<link url=http://xpto2 document="z" subjects="Second Text explaining the previous link"/>

Answer 3

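I came up with this solution. It is a bit slower than @grismar's suggestion, but it does insert the 'subjects' into the link. On the other hand, it saves me from looping over the list once more to parse the '[[link, subjects],]' elements.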

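filteredData = response.xpath('//link | //subjects')  # links and subjects, in document order
links = []
for item in filteredData:
    if item.tag == 'subjects':
        if links:
            links[-1].append(item)  # moves the subjects element inside the preceding link
    else:
        links.append(item)  # filteredData is not mutated while iterating, so nothing is skipped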
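Once the subjects are nested inside their link, each link can be read in a single step. The sketch below illustrates that idea under stated assumptions: the input is parsed as XML, the sample data is shortened from the question, and nest_subjects and records are hypothetical names. The element is detached explicitly before being appended. lxml moves an element on append() by itself, but the standard library's ElementTree (used here as a fallback) does not, so the explicit remove() keeps both behaving the same.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: fall back to the standard library; only the shared API is used.
    import xml.etree.ElementTree as etree

XML = """<download date="22/05/2020 08:34">
    <link url="http://xpto" document="y"/>
    <link url="http://xpto2" document="z"/>
    <subjects number="1"><subject>Second text explaining the previous link</subject></subjects>
</download>"""


def nest_subjects(root):
    """Hypothetical helper: move each <subjects> under the <link>
    directly before it and return the links in document order."""
    links = []
    previous = None
    for element in list(root):  # iterate over a copy; the tree changes below
        if element.tag == "link":
            links.append(element)
        elif (element.tag == "subjects"
              and previous is not None and previous.tag == "link"):
            root.remove(element)      # detach from the parent first
            previous.append(element)  # then attach inside the preceding link
        previous = element
    return links


# One plain record per link; links without subjects get an empty list.
records = [
    {
        "url": link.get("url"),
        "document": link.get("document"),
        "subjects": [s.text for s in link.findall("subjects/subject")],
    }
    for link in nest_subjects(etree.fromstring(XML))
]
print(records)
```

The trade-off is that the parsed tree itself is modified, so this suits cases where the original layout is no longer needed afterwards.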
