How to get an element having its relative XPath?

Question

I have an XML file. After parsing it with lxml's etree, I can get all of its tags as follows:

root = tree.getroot()
for e in root.iter():
    print(e.tag)

The output looks like this:

'{http://www.w3.org/1999/xhtml}html'
'{http://www.w3.org/1999/xhtml}head'
'{http://www.w3.org/1999/xhtml}meta'
'{http://www.w3.org/1999/xhtml}link'
'{http://www.w3.org/1999/xhtml}meta'
'{http://www.w3.org/1999/xhtml}meta'
'{http://www.w3.org/1999/xhtml}meta'
'{http://www.w3.org/1999/xhtml}script'
'{http://www.w3.org/1999/xhtml}body'
'{http://www.w3.org/1999/xhtml}section'
'{http://www.w3.org/1999/xhtml}h1'
'{http://www.w3.org/1999/xhtml}p'
'{http://www.w3.org/1999/xhtml}em'
'{http://www.w3.org/1999/xhtml}section'
'{http://www.w3.org/1999/xhtml}h1'
'{http://www.w3.org/1999/xhtml}p'
'{http://www.w3.org/1999/xhtml}a'
'{http://www.w3.org/1999/xhtml}p'
'{http://www.w3.org/1999/xhtml}p'

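I want to use python/lxml/bs4 to get some elements by a relative path. For example, I want the first p element in the second section, and I have the following relative path: /section[2]/p[1].

But I can't even get all the sections with the following code; it returns None: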

xhtml = "{http://www.w3.org/1999/xhtml}"
section = xhtml + "section"
root.find(section)  # returns None

Edit: here is part of the original file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="grammar/rash.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml" prefix="schema: http://schema.org/ prism: http://prismstandard.org/namespaces/basic/2.0/">
   <head>
      <meta charset="UTF-8"/>
      <meta name="viewport" content="width=device-width, initial-scale=1"/>
      <link rel="stylesheet" href="css/bootstrap.min.css"/>
      <link rel="stylesheet" href="css/rash.css"/>
      <script src="js/jquery.min.js"><![CDATA[ ]]></script>
      <script src="js/bootstrap.min.js"><![CDATA[ ]]></script>
      <script src="js/rash.js"><![CDATA[ ]]></script>
      <title>It ROCS! -- The RASH Online Conversion Service</title>
      <meta about="#affiliation-1" property="schema:name" content="Department of Computer Science and Engineering, University of Bologna, Italy"/>
      <meta about="#affiliation-2" property="schema:name" content="Oxford e-Research Centre, University of Oxford, UK"/>
      <meta about="#affiliation-3" property="schema:name" content="Knowledge Media Institute, Open University, UK"/>
      <meta property="prism:keyword" content="HTML-based format"/>
      <meta property="prism:keyword" content="Scholarly HTML"/>
      <meta property="prism:keyword" content="RASH"/>
   <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"><![CDATA[ ]]></script></head>
   <body>
      <section role="doc-abstract">
         <h1>Abstract</h1>
         <p>In this poster paper we introduce the <em>RASH Online Conversion Service</em>, i.e., a Web application that allows the conversion of ODT documents into RASH, a HTML-based markup language for writing scholarly articles, and from RASH into LaTeX. This tool allows authors with no experience in HTML to easily produce HTML-based papers and supports the publishing process by generating also a LaTeX version according to the Springer LNCS and ACM ICPS layouts.</p>
      </section>
      <section>
         <h1>Introduction</h1>
         <p>The use of HTML as format for writing scholarly papers and submitting them to scholarly venues is a very popular, discussed and trendy topic within the scholarly domain. This is demonstrated by the existence of several posts within technical mailing lists of the Web community<a href="#ftn0"> </a>, by the birth of W3C community groups on such topic<a href="#ftn3"> </a>, by the development of HTML-based formats for scholarly articles<a href="#ftn4"> </a>, and by the increasing number of events that are experimenting with HTML-based formats for submissions, such as the SAVE-SD<a href="#ftn5"> </a> and LDOW<a href="#ftn6"> </a> workshops at WWW 2016, and the Extended Semantic Web Conference<a href="#ftn7"> </a>.</p>
         <p>In order to foster a wider adoption of these formats, frameworks for HTML-based papers should support the needs of all the actors involved in the production, delivery and fruition of scholarly articles, with particular regards to authors and publishers. Hence, this solution calls for a number of requirements that go well beyond those used on the Web. </p>
         <p>First of all, it is vital to support authors with a variety of tools to provide for an easy transition to the new format. To this end, authors should be allowed to keep using well-known current word processors rather than adopting HTML and/or pure text editors. We thus need to support the conversion from the main word processor formats (e.g., ODT and OOXML) to HTML formats, in particular when authors use only basic features, such as standard styles for paragraphs and tables. In addition, authors should be given the option to focus on the content and let appropriate tools handle the presentation layer after the conversion into the HTML-based format.</p>

In this example, I want to get the <p> element that begins with this sentence: "The use of HTML as format for writing scholarly..."

Answer 1

BeautifulSoup does not support XPath expressions, but lxml, which you mentioned, does.

You can search for elements with XPath as follows:

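from lxml import etree

# html_content: a filename or file-like object holding the document
# xpathselector: the XPath expression to evaluate, as a string
htmlparser = etree.HTMLParser()
tree = etree.parse(html_content, htmlparser)
tree.xpath(xpathselector)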
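
If the file is parsed as XML rather than with the HTML parser, every tag sits in the XHTML namespace, so the path has to name that namespace. The sketch below is one way to do that, not the only one. It assumes a trimmed, made-up stand-in for the file in the question (the XHTML constant), an arbitrary prefix "x" chosen for illustration, and that the relative path /section[2]/p[1] is meant relative to the body element. It also shows why root.find(section) returned None: a bare tag name matches only direct children of the root, and the sections are children of body.

```python
from lxml import etree

# Hypothetical, shortened stand-in for the XHTML file in the question.
XHTML = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>demo</title></head>
  <body>
    <section role="doc-abstract">
      <h1>Abstract</h1>
      <p>In this poster paper we introduce the <em>RASH Online Conversion Service</em>.</p>
    </section>
    <section>
      <h1>Introduction</h1>
      <p>The use of HTML as format for writing scholarly papers.</p>
      <p>In order to foster a wider adoption of these formats.</p>
    </section>
  </body>
</html>"""

root = etree.fromstring(XHTML)

# Map an arbitrary prefix to the XHTML namespace; "x" is just a label.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# A bare tag name only matches direct children of <html>, so this is None.
direct_child = root.find("{http://www.w3.org/1999/xhtml}section")

# ".//" searches all descendants, which finds both sections.
sections = root.findall(".//x:section", namespaces=NS)

# Evaluate the relative path from <body>: second section, first paragraph.
body = root.find("x:body", namespaces=NS)
first_p = body.xpath("x:section[2]/x:p[1]", namespaces=NS)[0]

print(direct_child)
print(len(sections))
print(first_p.text)
```

The same relative path also works with body.find("x:section[2]/x:p[1]", namespaces=NS), which returns the element directly (or None when nothing matches) instead of a list.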
