Can I access the subchild of a parent in XPath?
As the title says, I am parsing some HTML from http://chem.sis.nlm.nih.gov/chemidplus/name/acetone and want to extract some data, such as the Acetone under MeSH Heading, as in my similar earlier post.
The HTML:
<div id="names">
<h2>Names and Synonyms</h2>
<div class="ds">
<button class="toggle1Col" title="Toggle display between 1 column of wider results and multiple columns.">↔</button>
<h3>Name of Substance</h3>
<div class="yui3-g-r">
<div class="yui3-u-1-4">
<ul>
<li id="ds2">
<div>2-Propanone</div>
</li>
</ul>
</div>
<div class="yui3-u-1-4">
<ul>
<li id="ds3">
<div>Acetone</div>
</li>
</ul>
</div>
<div class="yui3-u-1-4">
<ul>
<li id="ds4">
<div>Acetone [NF]</div>
</li>
</ul>
</div>
<div class="yui3-u-1-4">
<ul>
<li id="ds5">
<div>Dimethyl ketone</div>
</li>
</ul>
</div>
</div>
<h3>MeSH Heading</h3>
<ul>
<li id="ds6">
<div>Acetone</div>
</li>
</ul>
</div>
</div>
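Previously, on other pages, I would do mesh_name = tree.xpath('//*[text()="MeSH Heading"]/..//div')[1].text_content()
to extract the data, because those pages had a similar structure. I now see that this does not always hold, since I had not accounted for inconsistencies between pages. So, is there a way to go to the node I want and then get its subchild, so that the extraction is consistent across different pages?
Would doing tree.xpath('//*[text()="MeSH Heading"]//preceding-sibling::text()[1]')
work?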
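As far as I understand, you need to get a list of items by heading.
How about making a reusable function that works for every heading in the "Names and Synonyms" container: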
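from lxml.html import parse

# Fetch and parse the page (parse() accepts a URL as well as a file name)
tree = parse("http://chem.sis.nlm.nih.gov/chemidplus/name/acetone")

def get_contents_by_title(tree, title):
    # Find the <h3> whose text equals title, step to the first element after it, and collect the text of every <div> inside
    return tree.xpath("//h3[. = '%s']/following-sibling::*[1]//div/text()" % title)

print(get_contents_by_title(tree, "Name of Substance"))
print(get_contents_by_title(tree, "MeSH Heading"))
Prints: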
['2-Propanone', 'Acetone', 'Acetone [NF]', 'Dimethyl ketone']
['Acetone']
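To see what that XPath selects without fetching the page, here is a minimal offline sketch that spells out the same steps by hand: find the h3 with the given text, take the element immediately after it, and collect the text of every div inside. It is not the lxml code above. It uses the standard library's xml.etree.ElementTree, which has no following-sibling axis, so the sibling is found by walking the parent's children. FRAGMENT is a trimmed copy of the HTML from the question and is assumed to be well-formed XML, which a real page may not be.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the HTML from the question (assumption: well-formed XML).
FRAGMENT = """
<div class="ds">
  <h3>Name of Substance</h3>
  <div class="yui3-g-r">
    <div class="yui3-u-1-4"><ul><li id="ds2"><div>2-Propanone</div></li></ul></div>
    <div class="yui3-u-1-4"><ul><li id="ds3"><div>Acetone</div></li></ul></div>
    <div class="yui3-u-1-4"><ul><li id="ds4"><div>Acetone [NF]</div></li></ul></div>
    <div class="yui3-u-1-4"><ul><li id="ds5"><div>Dimethyl ketone</div></li></ul></div>
  </div>
  <h3>MeSH Heading</h3>
  <ul><li id="ds6"><div>Acetone</div></li></ul>
</div>
"""


def get_contents_by_title(root, title):
    """Return the text of each <div> inside the element right after the <h3> titled `title`."""
    for parent in root.iter():
        children = list(parent)
        # The last child has no following sibling, so it is skipped as a heading candidate.
        for i, child in enumerate(children[:-1]):
            if child.tag == "h3" and (child.text or "").strip() == title:
                block = children[i + 1]  # same role as following-sibling::*[1]
                # Keep only divs that carry non-whitespace text of their own.
                return [d.text.strip() for d in block.iter("div")
                        if d.text and d.text.strip()]
    return []  # heading not found


root = ET.fromstring(FRAGMENT)
print(get_contents_by_title(root, "Name of Substance"))
# ['2-Propanone', 'Acetone', 'Acetone [NF]', 'Dimethyl ketone']
print(get_contents_by_title(root, "MeSH Heading"))
# ['Acetone']
```

Because the lookup starts from the heading and only looks at the element right after it, it does not matter whether that element is the multi-column div or a plain ul, which is the inconsistency between pages described in the question.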