How do I find all tr elements within a table element using XPath?
from lxml import html

def parse_header(table):
    ths = table.xpath('//tr/th')
    if not ths:
        # Here is the problem: this finds tr[1]/td in the whole HTML document instead of only in this table.
        ths = table.xpath('//tr[1]/td')
    # ... something else ...

doc = html.fromstring(html_string)
table = doc.xpath("//div[@id='divGridData']/div[2]/table")[0]
parse_header(table)
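I want to find all tr[1]/td within my table, but table.xpath("//tr[1]/td") still matches cells from the whole HTML document. How can I search only within this element rather than the entire document?
Edit: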
content = '''
<root>
  <table id="table-one">
    <tr>
      <td>content from table 1</td>
    </tr>
    <table>
      <tr>
        <!-- this is content I do not want to get -->
        <td>content from embedded table</td>
      </tr>
    </table>
  </table>
</root>'''
root = etree.fromstring(content)
# xpath() returns a list, so take the first match
table_one = root.xpath('table[@id="table-one"]')[0]
all_td_elements = table_one.xpath('//td')  # this gives me too much!
Now I don't want the content of the embedded table. How can I do that?
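To find elements that are descendants of the context node, prefix the XPath expression with a period (.). A path that starts with // is absolute: it searches from the document root, no matter which element you call xpath() on. So I think the XPath you are looking for is:
.//tr[1]/td
This selects the td elements that are descendants of the current table rather than those in the entire HTML document.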
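Applied to the parse_header function from the question, the fix might look like the sketch below. The original function body is elided, so the return value, the sample markup, and the table ids here are assumptions made only for illustration.

```python
from lxml import html

def parse_header(table):
    """Return the header cell texts of `table`, searching only inside it."""
    # The leading "." anchors the search at `table` instead of the document root.
    ths = table.xpath('.//tr/th')
    if not ths:
        # No <th> cells: fall back to the cells of the first row.
        ths = table.xpath('.//tr[1]/td')
    # Returning the cell texts is an assumption; the original body is not shown.
    return [cell.text_content().strip() for cell in ths]

# Hypothetical markup with two tables; only the second one is passed in.
html_string = '''
<div>
  <table id="first"><tr><td>A</td><td>B</td></tr></table>
  <table id="second">
    <tr><td>Name</td><td>Age</td></tr>
    <tr><td>Ann</td><td>30</td></tr>
  </table>
</div>'''

doc = html.fromstring(html_string)
table = doc.xpath('//table[@id="second"]')[0]
headers = parse_header(table)
print(headers)
```

With the absolute path //tr[1]/td, the cells of the first table would be included as well.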
For example:
from lxml import etree
content = '''
<root>
  <table id="table-one">
    <tr>
      <td>content from table 1</td>
    </tr>
  </table>
  <table id="table-two">
    <tr>
      <td>content from table 2</td>
    </tr>
  </table>
</root>'''
root = etree.fromstring(content)
# xpath() returns a list, so take the first match
table_one = root.xpath('table[@id="table-one"]')[0]
# this selects all td elements in the entire XML document (so two elements)
all_td_elements = table_one.xpath('//td')
# this selects only the single td inside table_one, because of the leading period
just_sub_td_elements = table_one.xpath('.//td')
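For the embedded table in the question's edit, .//td is still too broad, because the nested cells are also descendants of the outer table. One possible approach, sketched below under the assumption that the rows are direct children of the table, is to walk child steps only (./tr/td). The second expression is an alternative that keeps only cells whose nearest enclosing table is table-one. The variable names are made up for this example.

```python
from lxml import etree

# The question's sample document, with the closing tags fixed.
content = '''
<root>
  <table id="table-one">
    <tr>
      <td>content from table 1</td>
    </tr>
    <table>
      <tr>
        <td>content from embedded table</td>
      </tr>
    </table>
  </table>
</root>'''

root = etree.fromstring(content)
table_one = root.xpath('table[@id="table-one"]')[0]

# All descendant cells, including those of the embedded table.
all_cells = table_one.xpath('.//td')

# Only cells in rows that are direct children of table_one.
direct_cells = table_one.xpath('./tr/td')

# Only cells whose nearest ancestor table is table-one.
own_cells = table_one.xpath('.//td[ancestor::table[1]/@id = "table-one"]')

print(len(all_cells), [td.text for td in direct_cells])
```

If the real markup wraps rows in tbody, ./tr/td would find nothing; ./tbody/tr/td or the ancestor-based expression would be needed instead.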