How to find a specific td class in a table with XPath, using lxml and Python
I am trying to import a list of text strings from a page using Python and lxml.
This is what I have so far.
Source of test_page.html:
<html>
<head>
<title>Test</title>
</head>
<body>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr><td><a title="This page is cool" class="producttitlelink" href="about:mozilla">This page is cool</a></td></tr>
<tr height="10"></tr>
<tr><td class="plaintext">This is a really cool description for my really cool page.</td></tr>
<tr><td class="plaintext">Published: 7/15/15</td></tr>
<tr><td class="plaintext">
</td></tr>
<tr><td class="plaintext">
</td></tr>
<tr><td class="plaintext">
</td></tr>
<tr><td class="plaintext">
</td></tr>
</tbody>
</table>
</body>
</html>
Python code:
from lxml import html
import requests
page = requests.get('http://127.0.0.1/test_page.html')
tree = html.fromstring(page.text)
description = tree.xpath('//table//td[@class="plaintext"]/text()')
>>> print(description)
['This is a really cool description for my really cool page.', 'Published: 7/15/15', '\n\t\t\n\t\t\t\t\n\t\t\n\t', '\n\t\t\t\t\n\t\n\t', '\n\t\t\t\t\n\t\n\t', '\n\t\t\t\t\n\t']
However, the desired end result is:
['This is a really cool description for my really cool page. Published: 7/15/15']
I thought that using [1], as in
tree.xpath('//table//td[@class="plaintext"][1]/text()')
might give me just the first row:
['This is a really cool description for my really cool page.']
But it returns the whole list.
Is there a way to use XPath to select a single row, or a list of rows, for this HTML?
You can try it this way:
from lxml import html
source = """html posted in the question here"""
tree = html.fromstring(source)
tds = tree.xpath('//table//td[@class="plaintext"]/text()[normalize-space()]')
description = ' '.join(tds)
print(description)
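The XPath predicate [normalize-space()] applied to text() returns only those text nodes that are not whitespace-only.
With the HTML from the question, the output of the code above is exactly what was asked for: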
This is a really cool description for my really cool page. Published: 7/15/15
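As for why the [1] attempt in the question returned every row: in XPath 1.0, a positional predicate written directly after a step counts positions within each parent, so td[...][1] means "the first matching td inside each tr", and every row qualifies. Wrapping the path in parentheses applies the index to the whole result instead. Here is a minimal sketch of that, assuming a trimmed copy of the question's table (the SOURCE string and the variable names are made up for illustration):

```python
from lxml import html

# Trimmed copy of the table from the question (an assumption: the
# whitespace inside the empty cell is simplified here).
SOURCE = """
<html><body>
<table>
<tbody>
<tr><td><a class="producttitlelink" href="about:mozilla">This page is cool</a></td></tr>
<tr><td class="plaintext">This is a really cool description for my really cool page.</td></tr>
<tr><td class="plaintext">Published: 7/15/15</td></tr>
<tr><td class="plaintext">
</td></tr>
</tbody>
</table>
</body></html>
"""

tree = html.fromstring(SOURCE)

# [1] here is evaluated per parent <tr>, so each row's only td matches.
per_row = tree.xpath('//table//td[@class="plaintext"][1]/text()')

# Parentheses make [1] apply to the whole node-set, in document order.
first_only = tree.xpath('(//table//td[@class="plaintext"])[1]/text()')

# position() can pick a range of rows the same way.
first_two = tree.xpath('(//table//td[@class="plaintext"])[position() <= 2]/text()')

print(first_only)
print(first_two)
```

The same parenthesized form combines with the [normalize-space()] filter if you want to index only the non-empty text nodes.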
normalize-space() is a nice solution that I did not know about, but in practice something like this works better:
' '.join([col.strip() for col in tree.xpath('//table//td[@class="plaintext"]/text()') if col.strip()])
It really does get rid of all the empty cells.
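To make the difference between the two approaches concrete, here is a small sketch that runs both on the same input. The fragment below is hypothetical (not the question's page), with padding added around one cell's text: the [normalize-space()] predicate only decides which text nodes are kept and returns them unchanged, while the strip() version also trims whatever whitespace surrounds the text it keeps.

```python
from lxml import html

# Hypothetical fragment: the second cell has whitespace around its text,
# and the third cell contains whitespace only.
SOURCE = """
<table>
<tr><td class="plaintext">First line.</td></tr>
<tr><td class="plaintext">
    Second line.
</td></tr>
<tr><td class="plaintext">
</td></tr>
</table>
"""

tree = html.fromstring(SOURCE)
cells = '//table//td[@class="plaintext"]/text()'

# XPath-side filter: drops whitespace-only nodes, leaves the rest untouched.
filtered = tree.xpath(cells + '[normalize-space()]')

# Python-side filter: drops whitespace-only nodes and trims the survivors.
stripped = [text.strip() for text in tree.xpath(cells) if text.strip()]
joined = ' '.join(stripped)

print(filtered)
print(joined)
```

If the cells on the real page never carry padding, both approaches should give the same joined string, and the choice is mostly a matter of taste.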