Web Scraping in Python, confused with XPath

Question

I have basic knowledge of scraping. Here is a basic example:

import requests
from lxml import html

page = requests.get('http://some_website.com')  # requests needs a URL scheme
tree = html.fromstring(page.text)
desc = tree.xpath('//div[@class = "my class"]/text()')

My desc will return whatever text is inside the div. But what should I do if my HTML is more complicated, for example:

<tr>
    <th class="my class">some text</th>
    <td>some text</td>
</tr>

I need only the part inside <td></td> that is inside <tr></tr>. And how would I proceed if the <tr> is inside a <div>?

Answer 1

You should probably read an XPath tutorial to get a better understanding.

"I need only the part that is inside <td></td> that is inside <tr></tr>. And how would I proceed if the <tr> would be inside a <div>?"

In your case, that would be:

//div[@class = "my class"]//tr/td/text()
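As a rough check of the expression above, here is a minimal sketch that runs it with lxml against an inline HTML string instead of a live page. The SAMPLE markup is a made-up assumption (the row from the question wrapped in a div whose class is "my class"), not the asker's real page.

from lxml import html

# Hypothetical sample markup: the question's row inside a div with class "my class".
SAMPLE = """
<div class="my class">
  <table>
    <tr>
      <th class="my class">header text</th>
      <td>cell text</td>
    </tr>
  </table>
</div>
"""

tree = html.fromstring(SAMPLE)

# "//tr" matches rows at any depth below the div; "/td/text()" keeps only
# the text of their td children, so the th text is skipped.
cells = tree.xpath('//div[@class = "my class"]//tr/td/text()')
print(cells)

Because the double slash before tr matches descendants at any depth, the same expression should keep working if the table is nested more deeply inside the div.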

If you know the "some text" in advance, you can move sideways with the following-sibling axis:

//div[@class = "my class"]//th[. = "some text"]/following-sibling::td/text()
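A small sketch of the sibling-based lookup, again on invented markup: the two-row table and its values are assumptions made only to show that the header text selects the matching row.

from lxml import html

# Hypothetical two-row table; only the row whose th reads "some text" should match.
SAMPLE = """
<div class="my class">
  <table>
    <tr><th>name</th><td>Alice</td></tr>
    <tr><th>some text</th><td>wanted value</td></tr>
  </table>
</div>
"""

tree = html.fromstring(SAMPLE)

# Find the th whose full text equals "some text", then step sideways
# to the td elements that follow it within the same row.
values = tree.xpath(
    '//div[@class = "my class"]//th[. = "some text"]/following-sibling::td/text()'
)
print(values)

The comparison [. = "some text"] is exact. If the real header may carry surrounding whitespace, writing the predicate as [normalize-space(.) = "some text"] is probably more forgiving.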
