How to skip paragraphs with comments in an XPath expression?

Question

I am trying to scrape websites such as this one using the following XPath expression:

.//div[@class="tresc"]/p[not(starts-with(text(), "<!--"))]

The problem is that the first paragraph consists of a comment, so I want to skip it:

<!--[if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:HyphenationZone>21</w:HyphenationZone>
<w:PunctuationKerning />
<w:ValidateAgainstSchemas />
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:Compatibility>
<w:BreakWrappedTables />
<w:SnapToGridInCell />
<w:WrapTextWithPunct />
<w:UseAsianBreakRules />
<w:DontGrowAutofit />
</w:Compatibility>
<w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
</w:WordDocument>
</xml><![endif]-->

Unfortunately, my expression does not skip the paragraph that contains the comment. Does anyone know what I am doing wrong?

Answer 1

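Comments are not part of text(); they are nodes of their own kind, matched by comment(). That is why starts-with(text(), "<!--") never matches: the comment markup is not text content of the p element. To exclude every p that contains a comment, use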

p[not(comment())]
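To make the node distinction concrete, here is a minimal Python sketch that uses only the standard library. It is not the XPath expression itself: ElementTree's limited XPath subset has no comment() node test, so the list comprehension below stands in for p[not(comment())]. The sample markup, the paragraph texts, and the helper names parse_keeping_comments and paragraphs_without_comments are made up for illustration, and insert_comments requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the first <p> holds only a comment, as in the question.
SAMPLE = """
<div class="tresc">
  <p><!--[if gte mso 9]><xml><w:WordDocument></w:WordDocument></xml><![endif]--></p>
  <p>First real paragraph.</p>
  <p>Second real paragraph.</p>
</div>
"""


def parse_keeping_comments(markup):
    """Parse markup and keep comments in the tree (dropped by default)."""
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    return ET.fromstring(markup, parser=parser)


def paragraphs_without_comments(root):
    """Return the <p> elements that have no comment child.

    Comment nodes appear as children whose tag is the ET.Comment factory,
    so they are separate from the paragraph's text.
    """
    return [
        p for p in root.iter("p")
        if not any(child.tag is ET.Comment for child in p)
    ]


if __name__ == "__main__":
    root = parse_keeping_comments(SAMPLE)
    for p in paragraphs_without_comments(root):
        print(p.text)
```

With an XPath engine that supports the full node tests, the answer's predicate can presumably be dropped into the original expression as .//div[@class="tresc"]/p[not(comment())].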

xpath

scrapy

web-scraping

xpath-2.0