Getting HTML elements via XPath in bash

Question

I am trying to parse a page (Kaggle Competitions) with xpath on MacOS, as described in another SO question:

curl "https://www.kaggle.com/competitions/search?SearchVisibility=AllCompetitions&ShowActive=true&ShowCompleted=true&ShowProspect=true&ShowOpenToAll=true&ShowPrivate=true&ShowLimited=true&DeadlineColumnSort=Descending" -o competitions.html
cat competitions.html | xpath '//*[@id="competitions-table"]/tbody/tr[205]/td[1]/div/a/@href'

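This is just meant to get the href of a link in the table.

But instead of returning the value, xpath starts validating the .html and returns errors such as undefined entity at line 89, column 13, byte 2964.

Since man xpath does not exist and xpath --help prints nothing, I am stuck. Also, many similar solutions relate to the xpath from GNU distributions, not the one on MacOS.

Is there a correct way of getting HTML elements via XPath in bash?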

Answer 1

Getting HTML elements via XPath in bash

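from an html file (which is not valid xml)

One possibility is to use xsltproc (I hope it is available on Mac). xsltproc has an option --html that lets it take HTML as input. But for that you need an XSLT stylesheet.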

<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" /> 

  <xsl:template match="/*">
    <xsl:value-of  select="//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" />
  </xsl:template>

</xsl:stylesheet>

Notice that the xpath has changed: there is no tbody in the input file. Call xsltproc:

xsltproc --html  test.xsl competitions.html 2> /dev/null

The errors xsltproc complains about in the html are ignored (sent to /dev/null).

The output is: /c/R
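
The missing tbody mentioned above can be checked before adjusting the expression. XPaths copied from a browser's developer tools often include tbody because the browser adds it while building the DOM, even when the downloaded file does not contain it. The following is only a rough sketch: has_tag is a hypothetical helper name, and it does a plain text search with grep rather than real HTML parsing, so it can be fooled by comments or scripts.

```shell
# Hypothetical helper: has_tag TAG FILE
# Succeeds (exit status 0) if FILE contains an opening <TAG ...> element.
# This is a case-insensitive text match, not an HTML parse.
has_tag() {
  grep -qiE "<$1([[:space:]>/]|\$)" "$2"
}
```

For example, has_tag tbody competitions.html && echo "tbody present" || echo "no tbody" tells you whether to keep /tbody in the path.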

To use different xpath expressions from the command line, you can use an XSLT template and replace the __xpath__ placeholder.

For example, the XSLT template (test.xslt.tmpl):

<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" /> 
  <xsl:template match="/*">
    <xsl:value-of  select="__xpath__" />
  </xsl:template>
</xsl:stylesheet>

Then do the replacement with (for example) sed:

 sed -e "s,__xpath__,//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href," test.xslt.tmpl > test.xsl
 xsltproc --html  test.xsl competitions.html 2> /dev/null
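
The template-and-substitute steps above could also be wrapped in a small bash function, so the expression is passed as an argument. This is a minimal sketch, not a tested tool: make_xsl and html_xpath are hypothetical names, the stylesheet is generated with a here-document instead of sed, and it assumes the expression contains no double quotes, < or & characters (these would need escaping inside the select attribute).

```shell
# Hypothetical helper: make_xsl XPATH
# Prints a stylesheet that outputs the string value of XPATH as text.
# Assumption: XPATH contains no double quotes, '<' or '&'.
make_xsl() {
  local xpath=$1
  cat <<EOF
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="/*">
    <xsl:value-of select="$xpath" />
  </xsl:template>
</xsl:stylesheet>
EOF
}

# Hypothetical helper: html_xpath XPATH FILE
# Writes the stylesheet to a temporary file, runs xsltproc in HTML mode,
# discards its complaints about the HTML, and removes the temporary file.
html_xpath() {
  local xsl status
  xsl=$(mktemp "${TMPDIR:-/tmp}/html_xpath.XXXXXX") || return 1
  make_xsl "$1" > "$xsl"
  xsltproc --html "$xsl" "$2" 2> /dev/null
  status=$?
  rm -f "$xsl"
  return "$status"
}
```

It would be called as html_xpath "//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" competitions.html. As in the original stylesheet, xsl:value-of returns only the first matching node, so an expression that matches several links still prints one value.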