Parsing a table with xpath in bash

Question

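I have an html table that I want to parse in bash (note: I have already done this in R, but I want to try it in bash so that it integrates easily with another shell script).

The table can be obtained from the following url: http://faostat.fao.org/site/384/default.aspx

From viewing the source, the xpath reference to the specific table is: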

//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]

How can I parse this table into a csv file directly from bash?

I tried the following:

curl "http://faostat.fao.org/site/384/default.aspx" | xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]' > test.txt

This just produces an empty test.txt.

Can anyone help me parse a valid html table in bash using xpath and create a CSV file from it?

Any help appreciated.

Answer 1

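//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/tr (that is, appending /tr to the XPath expression in your question) grabs just the rows and skips the table wrapper (which you don't need to do anything with in your output). Some of the examples below use /* instead, which selects every child element of the table; that matches the same rows provided the table's only children are tr elements.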
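To illustrate the difference between selecting the table and selecting its rows, here is a minimal sketch using Python's standard xml.etree.ElementTree, which supports a limited XPath subset. The markup and the id "listing" are made-up stand-ins for the real page, and the snippet is well-formed XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the real page (the id "listing" is made up).
DOC = """
<html><body>
  <table id="listing">
    <tr><th>Group Name</th><th>Item Name</th></tr>
    <tr><td>Crops</td><td>Apples</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(DOC)

# Selecting the element by id returns the single <table> wrapper.
wrapper = root.findall(".//*[@id='listing']")

# Appending /tr returns the rows themselves.
rows = root.findall(".//*[@id='listing']/tr")

print([el.tag for el in wrapper])
print([[cell.text for cell in row] for row in rows])
```

With the wrapper out of the way, each selected element is one row, which is what the line-oriented clean-up in the pipelines below relies on.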

You then also need to pipe the xmllint --xpath output through sed or perl or whatever:

Example: perl version

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
   | xmllint --html \
     --xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/*' - \
     2>/dev/null \
   | perl -pe 's/<tr[^>]+>//' \
   | perl -pe 's/<\/tr>//' \
   | perl -pe 's/^\s+<t[dh][^>]*>//' \
   | perl -pe 's/<\/t[dh]><t[dh][^>]*>/|/g' \
   | perl -pe 's/<\/t[dh]>//' \
   | grep -v '^\s*$'

Example: sed version

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
   | xmllint --html \
     --xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/*' - \
     2>/dev/null \
   | sed -E 's/<tr[^>]+>//' \
   | sed -E 's/<\/tr>//' \
   | sed -E 's/^[[:space:]]+<t[dh][^>]*>//' \
   | sed -E 's/<\/t[dh]><t[dh][^>]*>/|/g' \
   | sed -E 's/<\/t[dh]>//' \
   | grep -v '^\s*$'

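In both cases, the grep -v '^\s*$' is just there to remove blank lines.

This isn't strictly CSV; it separates the fields/cells with a | (pipe) character instead of a comma, because some (many) of the fields themselves contain commas and quotes. If you really do want CSV, scroll down and read "How to generate true CSV" below.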
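As a small illustration of why a bare comma is a poor separator here, the following Python sketch joins one row from the table both ways. The row values are taken from the sample output in this answer; the variable names are only for illustration.

```python
# One row from the sample output; the last field contains a comma.
fields = ["Crops", "221", "0802.11_a", "Almonds, with shell"]

# Joining with a bare comma makes the row look like it has five fields.
naive = ",".join(fields)
print(len(naive.split(",")))   # 5

# Joining with "|" keeps four fields, as long as no field contains "|".
piped = "|".join(fields)
print(len(piped.split("|")))   # 4
```

The pipe only works because the data happens not to contain one; that is the same kind of assumption that real CSV quoting removes.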

Using python and lxml instead

As an alternative to xmllint --xpath, you can use Python and the lxml.html library:

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
   | python3 -c "import lxml.html as html; import sys; \
       expr = sys.argv[1]; print('\n'.join(html.tostring(el, encoding='unicode') \
       for el in html.parse(sys.stdin.buffer).xpath(expr)))" \
       '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]//tr' \
   | sed -E 's/<tr[^>]+>//' \
   | sed -E 's/<\/tr>//' \
   | sed -E 's/^[[:space:]]+<t[dh][^>]*>//' \
   | sed -E 's/<\/t[dh]><t[dh][^>]*>/|/g' \
   | sed -E 's/<\/t[dh]>//' \
   | grep -v '^\s*$'

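Format the output with the `column` and `colrm` commands

If you want a pretty-printed/formatted column/table view of the results to read in the console and scroll/page through, pipe the output further into the column and colrm commands, like this: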

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
   | xmllint --html \
     --xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/*' - \
     2>/dev/null \
   | sed -E 's/<tr[^>]+>//' \
   | sed -E 's/<\/tr>//' \
   | sed -E 's/^[[:space:]]+<t[dh][^>]*>//' \
   | sed -E 's/<\/t[dh]><t[dh][^>]*>/|/g' \
   | sed -E 's/<\/t[dh]>//' \
   | grep -v '^\s*$' \
   | column -t -s '|' \
   | colrm 14 21 | colrm 20 28 | colrm 63 95 | colrm 80

That gives you output like the following:

Results formatted with column and colrm

Group Name         Item FAO Code    Item HS+ Code    Item Name      Definition
Crops              800              5304_c           Agave fib      Including int
Crops              221              0802.11_a        Almonds,       Prunus amygda
Crops              711              0909             Anise, ba      Include: anis
Crops              515              0808.10_a        Apples         Malus pumila;
Crops              526              0809.10_a        Apricots       Prunus armeni
…

Alternatively, you can use the cut command instead of colrm to get the same formatting.
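The colrm offsets above are tied to the widths column happened to produce for this particular table. As a rough sketch of the same idea without hard-coded offsets, the Python below truncates every cell to a fixed width and then pads the columns. The cap MAX_WIDTH and the three-column sample rows are assumptions for illustration; this is not what column or colrm do internally.

```python
# Pipe-delimited lines as produced by the pipelines above (shortened sample).
rows = [
    "Group Name|Item FAO Code|Item Name",
    "Crops|800|Agave fibres nes",
    "Crops|221|Almonds, with shell",
]
MAX_WIDTH = 12  # hypothetical cap on the displayed width of each cell

# Truncate each cell, then pad every column to its widest remaining cell.
table = [[cell[:MAX_WIDTH] for cell in line.split("|")] for line in rows]
widths = [max(len(row[i]) for row in table) for i in range(len(table[0]))]

for row in table:
    print("  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip())
```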

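How to generate true CSV

If, instead of pretty-printed/formatted output like the above, you really do want true CSV, then you must also emit quotes around the fields and CSV-escape the existing quotes inside the fields, like this:

Example: true CSV output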

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
   | xmllint --html \
     --xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/tr' - \
   | sed -E 's/"/""/g' \
   | sed -E 's/<tr[^>]+>//' \
   | sed -E 's/<\/tr>//' \
   | sed -E 's/^[[:space:]]+<t[dh][^>]*>/"/' \
   | sed -E 's/<\/t[dh]><t[dh][^>]*>/","/g' \
   | sed -E 's/<\/t[dh]>/"/' \
   | grep -v '^\s*$'

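Tools that consume CSV apparently expect every quote character to be escaped as two quote characters together; for example, as with the word ""fufu"" below.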

  "In West Africa they are consumed mainly as ""fufu"", a stiff glutinous dough."

So the sed -E 's/"/""/g' part of the snippet above does that.
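For comparison, Python's standard csv module applies the same quote-doubling convention by default. The sketch below is only a cross-check of the convention, using a shortened version of the "fufu" sentence; it is not part of the shell pipeline.

```python
import csv
import io

row = ["Crops", 'consumed mainly as "fufu", a stiff glutinous dough.']

buf = io.StringIO()
# QUOTE_ALL wraps every field in quotes; embedded quotes are doubled by default.
csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n").writerow(row)
line = buf.getvalue()
print(line, end="")
# "Crops","consumed mainly as ""fufu"", a stiff glutinous dough."

# Reading the line back recovers the original fields.
assert next(csv.reader(io.StringIO(line))) == row
```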

CSV output of the xmllint/sed example above

"Group Name","Item FAO Code","Item HS+ Code","Item Name ","Definition"
"Crops","800","5304_c","Agave fibres nes","Including inter alia: Haiti hemp…"
"Crops","221","0802.11_a","Almonds, with shell","Prunus amygdalus; P. communis…"
"Crops","711","0909","Anise, badian, fennel, coriander","Include: anise…"

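Disclaimer: you should avoid regexp-based processing of HTML/XML

(Obligatory disclaimer) All that said, many people will tell you that regexp-based processing of HTML/XML is kludgy and error-prone. They are right, so use the approach above sparingly (if at all).

If you have time to do it right, what you should do instead is: use a good web-scraping library, or use Python+lxml to actually process the results returned from evaluating the XPath expression (rather than stringifying the results), or use xsltproc or some other XSLT engine.

But if you just need something quick and dirty from the command line, the sed/perl pipelines above get the job done. They are brittle, though, so don't be shocked if some part of the output breaks in some unexpected way. If you want robust handling of HTML/XML, don't use a regexp-based approach.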
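As a rough sketch of the non-regexp route, the Python below reads table cells with a parser and writes them with the csv module, instead of editing stringified markup. To stay dependency-free it swaps in the standard library's html.parser rather than lxml, so there is no XPath step: it collects the rows of every table in the input. The class name TableRows and the sample markup are made up for illustration.

```python
import csv
import sys
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the open row, or None outside <tr>
        self._cell = None  # text chunks of the open cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()          # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()         # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag == "tr":
            self._end_row()

    def _end_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample markup standing in for the downloaded page.
SAMPLE = """
<table id="listing">
  <tr class="h"><th>Item Name</th><th>Definition</th></tr>
  <tr><td>Almonds, with shell</td><td>Eaten as "fufu" &amp; more</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
parser.close()

# The csv module handles the quoting and quote-doubling.
csv.writer(sys.stdout, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(parser.rows)
```

To run it on a real page you would feed it the downloaded text instead of SAMPLE, for example sys.stdin.read() at the end of a wget pipeline. Restricting it to one table by id would need extra bookkeeping in handle_starttag, or lxml with the XPath expression from the question.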
