How to cut an HTML file (drop anything outside two tags)?

Question

Given this sample HTML document:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>title</title>
  </head>
  <body>
    <iframe></iframe>
    <div class="text">TEST</div>
    <div id="trend" data-app="openableBox" class="box sub-box">
        <div class="box-header">
            <h1><span>Highlights</span></h1>
        </div>
    </div>
  </body>
</html>

How can I extract

<iframe></iframe>
<div class="text">TEST</div>

by deleting everything before <iframe> and everything from <div id="trend"> onward?

Thanks in advance for any help.

Answer 1

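When processing HTML/XML data from the command line, use a proper HTML/XML parser.
xmllint is one such tool. The XPath expression below selects the children of <body> that are either an <iframe> or a <div> whose class attribute is exactly "text".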

xmllint --html --xpath '//body/*[self::iframe or self::div[@class="text"]]' input.html

Output:

<iframe></iframe><div class="text">TEST</div>
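For readers without xmllint, the same selection can be sketched with Python's standard-library html.parser. This is not part of the original answer: the names wanted, BodyChildren, select_body_children, and SAMPLE are made up for illustration, and the sketch assumes reasonably balanced markup like the sample document.

```python
from html import escape
from html.parser import HTMLParser

# Void elements have no end tag, so they never go on the open-element stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def wanted(tag, attrs):
    """Mirror the XPath predicate: iframe, or div whose class is exactly "text"."""
    return tag == "iframe" or (tag == "div" and attrs.get("class") == "text")


class BodyChildren(HTMLParser):
    """Serialize the direct children of <body> that satisfy wanted()."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # tags of the currently open elements
        self.depth = None     # len(open_tags) when the current match began
        self.buf = []         # pieces of the match being captured
        self.matches = []     # finished matches, in document order

    def handle_starttag(self, tag, attrs):
        is_body_child = self.open_tags[-1:] == ["body"]
        if self.depth is None and is_body_child and wanted(tag, dict(attrs)):
            self.depth = len(self.open_tags)
        if self.depth is not None:
            self.buf.append(self.get_starttag_text())
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # ignore a stray end tag
        while self.open_tags.pop() != tag:
            pass
        if self.depth is not None:
            self.buf.append(f"</{tag}>")
            if len(self.open_tags) <= self.depth:
                # The matched element itself has just been closed.
                self.matches.append("".join(self.buf))
                self.buf, self.depth = [], None

    def handle_data(self, data):
        if self.depth is not None:
            self.buf.append(escape(data, quote=False))


def select_body_children(html):
    parser = BodyChildren()
    parser.feed(html)
    parser.close()
    return parser.matches


SAMPLE = """<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>title</title>
  </head>
  <body>
    <iframe></iframe>
    <div class="text">TEST</div>
    <div id="trend" data-app="openableBox" class="box sub-box">
        <div class="box-header">
            <h1><span>Highlights</span></h1>
        </div>
    </div>
  </body>
</html>
"""

if __name__ == "__main__":
    print("".join(select_body_children(SAMPLE)))
```

Run against the sample document, this should print the same two elements as the xmllint command. Unlike the XPath version, the predicate lives in ordinary Python, so it is easy to adjust but also easy to get subtly wrong on malformed pages.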

Answer 2

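Here is a solution to the general problem, assuming the goal is to select a range of elements based on a "linearization" of the HTML. It uses pup to convert the HTML to JSON, and then jq to perform the linearization, the selection, and the conversion back to HTML.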

program.jq

The idea is to "linearize" the HTML by recursively hoisting children to the top level:

# Emit a stream by hoisting .children recursively.
# It is assumed that the input is an array, 
# and that .children is always an array.
def hoist:
  .[]
  | if type == "object" and has("children")
    then del(.children), (.children | hoist)
    else .
    end;

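# Emit the 0-based index of the first item in the input array
# that satisfies `condition`, or null if there is none.
def indexof(condition):
  label $out
  | foreach .[] as $x (null; .+1;
      if ($x|condition) then .-1, break $out else empty end)
    // null;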

# Reconstitute the HTML element
def toHtml:
  def k: . as $in | (keys_unsorted - ["tag", "text"])
  | reduce .[] as $k (""; . + " \($k)=\"\($in[$k])\"");
  def t: if .text then .text else "" end;
  "<\(.tag)\(k)>\(t)</\(.tag)>"
  ;

# Linearize and then select the desired range of elements
[hoist]
| indexof( .tag == "iframe") as $first
| indexof( .tag == "div" and .id=="trend") as $last
| .[$first:$last]
| .[]
| toHtml

Invocation:

pup 'json{}' < input.html | jq -rf program.jq

Output:

<iframe></iframe>
<div class="text">TEST</div>
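The linearize-then-slice idea does not depend on pup or jq. As a rough illustration only, here is the same approach sketched with Python's standard-library html.parser. The names Linearizer, index_of, to_html, cut, and SAMPLE are hypothetical, and the text handling (whitespace-stripped text directly inside each element) is an assumption that may differ from what pup emits.

```python
from html import escape
from html.parser import HTMLParser

# Void elements have no end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Linearizer(HTMLParser):
    """Record every element in document order as a flat list of dicts."""

    def __init__(self):
        super().__init__()
        self.flat = []    # all elements, parents before their children
        self.stack = []   # currently open elements

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.flat.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, ignoring stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Attach non-blank text to the innermost open element.
        if self.stack and data.strip():
            self.stack[-1]["text"] += data.strip()


def index_of(items, condition):
    """Index of the first item satisfying condition, or None."""
    return next((i for i, x in enumerate(items) if condition(x)), None)


def to_html(node):
    """Rebuild one element from its tag, attributes, and own text."""
    attrs = "".join(
        f" {k}" if v is None else f' {k}="{escape(v)}"'
        for k, v in node["attrs"].items()
    )
    text = escape(node["text"], quote=False)
    return f'<{node["tag"]}{attrs}>{text}</{node["tag"]}>'


def cut(html):
    """Keep elements from the first <iframe> up to, not including, div#trend."""
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    first = index_of(parser.flat, lambda n: n["tag"] == "iframe")
    last = index_of(
        parser.flat,
        lambda n: n["tag"] == "div" and n["attrs"].get("id") == "trend",
    )
    return [to_html(n) for n in parser.flat[first:last]]


SAMPLE = """<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>title</title>
  </head>
  <body>
    <iframe></iframe>
    <div class="text">TEST</div>
    <div id="trend" data-app="openableBox" class="box sub-box">
        <div class="box-header">
            <h1><span>Highlights</span></h1>
        </div>
    </div>
  </body>
</html>
"""

if __name__ == "__main__":
    print("\n".join(cut(SAMPLE)))
```

As with the jq program, nesting is discarded once the elements are flattened, so this only suits cases where the wanted range consists of elements whose children do not need to be preserved.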


