Replacing filename placeholder with file contents in sed

Question

I am trying to write a basic script to compile HTML file includes. The premise is this:

I have 3 files

test.html

<div>
   @include include1.html

   <div>content</div>

   @include include2.html
</div>

include1.html

<span>
   banana
</span>

include2.html

<span>
   apple
</span>

My desired output is:

output.html

<div>
   <span>
      banana
   </span>

   <div>content</div>

   <span>
      apple
   </span>
</div>

I have tried the following:

sed "s|@include \(.*\)|$(cat \1)|" test.html >output.html

This returns cat: 1: No such file or directory

sed "s|@include \(.*\)|cat \1|" test.html >output.html

This runs but gives:

output.html
```
<div>
   cat include1.html

   <div>content</div>

   cat include2.html
</div>
```

Any ideas on how to run cat inside sed using group substitution? Or perhaps another solution.

Answer 1

You can use this bash script, which uses a regex to detect lines containing @include and a capture group to grab the name of the included file:

re="@include +([^[:space:]]+)"

while IFS= read -r line; do
    [[ $line =~ $re ]] && cat "${BASH_REMATCH[1]}" || echo "$line"
done < test.html

<div>
<span>
   banana
</span>

   <div>content</div>

<span>
   apple
</span>
</div>

Alternatively, you can use this awk script to do the same:

awk '$1 == "@include"{system("cat " $2); next} 1' test.html
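Both variants above print the included lines without the indentation of the @include line. A minimal sketch of one way to keep it, written with plain POSIX parameter expansion so that it runs in bash as well as sh. The helper name expand_includes and the scratch files are hypothetical, and it assumes exactly one space after @include and a file name with nothing after it:

```shell
# Sketch: expand "@include FILE" lines and indent the included text to
# match the directive. expand_includes is a hypothetical helper name.
# Assumes a single space after @include and nothing after the file name.
expand_includes() (
    while IFS= read -r line || [ -n "$line" ]; do
        indent=${line%%[![:space:]]*}   # leading whitespace of the line
        rest=${line#"$indent"}          # the line without that whitespace
        case $rest in
            '@include '*)
                file=${rest#'@include '}
                # Re-emit each included line behind the directive's indent.
                while IFS= read -r inc || [ -n "$inc" ]; do
                    printf '%s%s\n' "$indent" "$inc"
                done < "$file"
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done < "$1"
)

# Demo in a scratch directory with illustrative files.
demo_dir=$(mktemp -d)
printf '%s\n' '<div>' '   @include include1.html' '</div>' > "$demo_dir/test.html"
printf '%s\n' '<span>' '   banana' '</span>' > "$demo_dir/include1.html"
result=$(cd "$demo_dir" && expand_includes test.html)
printf '%s\n' "$result"
rm -rf "$demo_dir"
```

Unlike the awk version with system(), this never hands the file name to a shell for evaluation, so a name such as $(date) is treated as a plain (probably nonexistent) file.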

Answer 2

If you have GNU sed, you can use the e flag to the s command, which executes the current pattern space as a shell command and replaces it with the output:

$ sed 's/@include/cat/e' test.html
<div>
<span>
   banana
</span>

   <div>content</div>

<span>
   apple
</span>
</div>

Notice that this doesn't take care of indentation, because the included files don't have any. An HTML prettifier such as Tidy can help you further:

$ sed 's/@include/cat/e' test.html | tidy -iq --show-body-only yes
<div>
  <span>banana</span>
  <div>
    content
  </div><span>apple</span>
</div>

GNU sed has a command to read a file, r, but the filename can't be generated on the fly.
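For comparison, here is a sketch of r when the file name is known in advance and written into the script by hand (the scratch files are illustrative). Since r takes the rest of its line as the file name, each command gets its own -e:

```shell
# Scratch files for the demo.
demo_dir=$(mktemp -d)
printf '%s\n' '<div>' '   @include include1.html' '</div>' > "$demo_dir/test.html"
printf '%s\n' '<span>' '   banana' '</span>' > "$demo_dir/include1.html"

# r queues include1.html for output at the end of the cycle;
# d then drops the @include line itself.
result=$(cd "$demo_dir" &&
    sed -e '/@include include1\.html/{' \
        -e 'r include1.html' \
        -e 'd' \
        -e '}' test.html)
printf '%s\n' "$result"
rm -rf "$demo_dir"
```

This works, but it needs one hand-written block per included file, which is exactly the limitation described above.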

As Ed points out in his comment, this is vulnerable to shell command injection: if you have something like

@include $(date)

you'll notice that the date command is actually run. This can be avoided, but the succinctness of the original solution goes out the window:

sed 's|@include \(.*\)|cat "$(/usr/bin/printf "%q" '\''\1'\'')"|e' test.html

This still replaces @include with cat, but additionally wraps the rest of the line in a command substitution with printf "%q", so a line such as

@include include1.html

becomes

cat "$(/usr/bin/printf "%q" 'include1.html')"

before being executed as a command. This expands to

cat include1.html

but if the file were named $(date), it becomes

cat '$(date)'

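(notice the single quotes), preventing the injected command from being executed.

Because s///e seems to use /bin/sh as its shell, you can't rely on the %q format specification of Bash's printf being available, hence the absolute path to the printf binary. For readability, I've changed the / delimiters of the s command to | (so I don't have to escape \/usr\/bin\/printf).

Finally, the quoting mess around \1 is there to get a single quote into a single-quoted string: '\'' becomes '.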
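The '\'' idiom can be tried on its own, outside sed. A small sketch (the variable names are illustrative, and the sed script is a simplified fragment, not the full command above):

```shell
# Inside '...' nothing is special, so a literal ' cannot be written
# directly. '\'' closes the string, adds an escaped quote, and reopens it.
quoted='it'\''s'
printf '%s\n' "$quoted"

# The same trick puts single quotes around \1 in a sed script.
script='s|@include \(.*\)|cat '\''\1'\''|'
printf '%s\n' "$script"
```

The first printf prints it's and the second prints the script as sed would receive it, with \1 wrapped in literal single quotes.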

Answer 3

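I wrote this 15-20 years ago to recursively include files, and it's included in the article I wrote about how/when to use getline, under "Applications" then "d)". I've now tweaked it to work with your specific "@include" directive, to indent the included text to match the "@include" indentation, and added a safeguard against infinite recursion (e.g. file A includes file B and file B includes file A):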

$ cat tst.awk
function read(file,indent) {
    if ( isOpen[file]++ ) {
        print "Infinite recursion detected" | "cat>&2"
        exit 1
    }

    while ( (getline < file) > 0) {
        if ($1 == "@include") {
             match($0,/^[[:space:]]+/)
             read($2,indent substr($0,1,RLENGTH))
        } else {
             print indent $0
        }
    }
    close(file)

    delete isOpen[file]
}

BEGIN{
   read(ARGV[1],"")
   exit
}

.

$ awk -f tst.awk test.html
<div>
   <span>
      banana
   </span>

   <div>content</div>

   <span>
      apple
   </span>
</div>

Note that if include1.html itself contained an @include ... directive then it would be honored too, and so on. Look:

$ for i in test.html include?.html; do printf -- '-----\n%s\n' "$i"; cat "$i"; done
-----
test.html
<div>
   @include include1.html

   <div>content</div>

   @include include2.html
</div>
-----
include1.html
<span>
   @include include3.html
</span>
-----
include2.html
<span>
   apple
</span>
-----
include3.html
<div>
   @include include4.html
</div>
-----
include4.html
<span>
   grape
</span>

.

$ awk -f tst.awk test.html
<div>
   <span>
      <div>
         <span>
            grape
         </span>
      </div>
   </span>

   <div>content</div>

   <span>
      apple
   </span>
</div>

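With a non-GNU awk I'd expect it to fail after about 20 levels of recursion with a "too many open files" error, so use GNU awk if you need to go deeper than that, or you'll have to write your own file management code.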
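The infinite-recursion safeguard can be exercised with two files that include each other. A sketch, assuming the tst.awk shown in this answer; a.html and b.html are hypothetical test files created in a scratch directory:

```shell
demo_dir=$(mktemp -d)

# tst.awk as in the answer above.
cat > "$demo_dir/tst.awk" <<'EOF'
function read(file,indent) {
    if ( isOpen[file]++ ) {
        print "Infinite recursion detected" | "cat>&2"
        exit 1
    }
    while ( (getline < file) > 0) {
        if ($1 == "@include") {
             match($0,/^[[:space:]]+/)
             read($2,indent substr($0,1,RLENGTH))
        } else {
             print indent $0
        }
    }
    close(file)
    delete isOpen[file]
}
BEGIN{
   read(ARGV[1],"")
   exit
}
EOF

# Two hypothetical files that include each other.
printf '%s\n' '<a>' '@include b.html' > "$demo_dir/a.html"
printf '%s\n' '<b>' '@include a.html' > "$demo_dir/b.html"

# Capture stdout, stderr and the exit status separately.
stdout=$(cd "$demo_dir" && awk -f tst.awk a.html 2>stderr.txt)
status=$?
message=$(cat "$demo_dir/stderr.txt")
printf 'status=%s\nstderr=%s\nstdout:\n%s\n' "$status" "$message" "$stdout"
rm -rf "$demo_dir"
```

The script prints the lines it reached before the cycle, then stops with a non-zero exit status and the message on stderr instead of looping forever.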
