Rename HTML files using the first of many <h2> tags, replacing any forward slash with a hyphen

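I have a folder containing a bunch of HTML files.

I want to rename each file based on the first h2 tag in that file. If the tag contains a forward slash, the slash should be replaced with a hyphen.

So if SMG6E30A14100000000DAAT00.html contains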

</head><body><h2>Side Impact/Sensor (Second) Replacement</h2><a name="iR01"></a><h2><b>Removal</b></h2>

I want the script to rename the file to Side Impact-Sensor (Second) Replacement.html

And if there is no slash in the first h2 tag:

<h2>Front Seat Belt Replacement</h2>SRS components are located in this area. <a href="./SMG6E00H46400000000DAAT00.html">Review the SRS component locations</a> and the <a href="./SMG6E00H46400000000AAAT00.html">precautions and procedures</a> in the SRS before doing repairs or service.<br><br>NOTE: Check the front seat belts for damage, and replace them if necessary. Be careful not to damage them during removal and installation.<br><br><a name="iR01"></a><h2><b>Front Seat Belt</b></h2>

the file should be renamed to Front Seat Belt Replacement.html

How can I do this on Linux?

The following command returns the desired file name for test.html:

< ./test.html tr -d '\n' | grep -oP -m 1 '(?<=<h2>).*?(?=</h2>)' | head -1 | tr '/' '-'

You can create a shell script that uses this command in a loop to scan all the files, get each new file name, and rename the files.

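for filename in ./input/*.html; do
    # Join the file into one line, take the text of the first <h2>...</h2>, and replace '/' with '-'
    newname=$(< "${filename}" tr -d '\n' | grep -oP -m 1 '(?<=<h2>).*?(?=</h2>)' | head -1 | tr '/' '-')
    # Quote both paths: the new names contain spaces and parentheses
    mv "${filename}" "./output/${newname}.html"
done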
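
The loop above assumes that ./output already exists, that every file contains an <h2>, and that no two files share the same first heading. A slightly more defensive variant might look like the sketch below. The function names first_h2_name and rename_by_h2 are hypothetical, and it assumes GNU grep with -P support, as the original command does.

```shell
#!/usr/bin/env bash
# Sketch only: first_h2_name and rename_by_h2 are hypothetical helper names.

# Print the text of the first <h2>...</h2> in file $1, with '/' replaced by '-'.
first_h2_name() {
    tr -d '\n' < "$1" \
        | grep -oP '(?<=<h2>).*?(?=</h2>)' \
        | head -n 1 \
        | tr '/' '-'
}

# Move every .html file from directory $1 into directory $2 under its new name.
rename_by_h2() {
    local in_dir=$1 out_dir=$2 file newname
    mkdir -p "$out_dir"
    for file in "$in_dir"/*.html; do
        [ -e "$file" ] || continue              # the glob matched nothing
        newname=$(first_h2_name "$file")
        if [ -z "$newname" ]; then
            echo "skipped (no <h2> found): $file" >&2
            continue
        fi
        if [ -e "$out_dir/$newname.html" ]; then
            echo "skipped (target exists): $file" >&2
            continue
        fi
        mv -- "$file" "$out_dir/$newname.html"
    done
}
```

After sourcing these functions, `rename_by_h2 ./input ./output` would process the folder. Files that are skipped stay in the input directory, so nothing is overwritten or renamed to an empty name.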