Search with regex but replace only a portion of the string with sed

Question

I am trying to replace every URL that matches cwe.mitre.org.*.html (a regular expression) by removing its .html extension, without changing any other kind of URL.

Example:

https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html

Expected:

https://cwe.mitre.org/data/definitions/377
http://google.com/404.html

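Is there a way to do this in sed or some other tool?

I have already tried sed -Ei 's/cwe.mitre.org.*.html/<REPLACEMENT>/g' file.txt, but that does not work. Is there a way to make <REPLACEMENT> a regular expression? The sed manual does not seem to suggest one.

Edit: I was wrong about the sed manual. It does cover this; see the section "5.7 Back-references and Subexpressions" at https://www.gnu.org/software/sed/manual/sed.html.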

Answer 1

$ sed 's/\(cwe\.mitre\.org.*\)\.html/\1/' file
https://cwe.mitre.org/data/definitions/377
http://google.com/404.html

Search for "sed capture groups".

Answer 2

A GNU AWK solution. Let the content of file.txt be

https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html

then

awk '/cwe\.mitre\.org.*\.html/{sub(/\.html$/,"")}{print}' file.txt

gives the output

https://cwe.mitre.org/data/definitions/377
http://google.com/404.html

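Explanation: if the given regular expression is found in a line, replace .html followed by end of line ($) with the empty string. Every line, changed or not, is then printed.

(tested in GNU Awk 5.0.1)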
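
One consequence of the $ anchor is worth seeing in isolation: the .html is only removed when it is the very last thing on the line. The sketch below wraps the same awk program in a hypothetical helper, awk_strip, and feeds it two made-up sample lines; the second line and the helper name are assumptions for illustration, not part of the original question.

```shell
#!/bin/sh
# Hypothetical helper wrapping the awk program from this answer.
awk_strip() {
    awk '/cwe\.mitre\.org.*\.html/{sub(/\.html$/,"")}{print}'
}

# The URL is the whole line, so .html sits at end of line and is removed.
bare=$(printf '%s\n' 'https://cwe.mitre.org/data/definitions/377.html' | awk_strip)

# Text follows the URL, so /\.html$/ finds nothing to replace and the line is unchanged.
trailing=$(printf '%s\n' 'see https://cwe.mitre.org/data/definitions/377.html for details' | awk_strip)

printf '%s\n' "$bare"
printf '%s\n' "$trailing"
```

If the URLs can appear in the middle of a line, the anchored sub() would need to be replaced by a pattern that does not depend on end of line.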

Answer 3

Use

sed -Ei 's/(cwe\.mitre\.org.*)\.html/\1/' file

Explanation

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    cwe                      'cwe'
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    mitre                    'mitre'
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    org                      'org'
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \.                       '.'
--------------------------------------------------------------------------------
  html                     'html'

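A back-reference refers to the part of the string captured by a parenthesized fragment of the pattern. Use a back-reference when you want that part of the match to remain in the result.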
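
To make the role of the back-reference concrete, here is a small sketch that runs the same pattern twice on the sample input from the question: once with an empty replacement and once with \1. The variable names are made up for illustration, and -E is assumed to be available (it is in GNU sed and in current BSD sed).

```shell
#!/bin/sh
# Sample input from the question, one URL per line.
input='https://cwe.mitre.org/data/definitions/377.html
http://google.com/404.html'

# Empty replacement: everything the pattern matched, capture group included, is deleted.
without_ref=$(printf '%s\n' "$input" | sed -E 's/(cwe\.mitre\.org.*)\.html//')

# \1 re-inserts the captured text, so only the trailing .html is dropped.
with_ref=$(printf '%s\n' "$input" | sed -E 's/(cwe\.mitre\.org.*)\.html/\1/')

printf '%s\n' "$without_ref"
printf '%s\n' "$with_ref"
```

With the empty replacement the first line is reduced to https://, while with \1 it becomes https://cwe.mitre.org/data/definitions/377. The google.com line is untouched in both runs.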

Answer 4

Another possibility is

% sed '/cwe\.mitre\.org/s/\.html//' try.txt 
https://cwe.mitre.org/data/definitions/377
Nothing
hello.html
http://google.com/404.html

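This is not obviously better than the accepted answer (for example, it would be confused by foo.html text http://cwe.mitre.org/bar.html, though the other answers probably also assume that there is only one relevant URL per line). However, I mention it as a complement to that answer, because it usefully illustrates that a sed command can be prefixed by an "address", which can include a regular expression. This script removes .html on any line that contains cwe.mitre.org.

This feature is often forgotten and only occasionally useful, but when it fits, it can avoid complications in the s command's pattern slot and with back-references.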
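
The caveat about several URLs on one line can be checked directly. The sketch below runs the address form on the problematic line and compares it with one possible stricter variant, which assumes that a URL never contains whitespace; that assumption, the [^[:space:]]* pattern, and the variable names are mine rather than anything from the answers above.

```shell
#!/bin/sh
# The problematic line mentioned above: an unrelated .html precedes the cwe URL.
line='foo.html text http://cwe.mitre.org/bar.html'

# Address form: the address selects the line, but s/// then removes the first .html on it.
addressed=$(printf '%s\n' "$line" | sed '/cwe\.mitre\.org/s/\.html//')

# Stricter sketch: strip .html only from a whitespace-free run that starts at cwe.mitre.org.
scoped=$(printf '%s\n' "$line" | sed -E 's/(cwe\.mitre\.org[^[:space:]]*)\.html/\1/g')

printf '%s\n' "$addressed"
printf '%s\n' "$scoped"
```

The address form yields foo text http://cwe.mitre.org/bar.html, whereas the scoped variant leaves foo.html alone and trims only the cwe.mitre.org URL. The g flag lets the scoped variant handle more than one such URL per line.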
