Grep: exclude matches inside <!-- --> comments when counting occurrences in a curl response body

Question

I am new to Linux and bash scripting. I am trying to read an XML file with curl and count how many times the string </entity> occurs in it.

curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" | grep '</entity>' -oP | wc -l

This works, but the XML file contains comments like the one below, which makes the count wrong.

Sample XML file

.........
........
 <entity>
.......
.......
</entity>
........
........
<!--
.......
<entity>
........
</entity>
.......
.......
-->
<entity>
.......
........
</entity>

The expected output is 2, because one of the matches is inside a comment block.

Answer 1

Since you are using GNU grep, here is a PCRE regex solution to your problem:

curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text%2Fxml%3Bcharset%3Dutf-8" |
grep -ZzoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
tr '\0' '\n' |
wc -l

2


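RegEx details:

(?s): enable DOTALL mode so that the dot also matches newlines
<!--.*?-->: match a comment block (lazily, up to the first -->)
(*SKIP)(*F): skip this comment block and fail the match
|: OR
</entity>: match </entity> outside a comment block
tr '\0' '\n': convert NUL bytes to newlines
wc -l: count the lines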
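
To try this without the server, here is a minimal sketch that swaps the curl call for a small inline document. The sample_xml function and its contents are hypothetical stand-ins for the real response, not part of the original setup. It runs the naive count and the PCRE count side by side, and assumes a GNU grep built with -P support.

```shell
# Hypothetical stand-in for the curl response:
# three </entity> tags, one of them inside a comment block.
sample_xml() {
  printf '%s\n' \
    '<root>' \
    '<entity>' 'a' '</entity>' \
    '<!--' \
    '<entity>' 'b' '</entity>' \
    '-->' \
    '<entity>' 'c' '</entity>' \
    '</root>'
}

# Naive count: every </entity>, commented out or not.
naive_count=$(sample_xml | grep -o '</entity>' | wc -l)

# PCRE count: -z makes grep treat the whole input as one record,
# and (*SKIP)(*F) consumes and discards each comment block.
pcre_count=$(sample_xml |
  grep -ZzoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
  tr '\0' '\n' |
  wc -l)

# $(( )) strips the padding some wc implementations add.
echo "naive=$((naive_count)) pcre=$((pcre_count))"
```

With the sample above this should report 3 for the naive count and 2 for the PCRE count. Once that looks right, replacing sample_xml with the real curl command gives the pipeline from the answer.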

Answer 2

As always with XML, regular expressions are the wrong tool. Use something that understands the format. For example, using xmllint and some XPath:

curl ... | xmllint --xpath 'count(//entity)' -

(Note the trailing -; unlike many programs, xmllint does not automatically read from standard input when no file name is given on the command line.)
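
A minimal sketch of how that behaves on a small inline document, assuming xmllint (part of libxml2) is installed. The sample_xml function is a hypothetical stand-in for the curl output. Comments are not element nodes, so the commented-out <entity> never reaches the XPath count.

```shell
# Hypothetical stand-in for the curl response.
sample_xml() {
  printf '%s\n' \
    '<root>' \
    '<entity>a</entity>' \
    '<!--' \
    '<entity>b</entity>' \
    '-->' \
    '<entity>c</entity>' \
    '</root>'
}

if command -v xmllint >/dev/null 2>&1; then
  # The trailing - tells xmllint to read the document from standard input.
  xpath_count=$(sample_xml | xmllint --xpath 'count(//entity)' -)
  echo "entities outside comments: $xpath_count"
else
  echo "xmllint not found (it ships with libxml2)" >&2
fi
```

If the real file declares a default namespace, //entity will match nothing; in that case an expression such as count(//*[local-name()="entity"]) is one way around it.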

Answer 3

With your shown samples, please try the following awk code. Written and tested in GNU awk.

your_curl_command |
awk -v RS="" '
match($0,/(^|\n)<!--[^-]*-->/){
  val=substr($0,RSTART,RLENGTH)
  gsub(val,"")
}
END{
  while(match($0,/(\n|^)[[:space:]]*<entity>[^<]*<\/entity>/)){
    count++
    $0=substr($0,RSTART+RLENGTH)
  }
  print count
}
'

Explanation: a detailed explanation of the code above.

your_curl_command |                ##Run the curl command and pipe its output to awk.
awk -v RS="" '                     ##Set RS to the empty string (paragraph mode: records are separated by blank lines).
match($0,/(^|\n)<!--[^-]*-->/){    ##Use match() with the regex (^|\n)<!--[^-]*--> (explained below).
  val=substr($0,RSTART,RLENGTH)    ##If the regex matches, store the matched substring in val.
  gsub(val,"")                     ##Use gsub (global substitution) to remove every occurrence of val from the current record; val is treated as a regex.
}
END{                               ##Start of the END block of this awk program.
  while(match($0,/(\n|^)[[:space:]]*<entity>[^<]*<\/entity>/)){  ##Loop while the regex (\n|^)[[:space:]]*<entity>[^<]*<\/entity> still matches, to count all the matches.
    count++                        ##Add 1 to the count variable.
    $0=substr($0,RSTART+RLENGTH)   ##Keep only the text after the current match, so it is not matched again.
  }
  print count                      ##Print the value of count.
}
'

Explanation of the first regex ((^|\n)<!--[^-]*-->):

(^|\n)    ##Match either the start of the value OR a newline.
<!--[^-]* ##Followed by <!-- and then any run of characters other than -.
-->       ##Followed by -->.

Explanation of the second regex ((\n|^)[[:space:]]*<entity>[^<]*<\/entity>):

(\n|^)                ##Match a newline OR the start of the value.
[[:space:]]*<entity>  ##Followed by zero or more whitespace characters, then <entity>.
[^<]*                 ##Followed by any run of characters other than <.
<\/entity>            ##Followed by </entity>.
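
Note that [^-]* stops at the first hyphen, so a comment whose text contains a - is not removed by the first regex. As a hedged alternative (a sketch, not the code above), the comment spans can be cut out with plain string search before counting. The sample_xml and count_close_tags names are hypothetical, and the sketch assumes the document is small enough to hold in memory.

```shell
# Hypothetical sample: the comment text contains a hyphen ("old-style").
sample_xml() {
  printf '%s\n' \
    '<root>' \
    '<entity>' 'a' '</entity>' \
    '<!-- old-style entry' \
    '<entity>' 'b' '</entity>' \
    '-->' \
    '<entity>' 'c' '</entity>' \
    '</root>'
}

# Count </entity> tags outside comments using index()/substr() only.
count_close_tags() {
  awk '
    { doc = doc $0 "\n" }                      # collect the whole input
    END {
      # Remove each comment span, from its opening marker to the next closing marker.
      while ((s = index(doc, "<!--")) > 0) {
        rest = substr(doc, s + 4)              # text after the opening marker
        e = index(rest, "-->")
        if (e == 0) {                          # unterminated comment: drop the tail
          doc = substr(doc, 1, s - 1)
          break
        }
        doc = substr(doc, 1, s - 1) substr(rest, e + 3)
      }
      # gsub returns the number of substitutions it made.
      print gsub(/<\/entity>/, "", doc)
    }'
}

awk_count=$(sample_xml | count_close_tags)
echo "closing tags outside comments: $awk_count"
```

This counts closing tags anywhere outside a comment, including ones with nested markup between <entity> and </entity>, which the [^<]* part of the second regex would not match.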

Answer 4

gawk/mawk/mawk2/nawk '
BEGIN {
 1      FS = RS = "^$"
 1      _____ = "[<][\/]entity[>]"
 1      ____ = ""
 1      ___ =   ""
 1      __ = ("[\n][<][!]")(_="[-][-][\n]")
 1      sub("......","[\n]&[>]",_)
}

# Rule(s)

 1  ($!-_=gsub(_____,"&",
     $((  gsub(__,____)*gsub(_, ___)*\
          gsub(____"[^"(___)"]*"___,""))~"")))_'


regex

bash

awk

grep

curl