Generate field-value frequency counts with jq

I can query all the unique values of a JSON field like this:

$ cat all.json | jq '.complianceState' | sort | uniq

"compliant"
"configManager"
"inGracePeriod"
"noncompliant"
"unknown"

I can also get the frequency count of each unique field value the pedestrian way, one value at a time, like this:

$ cat all.json | jq '.complianceState' | grep '^"configManager"$' | wc -l

116

Is there a way in jq to do this all in one go, producing output like this:

{
    "compliant" : 123000,
    "noncompliant" : 2000,
    "configManager" : 116
}

From my standard library:

# bag of words
# WARNING: this is not collision-free!
def bow(stream): 
  reduce stream as $word ({}; .[($word|tostring)] += 1);

With this, you can use the filter:

bow(inputs | .complianceState)

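together with the -n command-line option, so that jq does not read the first input itself and inputs sees every record.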
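As a rough sketch of how this behaves, the snippet below inlines the same reduce that bow performs and runs it on three made-up records (the sample data and the shell variable names are assumptions for illustration only). The second command illustrates the "not collision-free" warning: because keys are built with tostring, the number 1 and the string "1" end up in the same bucket.

```shell
# Hypothetical sample input: one JSON object per line.
data='{"complianceState":"compliant"}
{"complianceState":"unknown"}
{"complianceState":"compliant"}'

# Same reduce as bow, written inline. -n keeps jq from consuming the
# first input on its own, so `inputs` yields all three records.
counts=$(printf '%s\n' "$data" |
  jq -n -c 'reduce (inputs | .complianceState) as $s ({}; .[$s | tostring] += 1)')
echo "$counts"    # {"compliant":2,"unknown":1}

# Collision caveat: 1 and "1" both map to the key "1".
collide=$(jq -n -c 'reduce (1, "1") as $w ({}; .[$w | tostring] += 1)')
echo "$collide"   # {"1":2}
```

If the field only ever holds strings, as complianceState appears to here, the collision does not arise.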

Putting it all together

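One way to bring all these pieces together is to put the jq lines above (the definition of bow followed by the filter that calls it) in a file, say bow.jq, and then invoke jq as follows: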

jq -n -f bow.jq all.json

Another way is to use the module system; see the jq manual and/or Cookbook for details.
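A minimal sketch of the module route, assuming bow.jq contains only the def (no trailing filter) and lives in a directory passed with -L. The temporary directory, the sample all.json, and the variable names are made up for this illustration.

```shell
# Scratch directory holding a module file and some sample input.
workdir=$(mktemp -d)

# bow.jq holds only the definition, so it can be pulled in with `include`.
cat > "$workdir/bow.jq" <<'EOF'
def bow(stream):
  reduce stream as $word ({}; .[($word|tostring)] += 1);
EOF

# Hypothetical sample input, one JSON object per line.
printf '%s\n' \
  '{"complianceState":"compliant"}' \
  '{"complianceState":"unknown"}' \
  '{"complianceState":"compliant"}' > "$workdir/all.json"

# -L adds the directory to the module search path; include "bow" loads bow.jq.
module_counts=$(jq -n -c -L "$workdir" \
  'include "bow"; bow(inputs | .complianceState)' "$workdir/all.json")
echo "$module_counts"   # {"compliant":2,"unknown":1}

rm -rf "$workdir"
```

Compared with -f, this keeps the reusable definition separate from the one-off filter that calls it.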

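Here is the solution I use. It is a custom frequency function that:

  • buckets (bins) an array of JSON values/objects by a jq expression (the bucket key),
  • provides each bucket's count (frequency),
  • provides each bucket's percentage of the total number of items (rounded to two decimal places),
  • provides the original items that fell into each bucket,
  • sorts the buckets by count, descending.
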
def freq(expr):
  length as $total_count
    | group_by(expr)
    | map({
        key: (.[0] | expr),
        count: length,
        percent: (((length / $total_count * 10000 + 0.5) | floor) / 100),
        items: .
      })
    | sort_by(-.count)
  ;

For example, with the function above defined in my $HOME/.jq, the following query:

jq -n '
[
  {"complianceState": "a", "other": 0.5},
  {"complianceState": "b", "other": 1.2},
  {"complianceState": "a", "other": 1.7},
  {"complianceState": "c", "other": 5.3},
  {"complianceState": "b", "other": 1.5},
  {"complianceState": "e", "other": 0.6},
  {"complianceState": "c", "other": 3.4},
  {"complianceState": "c", "other": 5.9}
] | freq(.complianceState)'

produces

[
  {
    "key": "c",
    "count": 3,
    "percent": 37.5,
    "items": [
      {"complianceState": "c", "other": 5.3},
      {"complianceState": "c", "other": 3.4},
      {"complianceState": "c", "other": 5.9}
    ]
  },
  {
    "key": "a",
    "count": 2,
    "percent": 25,
    "items": [
      {"complianceState": "a", "other": 0.5},
      {"complianceState": "a", "other": 1.7}
    ]
  },
  {
    "key": "b",
    "count": 2,
    "percent": 25,
    "items": [
      {"complianceState": "b", "other": 1.2},
      {"complianceState": "b", "other": 1.5}
    ]
  },
  {
    "key": "e",
    "count": 1,
    "percent": 12.5,
    "items": [
      {"complianceState": "e", "other": 0.6}
    ]
  }
]

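For your case, you need to use -s to slurp the inputs into a single JSON array. From there, you can transform the output into the format you want. For example: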

jq -s 'freq(.complianceState)
  | map({key, value: .count})
  | from_entries
' all.json

Note that with the freq function you can group by an arbitrary expression, for example freq((.other / 1.5) | floor) if you want histogram-like binning.
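To make the binning concrete, here is a sketch that applies that expression to the eight sample records from the example above, with freq repeated inline so the snippet runs on its own. Bin k covers values of other in the range [1.5k, 1.5(k+1)). The tostring conversion and the -S flag are additions for this illustration: the first turns the numeric bin index into an explicit string key, the second sorts the keys so the output order is predictable.

```shell
# The sample records from the example above, one JSON object per line.
records='{"complianceState": "a", "other": 0.5}
{"complianceState": "b", "other": 1.2}
{"complianceState": "a", "other": 1.7}
{"complianceState": "c", "other": 5.3}
{"complianceState": "b", "other": 1.5}
{"complianceState": "e", "other": 0.6}
{"complianceState": "c", "other": 3.4}
{"complianceState": "c", "other": 5.9}'

# -s slurps the records into one array, -c prints compactly, -S sorts keys.
hist=$(printf '%s\n' "$records" | jq -s -c -S '
  # freq as defined above, repeated so this snippet is self-contained
  def freq(expr):
    length as $total_count
    | group_by(expr)
    | map({
        key: (.[0] | expr),
        count: length,
        percent: (((length / $total_count * 10000 + 0.5) | floor) / 100),
        items: .
      })
    | sort_by(-.count);

  freq((.other / 1.5) | floor)
  # bin indexes are numbers, so convert them to string keys explicitly
  | map({key: (.key | tostring), value: .count})
  | from_entries
')
echo "$hist"   # {"0":3,"1":2,"2":1,"3":2}
```

Three records fall in bin 0 (0.5, 1.2, 0.6), two in bin 1 (1.7, 1.5), one in bin 2 (3.4), and two in bin 3 (5.3, 5.9).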