How to group strings based on similarities in the string
I have a Splunk query that summarizes errors by frequency:
index="pc_1" LogLevel=ERROR
| eval Message=split(_raw,"|")
| stats count(LogLevel) as Frequency by Message
| sort -Frequency
It produces results of the following form:
Message | Frequency |
---|---|
No such user | 137 |
unable to deliver mail to example@email.com: Unable to reach server | 70 |
unable to deliver mail to example1@email.com: Unable to reach server | 43 |
unable to authenticate user 3456 | 8 |
unable to deliver mail to example2@email.com: Unable to reach server | 6 |
unable to authenticate user 2321 | 5 |
unable to authenticate user 13321 | 3 |
... | ... |
unable to deliver mail to examplen@email.com: Unable to reach server | 1 |
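As you can see in the results, some similar errors are split into separate rows because they differ only in the user's email address or ID.
I am looking for a way to group them based on string similarity. Currently I replace the variable parts of each message with a plain regular expression and then compute the frequency: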
index="pc_1" LogLevel=ERROR
| eval Message=split(_raw,"|")
| eval Message=replace(Message, "unable to deliver mail to .* Unable to reach server", "unable to deliver mail to [email]: Unable to reach server")
| eval Message=replace(Message, "unable to authenticate user \d+", "unable to authenticate user [userId]")
| stats count(LogLevel) as Frequency by Message
| sort -Frequency
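This approach works, but it is very cumbersome: there are many different types of errors, so I would have to inspect each one and write a regular expression for it.
Is there a way to improve the query so that it summarizes these errors more efficiently?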
Answer:
Perhaps the cluster command will help. It groups similar messages together.
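As a rough sketch of how cluster might slot into the query from the question: field, showcount, countfield and t are documented cluster options, but the threshold of 0.8 is only a starting value to tune, and the sketch assumes Message holds the single message string to compare. If split leaves Message multivalued, you would probably need to pick the message segment first, for example with mvindex.

```spl
index="pc_1" LogLevel=ERROR
| eval Message=split(_raw,"|")
| cluster field=Message showcount=true countfield=Frequency t=0.8
| table Message Frequency
| sort -Frequency
```

Each remaining row is one representative event per cluster, so the email address or user ID it shows is simply the one from that representative. Moving t closer to 1 makes the clusters stricter, and lowering it merges more messages together.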