Is there a ready-to-use library or software for unsupervised, multi-string pattern discovery?

strace is a command that traces system calls and signals. Sample output:

poll([{fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=30, events=POLLIN}, {fd=31, events=POLLIN}, {fd=104, events=POLLIN}], 5, 11) = 0 (Timeout)
recvmsg(30, {msg_namelen=0}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7f946e0c56e8, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7f946e0c5698, FUTEX_WAKE_PRIVATE, 1) = 1
recvmsg(30, {msg_namelen=0}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(31, {msg_namelen=0}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=30, events=POLLIN}, {fd=31, events=POLLIN}, {fd=104, events=POLLIN}], 5, 0) = 0 (Timeout)
recvmsg(30, {msg_namelen=0}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7f946e0c56e8, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7f946e0c5698, FUTEX_WAKE_PRIVATE, 1) = 1
recvmsg(30, {msg_namelen=0}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(31, {msg_namelen=0}, 0)         = -1 EAGAIN (Resource temporarily unavailable)

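The output is highly patterned. Is there ready-made software that can read the input above and identify the patterns without supervision, for example: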

====================================================
Patterns
====================================================

Pattern 1 (P1): {fd=, events=}

P2: P1
    where =POLLIN

P3: [P2a, P2b, P2c, P2d, P2e]
    where a.=, b.=, c.=, d.=, e.=

P4: poll(P3, , ) = 0 (Timeout)
    where P3.= P3.= P3.= P3.= P3.=

P6: recvmsg(, {msg_namelen=0}, 0)         = -1 EAGAIN (Resource temporarily unavailable)

P7: futex(, FUTEX_WAKE_PRIVATE, ) = 1

P8:
    P4a
    P6b
    P7c
    P7d
    P6e
    P6f
where a.= a.= ...


====================================================
Output
====================================================

P8 where =11, =12, ...
P8 where =11, =12, ...

Is there any ready-to-use, already implemented unsupervised tool like this out there?

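Compression algorithms are a good example: they work by finding repeated substrings in their input. Look at the theory and implementations of deflate, gzip, xz, and lzma.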
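
To make the compression idea concrete, here is a minimal Python sketch, not a finished tool. It makes two assumptions that are mine rather than part of any existing library: that decimal integers and hex addresses are the variable fields (the `VARIABLE` regex), and that the repeated block can be found as the shortest period of the resulting line templates. The helper names `template`, `shortest_period`, and `compression_ratio` are hypothetical. Only `zlib.compress` is a real library call; it uses deflate, so a low ratio is a quick hint that the trace is highly repetitive.

```python
import re
import zlib

# Assumption: hex addresses and decimal integers are the variable fields.
VARIABLE = re.compile(r"0x[0-9a-fA-F]+|\d+")

SAMPLE = [
    "poll([{fd=11, events=POLLIN}, {fd=12, events=POLLIN}], 5, 11) = 0 (Timeout)",
    "recvmsg(30, {msg_namelen=0}, 0)         = -1 EAGAIN (Resource temporarily unavailable)",
    "futex(0x7f946e0c56e8, FUTEX_WAKE_PRIVATE, 2147483647) = 1",
    "futex(0x7f946e0c5698, FUTEX_WAKE_PRIVATE, 1) = 1",
    "recvmsg(30, {msg_namelen=0}, 0)         = -1 EAGAIN (Resource temporarily unavailable)",
    "recvmsg(31, {msg_namelen=0}, 0)         = -1 EAGAIN (Resource temporarily unavailable)",
] * 2


def template(line):
    """Mask variable-looking fields with <*> and collapse runs of whitespace."""
    masked = VARIABLE.sub("<*>", line)
    return re.sub(r"\s+", " ", masked).strip()


def shortest_period(items):
    """Return the smallest p such that items[i] == items[i - p] for all i >= p."""
    for p in range(1, len(items) + 1):
        if all(items[i] == items[i - p] for i in range(p, len(items))):
            return p
    return len(items)


def compression_ratio(text):
    """Compressed size divided by original size, using zlib (deflate)."""
    data = text.encode("utf-8")
    return len(zlib.compress(data, 9)) / len(data)


if __name__ == "__main__":
    templates = [template(line) for line in SAMPLE]
    period = shortest_period(templates)
    print("distinct line templates:", len(set(templates)))
    print("lines per repeated block:", period)
    for t in templates[:period]:
        print("   ", t)
    print("zlib ratio: %.3f" % compression_ratio("\n".join(SAMPLE)))
```

This only recovers flat line templates and one repeating block. The nested patterns in the question (P1 inside P3 inside P4) would need something closer to grammar inference, which this sketch does not attempt.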