Mixing non-overlapping everygrams in order

I'm looking for a way to generate the sequences of everygrams, each of length at most n, that match an input sentence:

Given a sentence: "Break this into sequences" and n = 3

I want to create the sequences:

("Break", "this", "into", "sequences")
("Break", "this", "into sequences")
("Break", "this into", "sequences")
("Break this", "into", "sequences")
("Break this", "into sequences")
("Break", "this into sequences")
("Break this into", "sequences")

nltk has an everygrams function, but I'm not quite sure how to use it to reach my goal.

For simplicity, I've tried restating the problem in terms of characters, i.e.

It may help to think of these as character grams (and, as rici suggested, to space the characters out [shown with and without spacing for clarity]):

abcd goes to:

(a, b, c, d)       (a, b, c, d)
(a, b, c  d)       (a, b, cd)
(a, b  c, d)       (a, bc, d)
(a  b, c, d)       (ab, c, d)
(a  b, c  d)       (ab, cd)
(a, b  c  d)       (a, bcd)
(a  b  c, d)       (abc, d)

To be clear, this should generalize to any length, given n as the maximum n-gram size; so, for abcde with n=3 we have:

(a, b, c, d, e)     (a, b, c, d, e)
(a, b, c, d  e)     (a, b, c, de)
(a, b, c  d, e)     (a, b, cd, e)
(a, b  c, d, e)     (a, bc, d, e)
(a  b, c, d, e)     (ab, c, d, e)
(a, b  c, d  e)     (a, bc, de)
(a  b, c, d  e)     (ab, c, de)
(a  b, c  d, e)     (ab, cd, e)
(a, b, c  d  e)     (a, b, cde)
(a, b  c  d, e)     (a, bcd, e)
(a  b  c, d, e)     (abc, d, e)
(a  b, c  d  e)     (ab, cde)
(a  b  c, d  e)     (abc, de)

I think I might need to generate a grammar, something like:

exp ::= ABC, d | a, BCD
ABC ::= AB, c | A, BC
BCD ::= BC, d | b, CD
AB ::= A, b | a, B
BC ::= B, c | b, C
CD ::= C, d | c, D
A ::= a
B ::= b
C ::= c
D ::= d

and find all parses of the sentence, but surely there must be a procedural way to go about this?

Maybe it would help to space your example out a bit:

(a , b , c , d)
(a , b , c   d)
(a , b   c , d)
(a   b , c , d)
(a   b , c   d)
(a , b   c   d)
(a   b   c , d)
(a   b   c   d)  # added for completeness

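Looking at that, it's clear that what differentiates the rows is the presence or absence of commas, a typical binary choice. There are three places a comma could go, so there are eight possibilities, corresponding to the eight three-digit binary numbers.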

The easiest way to list these possibilities is to count from 0 0 0 to 1 1 1.
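The counting idea above can be sketched in Python roughly as follows. This is only an illustration: the name `all_splits` is made up for this example, and it assumes that bit i of the counter being set means "put a comma after element i".

```python
def all_splits(v):
    """Yield every way to cut v into contiguous, non-empty parts."""
    n = len(v)
    if n == 0:
        yield []
        return
    gaps = n - 1                      # places where a comma could go
    for mask in range(2 ** gaps):     # count from 0...0 to 1...1
        parts, start = [], 0
        for gap in range(gaps):
            # Bit `gap` set: cut between v[gap] and v[gap + 1].
            if (mask >> gap) & 1:
                parts.append(v[start:gap + 1])
                start = gap + 1
        parts.append(v[start:])       # whatever follows the last comma
        yield parts


for parts in all_splits("abcd"):
    print(parts)
```

For a four-element input this visits all eight comma patterns, including the single uncut part.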


For your modified problem, in which a part has a maximum length, a simple recursive solution in Python is:

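def kgram(k, v):
    'Generate all partitions of v with parts no larger than k'
    def helper(sfx, m):
        # sfx holds the parts already chosen for v[m:]; v[:m] is still unsplit.
        if m == 0: yield sfx
        else:
            # Take the last i items of v[:m] as the next part, then recurse on the rest.
            for i in range(1, min(k, m)+1):
                yield from helper([v[m-i:m]]+sfx, m-i)

    yield from helper([], len(v))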

Here's a quick test:

>>> for p in kgram(3, 'one two three four five'.split()): print(p)
... 
[['one'], ['two'], ['three'], ['four'], ['five']]
[['one', 'two'], ['three'], ['four'], ['five']]
[['one'], ['two', 'three'], ['four'], ['five']]
[['one', 'two', 'three'], ['four'], ['five']]
[['one'], ['two'], ['three', 'four'], ['five']]
[['one', 'two'], ['three', 'four'], ['five']]
[['one'], ['two', 'three', 'four'], ['five']]
[['one'], ['two'], ['three'], ['four', 'five']]
[['one', 'two'], ['three'], ['four', 'five']]
[['one'], ['two', 'three'], ['four', 'five']]
[['one', 'two', 'three'], ['four', 'five']]
[['one'], ['two'], ['three', 'four', 'five']]
[['one', 'two'], ['three', 'four', 'five']]
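
To get output in the shape the question asks for (tuples of space-joined strings), the parts can be joined back together afterwards. The following is a minimal sketch: `kgram` is repeated from the answer so the block runs on its own, and `joined_partitions` is a hypothetical helper name, not something from nltk.

```python
def kgram(k, v):
    """Same generator as in the answer, repeated so this block is self-contained."""
    def helper(sfx, m):
        if m == 0:
            yield sfx
        else:
            for i in range(1, min(k, m) + 1):
                yield from helper([v[m - i:m]] + sfx, m - i)
    yield from helper([], len(v))


def joined_partitions(k, sentence):
    """Hypothetical helper: split on whitespace, partition, then re-join each part."""
    words = sentence.split()
    for parts in kgram(k, words):
        yield tuple(" ".join(part) for part in parts)


for seq in joined_partitions(3, "Break this into sequences"):
    print(seq)
```

With n = 3 this should produce the seven sequences listed in the question, though not necessarily in the same order.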