Mixing non-overlapping everygrams in order

Question

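I am looking for a way to generate sequences, built from everygrams of length up to n, that match an input sentence:

Given the sentence "Break this into sequences" and n = 3

I want to create the sequences: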

("Break", "this", "into", "sequences")
("Break", "this", "into sequences")
("Break", "this into", "sequences")
("Break this", "into", "sequences")
("Break this", "into sequences")
("Break", "this into sequences")
("Break this into", "sequences")

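nltk has an everygrams function, but I'm not quite sure how to use it to achieve my goal.

For simplicity, I have tried adapting the problem to focus on characters, i.e.

It may help to think of these as character grams (and, as rici suggested, to space the characters out [shown both with and without spacing for clarity]):

abcd goes to: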

(a, b, c, d)       (a, b, c, d)
(a, b, c  d)       (a, b, cd)
(a, b  c, d)       (a, bc, d)
(a  b, c, d)       (ab, c, d)
(a  b, c  d)       (ab, cd)
(a, b  c  d)       (a, bcd)
(a  b  c, d)       (abc, d)

To be clear, this should generalise to any length, given n as the maximum n-gram size; so, for abcde and n=3 we have:

(a, b, c, d, e)     (a, b, c, d, e)
(a, b, c, d  e)     (a, b, c, de)
(a, b, c  d, e)     (a, b, cd, e)
(a, b  c, d, e)     (a, bc, d, e)
(a  b, c, d, e)     (ab, c, d, e)
(a, b  c, d  e)     (a, bc, de)
(a  b, c, d  e)     (ab, c, de)
(a  b, c  d, e)     (ab, cd, e)
(a, b, c  d  e)     (a, b, cde)
(a, b  c  d, e)     (a, bcd, e)
(a  b  c, d, e)     (abc, d, e)
(a  b, c  d  e)     (ab, cde)
(a  b  c, d  e)     (abc, de)

I think I might need to generate a grammar, such as:

exp ::= ABC, d | a, BCD
ABC ::= AB, c | A, BC
BCD ::= BC, d | b, CD
AB ::= A, b | a, B
BC ::= B, c | b, C
CD ::= C, d | c, D
A ::= a
B ::= b
C ::= c
D ::= d

and find all parses of the sentence, but surely there must be a procedural way to solve this?

Answer 1

Maybe it would help to space out your example a bit:

(a , b , c , d)
(a , b , c   d)
(a , b   c , d)
(a   b , c , d)
(a   b , c   d)
(a , b   c   d)
(a   b   c , d)
(a   b   c   d)  # added for completeness

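Looking at that, it's evident that what differentiates the rows is the presence or absence of commas, a typical binary choice. There are three places a comma could go, so there are eight possibilities, corresponding to the eight binary numbers of three digits.

The easiest way to list these possibilities is to count from 0 0 0 to 1 1 1.

For your modified problem, in which there is a maximum length of a part, a simple recursive solution in Python is: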
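As a rough illustration of the counting idea, here is a minimal sketch. The names `splits_by_counting` and `splits_with_max_part` are made up for this example; each bit of the counter decides whether a comma goes into the corresponding gap.

```python
def splits_by_counting(items):
    """Yield every way to cut `items` into contiguous, ordered parts."""
    gaps = len(items) - 1  # number of places a comma could go
    if gaps < 0:
        return  # an empty input has no parts
    for mask in range(2 ** gaps):  # count from 00...0 to 11...1
        parts = []
        start = 0
        for gap in range(gaps):
            # Bit `gap` of `mask` says whether a comma follows items[gap].
            if (mask >> gap) & 1:
                parts.append(items[start:gap + 1])
                start = gap + 1
        parts.append(items[start:])  # whatever is left after the last comma
        yield parts


def splits_with_max_part(items, k):
    """Keep only the splits whose parts all have length k or less."""
    return [p for p in splits_by_counting(items)
            if all(len(part) <= k for part in p)]


for parts in splits_by_counting("abcd"):
    print(", ".join(parts))
```

Filtering afterwards, as `splits_with_max_part` does, still walks through all 2^(len-1) candidates, so it is only reasonable for short inputs. A recursive approach avoids generating the rejected ones in the first place.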

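def kgram(k, v):
    """Generate all partitions of v with parts no larger than k."""
    def helper(sfx, m):
        # sfx holds the parts already chosen for v[m:]; v[:m] is still unsplit.
        if m == 0:
            yield sfx
        else:
            # Take the last i items of v[:m] as the next part, then recurse.
            for i in range(1, min(k, m) + 1):
                yield from helper([v[m-i:m]] + sfx, m - i)

    yield from helper([], len(v))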

Here's a quick test:

>>> for p in kgram(3, 'one two three four five'.split()): print(p)
... 
[['one'], ['two'], ['three'], ['four'], ['five']]
[['one', 'two'], ['three'], ['four'], ['five']]
[['one'], ['two', 'three'], ['four'], ['five']]
[['one', 'two', 'three'], ['four'], ['five']]
[['one'], ['two'], ['three', 'four'], ['five']]
[['one', 'two'], ['three', 'four'], ['five']]
[['one'], ['two', 'three', 'four'], ['five']]
[['one'], ['two'], ['three'], ['four', 'five']]
[['one', 'two'], ['three'], ['four', 'five']]
[['one'], ['two', 'three'], ['four', 'five']]
[['one', 'two', 'three'], ['four', 'five']]
[['one'], ['two'], ['three', 'four', 'five']]
[['one', 'two'], ['three', 'four', 'five']]
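
To get tuples of joined strings, as in the question, the parts can be joined afterwards. The following is only a sketch: `joined_sequences` is a hypothetical helper name, and `kgram` is repeated so that the snippet runs on its own.

```python
def kgram(k, v):
    """Generate all partitions of v into contiguous parts no longer than k."""
    def helper(sfx, m):
        if m == 0:
            yield sfx
        else:
            for i in range(1, min(k, m) + 1):
                yield from helper([v[m - i:m]] + sfx, m - i)

    yield from helper([], len(v))


def joined_sequences(sentence, n):
    """Return each partition of the sentence as a tuple of space-joined grams."""
    words = sentence.split()  # assumes whitespace tokenisation is good enough
    return [tuple(" ".join(part) for part in parts)
            for parts in kgram(n, words)]


for seq in joined_sequences("Break this into sequences", 3):
    print(seq)
```

With n = 3 and a four-word sentence this gives seven sequences; the single four-word gram is left out because it is longer than n.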
