Mixing non-overlapping everygrams in order
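I'm looking for a way to generate sequences, built from everygrams of length up to n, that match the input sentence.
Given the sentence: "Break this into sequences"
and n = 3
I want to create the sequences: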
("Break", "this", "into", "sequences")
("Break", "this", "into sequences")
("Break", "this into", "sequences")
("Break this", "into", "sequences")
("Break this", "into sequences")
("Break", "this into sequences")
("Break this into", "sequences")
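nltk has the everygram package, but I'm not quite sure how I'd use it to achieve my goal.
For simplicity, I've tried adapting the problem to focus on characters. It might help to think of these as character-grams (and, as rici suggested, to space the characters out [shown with and without spacing for clarity]):
abcd
goes to: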
(a, b, c, d) (a, b, c, d)
(a, b, c d) (a, b, cd)
(a, b c, d) (a, bc, d)
(a b, c, d) (ab, c, d)
(a b, c d) (ab, cd)
(a, b c d) (a, bcd)
(a b c, d) (abc, d)
For clarity, this should generalise to any length, given n as the maximum n-gram size; so, for abcde and n=3 we have:
(a, b, c, d, e) (a, b, c, d, e)
(a, b, c, d e) (a, b, c, de)
(a, b, c d, e) (a, b, cd, e)
(a, b c, d, e) (a, bc, d, e)
(a b, c, d, e) (ab, c, d, e)
(a, b c, d e) (a, bc, de)
(a b, c, d e) (ab, c, de)
(a b, c d, e) (ab, cd, e)
(a, b, c d e) (a, b, cde)
(a, b c d, e) (a, bcd, e)
(a b c, d, e) (abc, d, e)
(a b, c d e) (ab, cde)
(a b c, d e) (abc, de)
I was thinking I might need to generate a grammar, something like:
exp ::= ABC, d | a, BCD
ABC ::= AB, c | A, BC
BCD ::= BC, d | b, CD
AB ::= A, b | a, B
BC ::= B, c | b, C
CD ::= C, d | c, D
A ::= a
B ::= b
C ::= c
D ::= d
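and find all parses of the sentence, but surely there must be a procedural way of going about this?
Maybe it would help to space out your example a bit: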
(a , b , c , d)
(a , b , c d)
(a , b c , d)
(a b , c , d)
(a b , c d)
(a , b c d)
(a b c , d)
(a b c d) # added for completeness
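Looking at that, it's evident that what differentiates the rows is the presence or absence of commas, a typical binary choice. There are three places where a comma could go, so there are eight possibilities, corresponding to the eight binary numbers of three digits.
The easiest way to list these possibilities is to count from 0 0 0 to 1 1 1.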
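The counting idea can be sketched in a few lines of Python. This is only an illustration: the name all_splits is made up here, and it assumes a non-empty string or list. Each bit of the counter says whether a comma goes in the corresponding gap, with the leftmost bit standing for the first gap.

```python
def all_splits(tokens):
    """Yield every way to cut tokens into contiguous parts (hypothetical helper)."""
    gaps = max(len(tokens) - 1, 0)        # places where a comma could go
    for mask in range(1 << gaps):         # count from 0...0 up to 1...1
        parts, start = [], 0
        for gap in range(gaps):
            # leftmost bit of the mask corresponds to the first gap
            if (mask >> (gaps - 1 - gap)) & 1:
                parts.append(tokens[start:gap + 1])
                start = gap + 1
        parts.append(tokens[start:])      # whatever is left after the last comma
        yield parts

for parts in all_splits("abcd"):
    print(parts)
```

For "abcd" this produces eight splits. Dropping any split that contains a part longer than 3 leaves the seven rows listed in the question.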
For your modified question, in which there is a maximum length for a part, one simple recursive solution in Python is:
def kgram(k, v):
    'Generate all partitions of v with parts no larger than k'
    def helper(sfx, m):
        # sfx holds the parts already chosen for v[m:]; v[:m] is still to be split
        if m == 0: yield sfx
        else:
            for i in range(1, min(k, m)+1):
                yield from helper([v[m-i:m]]+sfx, m-i)
    yield from helper([], len(v))
Here's a quick test:
>>> for p in kgram(3, 'one two three four five'.split()): print(p)
...
[['one'], ['two'], ['three'], ['four'], ['five']]
[['one', 'two'], ['three'], ['four'], ['five']]
[['one'], ['two', 'three'], ['four'], ['five']]
[['one', 'two', 'three'], ['four'], ['five']]
[['one'], ['two'], ['three', 'four'], ['five']]
[['one', 'two'], ['three', 'four'], ['five']]
[['one'], ['two', 'three', 'four'], ['five']]
[['one'], ['two'], ['three'], ['four', 'five']]
[['one', 'two'], ['three'], ['four', 'five']]
[['one'], ['two', 'three'], ['four', 'five']]
[['one', 'two', 'three'], ['four', 'five']]
[['one'], ['two'], ['three', 'four', 'five']]
[['one', 'two'], ['three', 'four', 'five']]
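To get output in the shape shown in the question (tuples of space-joined strings rather than lists of word lists), the parts can be joined afterwards. The following is a sketch under that assumption: joined_partitions is a name invented for this example, and kgram is repeated only so the snippet runs on its own.

```python
def kgram(k, v):
    """Same recursion as above: partitions of v with parts no longer than k."""
    def helper(sfx, m):
        if m == 0:
            yield sfx
        else:
            for i in range(1, min(k, m) + 1):
                yield from helper([v[m - i:m]] + sfx, m - i)
    yield from helper([], len(v))

def joined_partitions(sentence, n):
    """Hypothetical helper: each partition as a tuple of space-joined parts."""
    words = sentence.split()
    return [tuple(" ".join(part) for part in p) for p in kgram(n, words)]

for seq in joined_partitions("Break this into sequences", 3):
    print(seq)
```

With n = 3 this yields the seven sequences from the question, although not in the same order as they are listed there.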