Generate all possible substrings in sequence

I am looking for a library, or an efficient way, to do the following in Python:

Input: 
"He was hungry"

Desired Output:
[["He", "was", "hungry"],
 ["He was", "hungry"],
 ["He", "was hungry"],
 ["He was hungry"]]

Since you mentioned n-grams, this may be what you are after:

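from sklearn.feature_extraction.text import CountVectorizer

s = "He was hungry"
# The longest possible n-gram spans every word, so cap the range at the
# word count rather than the character count.
c = CountVectorizer(ngram_range=(1, len(s.split()))).fit([s])
c.vocabulary_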

{'he': 0,
 'was': 4,
 'hungry': 3,
 'he was': 1,
 'was hungry': 5,
 'he was hungry': 2}
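
For comparison, the same n-grams can be produced without scikit-learn. This is only a minimal sketch, not part of the answer above: the helper name `word_ngrams` is made up for illustration, and unlike `CountVectorizer` it does not lowercase the tokens. Note that it lists every contiguous run of words, which is not the same thing as the list of splits asked for in the question.

```python
def word_ngrams(sentence):
    """Return every contiguous run of words in `sentence` (hypothetical helper)."""
    words = sentence.split()
    return [
        " ".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
    ]

print(word_ngrams("He was hungry"))
```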

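Here is a recursive approach: for an input of N words, compute the possible joinings of the first N-1 words, then choose whether to append the last word as its own element or to join it onto the rightmost element.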

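def iter_joinings(items):
    if len(items) == 0:
        # Nothing to join, so yield no results at all.
        return
    elif len(items) == 1:
        yield items
    else:
        right = items[-1]
        for left_a in iter_joinings(items[:-1]):
            # Copy before mutating so both choices start from the same prefix.
            left_b = left_a.copy()
            # Choice 1: the last word becomes its own element.
            left_a.append(right)
            yield left_a
            # Choice 2: the last word is joined onto the rightmost element.
            left_b[-1] = left_b[-1] + " " + right
            yield left_b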

s = "He was hungry"
for result in iter_joinings(s.split()):
    print(result)

Result:

['He', 'was', 'hungry']
['He', 'was hungry']
['He was', 'hungry']
['He was hungry']

Here is an iterative version, in case you have a 999-element input and don't want to hit Python's maximum recursion depth:

import itertools

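def iter_joinings(items):
    if not items:
        # Match the recursive version: empty input yields nothing
        # (and avoids calling product() with repeat=-1).
        return
    # One decision per gap between words: True = start a new element,
    # False = join onto the current one.
    for decisions in itertools.product((False, True), repeat=len(items)-1):
        result = [items[0]]
        for idx, should_append in enumerate(decisions, 1):
            if should_append:
                result.append(items[idx])
            else:
                result[-1] = result[-1] + " " + items[idx]
        yield result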

s = "He was hungry"
for result in iter_joinings(s.split()):
    print(result)

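...although an input that large has 2^998 (roughly 10^300) possible results, so either version would need on the order of 10^300 bytecode instructions to run, and recursion depth is unlikely to be the practical problem.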
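
To see where that estimate comes from: each of the N-1 gaps between words is either a split or a join, so there are 2^(N-1) results. The following is a quick sanity check rather than anything from the answer itself; the name `count_joinings` is made up for illustration, and it counts by brute force, so it is only usable for small N.

```python
import itertools

def count_joinings(n_words):
    """Count results by enumerating one split/join decision per gap."""
    if n_words == 0:
        return 0
    return sum(1 for _ in itertools.product((False, True), repeat=n_words - 1))

# The brute-force count matches 2**(N-1) for small inputs.
for n in range(1, 6):
    assert count_joinings(n) == 2 ** (n - 1)

# Number of decimal digits in 2**998, the count for a 999-word input.
print(len(str(2 ** 998)))
```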

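def f(a):
    if not a:
        # Base case: the only way to split nothing is the empty split.
        yield []
    for i in range(len(a)):
        # Take the first i+1 words as one piece, then split the rest.
        for c in f(a[i+1:]):
            yield [" ".join(a[:i+1]), *c]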


s = "He was hungry"
print(list(f(s.split())))

[['He', 'was', 'hungry'], ['He', 'was hungry'], ['He was', 'hungry'], ['He was hungry']]
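
Another way to look at all of these answers: a result is fully determined by which of the N-1 gaps between words are cut. The sketch below is not taken from any of the answers above, and the name `splits_by_cut_points` is made up for illustration. It picks the cut positions with `itertools.combinations`, so the results come out grouped by number of pieces instead of in the order shown earlier.

```python
import itertools

def splits_by_cut_points(words):
    """Yield every way to split `words` into contiguous, space-joined pieces."""
    if not words:
        return
    gaps = range(1, len(words))  # a cut at position k falls just before words[k]
    for n_cuts in range(len(words)):
        for cuts in itertools.combinations(gaps, n_cuts):
            bounds = [0, *cuts, len(words)]
            yield [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]

for pieces in splits_by_cut_points("He was hungry".split()):
    print(pieces)
```

Grouping by cut count can be handy if only splits with a given number of pieces are wanted, since the inner loop can then be run for a single value of `n_cuts`.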