Creating sub-lists from a list of strings containing a given maximum number of words

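I have a list of strings. I want to split it into sub-lists made up of strings from the original list, such that each sub-list contains at most 16 words in total and every sub-list except the first starts with the last string of the previous sub-list.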

For example, suppose my list is the following: 5 strings, each with a different number of words.

qq = ['blended e learning forumin planning', 'difficulties of learning as forigen language', 'difficulties of grammar', 'students difficulties in grammar', 'difficulties of english grammar']

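I want to create sub-lists that satisfy the conditions above: at most 16 words each, and each one (except the first) has the preceding string as its first element. Here there would be only two sub-lists, so my output would be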

q1 = ['blended e learning forumin planning', 'difficulties of learning as forigen language', 'difficulties of grammar']
q2 = ['difficulties of grammar', 'students difficulties in grammar', 'difficulties of english grammar']

This is what I have tried. Is it correct, and is there a better way to do it? I have millions of lists to process this way.

qq = ['blended e learning forumin planning', 'difficulties of learning as forigen language', 'difficulties of grammar', 'students difficulties in grammar', 'difficulties of english grammar']

psz = 0
pi = 0
msz = 16
subqq = list()
qq_i = list()

for i in range(len(qq)):
    csz=psz+len(qq[i].split())
    if (csz>msz):
        subqq.append(qq_i.copy())
        qq_i.clear()
        qq_i.append(qq[i-1])
        qq_i.append(qq[i])
        psz = 0
    else:
        qq_i.append(qq[i])
        psz += len(qq[i].split())

subqq.append(qq_i)

Mine is quite similar to yours and the algorithm is basically the same, but I believe this should run slightly faster:

def fn(lst, n):
    word_count = 0
    res = []
    temp_lst = []

    for item in lst:
        len_current_item = len(item.split())
        word_count += len_current_item

        if word_count < n:
            temp_lst.append(item)

        else:
            res.append(temp_lst)
            last_item = res[-1][-1]
            temp_lst = [last_item, item]
            word_count = len_current_item + len(last_item.split())

    res.append(temp_lst)

    # Check the last item's length, as Phydeaux pointed out in the comments.
    if word_count > n:
        res.append([temp_lst.pop()])

    return res

Output:

['blended e learning forumin planning', 'difficulties of learning as forigen language', 'difficulties of grammar']
['difficulties of grammar', 'students difficulties in grammar', 'difficulties of english grammar']

I tried to avoid the copy and clear calls and made a few other small changes.
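If you want to sanity-check an implementation against the rules in the question, a small checker along these lines may help. It is only a sketch: `check_sublists` is a name I made up, and it assumes a "word" is a whitespace-separated token, as in the code above.

```python
def check_sublists(original, sublists, max_words=16):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for idx, sub in enumerate(sublists):
        # Rule 1: the total word count of a sub-list must not exceed the limit.
        total = sum(len(s.split()) for s in sub)
        if total > max_words:
            problems.append(f"sub-list {idx} has {total} words")
        # Rule 2: every sub-list after the first starts with the
        # last string of the previous sub-list.
        if idx > 0:
            prev = sublists[idx - 1]
            if not sub or not prev or sub[0] != prev[-1]:
                problems.append(f"sub-list {idx} does not start with the previous last string")
    # Rule 3: dropping the carried-over first element of each later
    # sub-list should give back the original list.
    flattened = list(sublists[0]) if sublists else []
    for sub in sublists[1:]:
        flattened.extend(sub[1:])
    if flattened != list(original):
        problems.append("strings are missing, duplicated or reordered")
    return problems


qq = ['blended e learning forumin planning',
      'difficulties of learning as forigen language',
      'difficulties of grammar',
      'students difficulties in grammar',
      'difficulties of english grammar']

expected = [qq[0:3], qq[2:5]]
print(check_sublists(qq, expected))  # prints []
```

Passing the result of any of the functions in this thread as `sublists` returns the list of rule violations, if any.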

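Here is what I came up with. It is similar to your algorithm and to SorousH Bakhtiary's answer, but it should be free of word-count errors, and I think it is easier to read.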

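It also raises an error if we start a new sub-list with the last phrase of the previous one and still cannot add the next phrase without breaking the word limit. With the default limit of 16, this can happen when two consecutive phrases each have more than 8 words; if you can be sure that will never happen, you can omit that part.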

def count_words(phrase):
    return len(phrase.split())


def sublists_with_max_words(main_list, max_words=16):
    output_sublists = []

    current_sublist = []
    current_sublist_words = 0

    for phrase in main_list:
        words_in_phrase = count_words(phrase)

        if (current_sublist_words + words_in_phrase) > max_words:
            # If we cannot add the phrase to the sublist without breaking
            # the word limit, then add the sublist to the output
            output_sublists.append(current_sublist)

            # Start a new sublist with the last phrase we added
            last_phrase = current_sublist[-1]
            current_sublist = [last_phrase]
            current_sublist_words = count_words(last_phrase)

            # If we cannot add the phrase to the new sublist either, then raise
            # an exception as we cannot continue without breaking the word limit
            if (current_sublist_words + words_in_phrase) > max_words:
                raise ValueError(
                    f"Cannot add '{phrase}' ({words_in_phrase} words) to a new"
                    f" sublist with {current_sublist_words} words"
                )

        # Add the current phrase to the sublist
        current_sublist.append(phrase)
        current_sublist_words += words_in_phrase

    # At the end of the loop, add the working sublist to the output
    output_sublists.append(current_sublist)

    return output_sublists


print(sublists_with_max_words(qq))
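To illustrate the error case, here is a small self-contained demo. It repeats the function in condensed form so the snippet runs on its own; the two 9-word phrases are made up for the example.

```python
def sublists_with_max_words(main_list, max_words=16):
    # Condensed copy of the function above, same logic.
    output, current, current_words = [], [], 0
    for phrase in main_list:
        n = len(phrase.split())
        if current_words + n > max_words:
            output.append(current)
            last = current[-1]
            current, current_words = [last], len(last.split())
            if current_words + n > max_words:
                raise ValueError(
                    f"Cannot add '{phrase}' ({n} words) to a new"
                    f" sublist with {current_words} words"
                )
        current.append(phrase)
        current_words += n
    output.append(current)
    return output


# Two consecutive phrases of 9 words each: 9 + 9 > 16, so the second
# phrase cannot share a sub-list with the first one.
long_phrases = [
    "one two three four five six seven eight nine",
    "ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen",
]

try:
    sublists_with_max_words(long_phrases)
except ValueError as exc:
    print(f"ValueError: {exc}")
```

Whether to raise, skip the phrase, or emit it in a sub-list of its own depends on what the downstream code expects; raising just makes the problem visible.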

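Since others have already provided working solutions, here is another interesting approach, based on a mapping between each qq element and its cumulative word count.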

First, create the mapping dictionary:

qq_map = {q: len(" ".join(qq[:n+1]).split()) for n, q in enumerate(qq)}

# {'blended e learning forumin planning': 5, 'difficulties of learning as forigen language': 11,
# 'difficulties of grammar': 14, 'students difficulties in grammar': 18,
# 'difficulties of english grammar': 22}

Then use the mapping to build the grouped list:

qq = [[q for q in qq if qq_map[q] in range(i*16, (i+1)*16)]
     for i in range(-(-qq_map[qq[-1]] // 16))]

# [['blended e learning forumin planning', 'difficulties of learning as forigen language', 'difficulties of grammar'], 
# ['students difficulties in grammar', 'difficulties of english grammar']]

Note: -(-qq_map[qq[-1]] // 16) is equivalent to math.ceil(qq_map[qq[-1]] / 16). You can use that instead if you'd like a more readable, less 'arithmetic' expression.
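A quick check of that equivalence, using the total from the example (22 words) and the group size of 16. The variable names are only for illustration.

```python
import math

total_words = 22  # cumulative word count of the last string in the example
group_size = 16

# Floor division rounds towards negative infinity, so negating twice
# turns it into a ceiling division that stays in integer arithmetic.
ceil_by_floor = -(-total_words // group_size)
ceil_by_math = math.ceil(total_words / group_size)

print(ceil_by_floor, ceil_by_math)  # prints 2 2
```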

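Finally, process the list once more so that the last string of each group is also placed at the start of the next group (except, of course, for the first group):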

qq = [[qq[i-1][-1]] + qq[i] if i != 0 else qq[i] for i in range(len(qq))]

# [['blended e learning forumin planning', 'difficulties of learning as forigen language', 'difficulties of grammar'], 
# ['difficulties of grammar', 'students difficulties in grammar', 'difficulties of english grammar']]
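For the "millions of lists" case, the same bucketing idea can be written as a single pass with itertools.accumulate, which avoids re-joining the prefix for every element and does not use the strings as dictionary keys. This is only a sketch: `group_by_cumulative_words` is a made-up name, and it keeps the same bucketing rule as above, so the buckets are fixed 16-word windows over the running total and the carried-over string is not counted towards the limit.

```python
from itertools import accumulate


def group_by_cumulative_words(strings, size=16):
    # Running total of words after each string, e.g. 5, 11, 14, 18, 22.
    running = accumulate(len(s.split()) for s in strings)

    # Bucket index = running total // size, the same rule as
    # `qq_map[q] in range(i*16, (i+1)*16)` above.
    buckets = {}
    for s, total in zip(strings, running):
        buckets.setdefault(total // size, []).append(s)

    # Running totals never decrease, so the keys were inserted in
    # ascending order and empty buckets simply never appear.
    groups = list(buckets.values())

    # Prepend the last string of the previous group to every later group.
    return [groups[i] if i == 0 else [groups[i - 1][-1]] + groups[i]
            for i in range(len(groups))]


qq = ['blended e learning forumin planning',
      'difficulties of learning as forigen language',
      'difficulties of grammar',
      'students difficulties in grammar',
      'difficulties of english grammar']

for group in group_by_cumulative_words(qq):
    print(group)
```

On the example list this gives the same two groups as the step-by-step version.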