Split text into chunks by ensuring the entireness of words

Question

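I have a bunch of text samples. Each sample has a different length, but all of them contain more than 200 characters. I need to split each sample into substrings of approximately 50 characters. To do so, I found this approach: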

import re

def chunkstring(string, length):
    return re.findall('.{%d}' % length, string)

However, it splits the text in the middle of words. For example, the phrase "I have <...> icecream.<...>" can be split into "I have <...> icec" and "ream.<...>".

This is the sample text:

This paper proposes a method that allows non-parallel many-to-many voice conversion by using a variant of a generative adversarial network called StarGAN.

I get this result:

['This paper proposes a method that allows non-paral',
 'lel many-to-many voice conversion by using a varia',
 'nt of a generative adversarial network called Star']

But ideally I would like to get something similar to this:

['This paper proposes a method that allows non-parallel',
 'many-to-many voice conversion by using a variant',
 'of a generative adversarial network called StarGAN.']

How can I adjust the code given above to get the desired result?

Answer 1

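You can use .{0,50}\S* so that each match keeps going through any trailing non-space characters (\S) and finishes the current word.

I specified 0 as the lower bound because otherwise you risk missing the last substring.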

See the demo here.

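Edit:

To exclude the trailing empty chunk, use .{1,50}\S* instead, which forces each match to contain at least one character.

If you also want to strip the surrounding spaces automatically, use \s*(.{1,50}\S*).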
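As a rough sketch, the last pattern could be dropped into a function shaped like the asker's chunkstring. The name chunk_words is hypothetical and not part of the original answer, and length acts as a soft limit here: a chunk may run past it by however many characters are needed to finish the current word.

```python
import re

text = ("This paper proposes a method that allows non-parallel many-to-many "
        "voice conversion by using a variant of a generative adversarial "
        "network called StarGAN.")

def chunk_words(string, length):
    # Hypothetical helper (the name is an assumption, not from the answer).
    # \s* skips whitespace between chunks, .{1,N} takes up to N characters,
    # and \S* extends the match to the end of the word it stopped in.
    pattern = r'\s*(.{1,%d}\S*)' % length
    return re.findall(pattern, string)

chunks = chunk_words(text, 50)
for chunk in chunks:
    print(len(chunk), repr(chunk))
# 53 'This paper proposes a method that allows non-parallel'
# 51 'many-to-many voice conversion by using a variant of'
# 48 'a generative adversarial network called StarGAN.'

# For comparison, the first pattern keeps the separating spaces inside the
# chunks and produces an empty string as its final match.
loose = re.findall(r'.{0,50}\S*', text)
print(loose[-1] == '')  # True
```

Note that . does not match newlines by default, so multi-line samples would need re.DOTALL or some whitespace normalisation first.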

Answer 2

To me this sounds like a task for the built-in textwrap module. For example, using your data:

import textwrap
text = "This paper proposes a method that allows non-parallel many-to-many voice conversion by using a variant of a generative adversarial network called StarGAN."
print(textwrap.fill(text,55))

Output

This paper proposes a method that allows non-parallel
many-to-many voice conversion by using a variant of a
generative adversarial network called StarGAN.

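You will probably need a few trials to find the width value that best suits your needs. If you need a list of strs rather than a single string, use textwrap.wrap, i.e. textwrap.wrap(text,55)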
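A minimal sketch of the textwrap.wrap variant, using the same sample text and the width of 55 from above. Unlike the regex approach, width is an upper bound here: with the default settings no returned line is longer than it.

```python
import textwrap

text = ("This paper proposes a method that allows non-parallel many-to-many "
        "voice conversion by using a variant of a generative adversarial "
        "network called StarGAN.")

# wrap() returns a list of lines instead of one newline-joined string.
lines = textwrap.wrap(text, width=55)
for line in lines:
    print(len(line), repr(line))
# 53 'This paper proposes a method that allows non-parallel'
# 53 'many-to-many voice conversion by using a variant of a'
# 46 'generative adversarial network called StarGAN.'
```

By default textwrap may also break a line after a hyphen inside a compound word, or split a word that is longer than width. If that is unwanted, the break_on_hyphens and break_long_words parameters can be set to False.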

python

python-re