How to textwrap.fill text, but prevent specific words from being at the start of the lines?

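I have Python code that reads long lines, wraps them if they are longer than x characters, and writes them to a new file. I figured out how to make sure words are not split, but I have a more specific problem: I don't want certain words to appear at the start of a line. After several hours of research I have realized that I went down the wrong path and need help.

This is my current code: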

import textwrap

# txtfile and testfile are assumed to be already-open file objects, and
# connectives the list of words that should not start a line.
with txtfile as infile, testfile as outfile:
    for line in infile:
        if len(line) > 80 and any(word in line[77:] for word in connectives):
            outfile.write(textwrap.fill(line,96,replace_whitespace=False))
        elif len(line) > 80 and not any(word in line[77:] for word in connectives):
            outfile.write(textwrap.fill(line,80,replace_whitespace=False))
        else:
            outfile.write(line)

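A bit of explanation of what I am trying to do: right now the code reads a line of a few hundred characters and, if it is longer than 80 characters, wraps it at 80. I thought I would check whether the last few characters of the line contain any of the words I am targeting and, if so, use a wider wrap for those lines so that the target word would not drop to the next line. But I realized my thinking was wrong (maybe moronic is the better word), because the if statement only checks the original line of a few hundred characters. It does not check the subsequent lines produced by the wrapping. In the end I can keep an unwanted word from starting the second line, but not the lines after it.

Since textwrap will not break up whole words if you don't want it to, I was hoping there is a way to tell it not to let certain words or characters drop to the next line.

Alternatively, maybe there is a way to read the wrapped content and, whenever a specific word appears as the first word of a line, move it to the end of the previous line.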
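That second idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `wrap_then_fix` is a hypothetical helper name, the text is wrapped with the standard `textwrap.wrap`, and a moved word is simply appended to the previous line, so that line may end up longer than `width` and the following line shorter.

```python
import textwrap

def wrap_then_fix(text, width=80, taboo=()):
    """Wrap text, then move taboo words off the start of each line."""
    taboo = set(taboo)
    lines = textwrap.wrap(text, width)
    i = 1  # The first line has no previous line to move a word onto.
    while i < len(lines):
        first, _, rest = lines[i].partition(' ')
        # Ignore trailing punctuation when comparing against taboo words.
        if first.strip('.,;:!?') not in taboo:
            i += 1
            continue
        lines[i - 1] += ' ' + first   # May push the previous line past width.
        if rest:
            lines[i] = rest           # Re-check this line on the next pass.
        else:
            del lines[i]              # The line held only the moved word.
    return '\n'.join(lines)

print(wrap_then_fix("aaa bbb ccc ddd", width=7, taboo=["ccc"]))
```

Because the lines are not re-wrapped after a word is moved, the result can look ragged; it only guarantees that no line after the first starts with a taboo word.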

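You might be able to hack textwrap into doing what you want, but in the meantime here is a snippet of code that does it. The basic word-wrapping code is an adaptation of an algorithm from the Wikipedia article titled: Line wrap and word wrap.

When a word that is not allowed at the start of the next line is encountered, it is simply added to the current line (which, technically, can make that line too long). If you find that unacceptable, this at least gives you a code base for trying other approaches.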

import re

def textsplitter(text):
    """Yield (word, suffix) pairs for each whitespace-delimited token."""
    # \S* (not \S+) so that single-character words are not skipped.
    for match_obj in re.finditer(r'\w+\S*', text):
        match_str = match_obj.group()
        # Separate the word from any trailing punctuation.
        submatch_obj = re.match(r'(\w+)(\S*)', match_str)
        yield submatch_obj.groups()

def textwrapper(text, width=79, taboo=()):
    taboo = set(taboo)  # Words that can't be first.
    result = []
    spaceleft = width

    for word, suffix in textsplitter(text):
        phrase = word + suffix  # Note suffix might be empty string ''.

        if word in taboo:   # Can't be first, so just add it.
            result.append(phrase)
            spaceleft = 0   # Forces a linebreak before the next word.
        else:               # Add word, possibly with an inserted linebreak.
            if len(phrase) > spaceleft:
                result.append('\n'+phrase)  # Insert linebreak before word.
                spaceleft = width - len(phrase)
            else:
                result.append(phrase)
                spaceleft = spaceleft - (len(phrase) + 1)

    # Joining with ' ' leaves a trailing space before each linebreak.
    return ' '.join(result)


sample_text = """\
Lorem ipsum dolor sit amet, consectetur adipiscing elit. In molestie lectus
nulla, at aliquam dolor suscipit ac. Mauris vitae purus non est vehicula dictum.
Integer varius diam tellus, quis cursus lacus sollicitudin sed. Nulla eu quam
nec felis egestas tristique eu placerat est. Praesent tincidunt libero in
aliquet euismod. Pellentesque eu odio mollis, consequat eros in, vestibulum
mauris. Aenean gravida dolor et ligula cursus laoreet.
"""

print('Wrapped with no taboo words:\n')
print(textwrapper(sample_text, 40))

print('\n'*2)
taboo = ['adipiscing', 'aliquam']  # Not allowed to appear at start of lines.
print('Wrapped again with taboo words {}:\n'.format(taboo))
print(textwrapper(sample_text, 40, taboo=taboo))

Output:

Wrapped with no taboo words:

Lorem ipsum dolor sit amet, consectetur
adipiscing elit. In molestie lectus
nulla, at aliquam dolor suscipit ac.
Mauris vitae purus non est vehicula
dictum. Integer varius diam tellus, quis
cursus lacus sollicitudin sed. Nulla eu
quam nec felis egestas tristique eu
placerat est. Praesent tincidunt libero
in aliquet euismod. Pellentesque eu odio
mollis, consequat eros in, vestibulum
mauris. Aenean gravida dolor et ligula
cursus laoreet.


Wrapped again with taboo words ['adipiscing', 'aliquam']:

Lorem ipsum dolor sit amet, consectetur adipiscing
elit. In molestie lectus nulla, at aliquam
dolor suscipit ac. Mauris vitae purus non
est vehicula dictum. Integer varius diam
tellus, quis cursus lacus sollicitudin
sed. Nulla eu quam nec felis egestas
tristique eu placerat est. Praesent
tincidunt libero in aliquet euismod.
Pellentesque eu odio mollis, consequat
eros in, vestibulum mauris. Aenean
gravida dolor et ligula cursus laoreet.
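
If the overlong lines are not acceptable, one other approach to try is to break the line one word earlier, so that the taboo word moves to the next line together with the word before it. The following is only a sketch of that idea, not part of the code above: `wrap_keep_width` and `_bare` are hypothetical names, tokens are split on whitespace, and a line can still exceed `width` when the carried-down words are themselves too long to fit together.

```python
import re

def _bare(token):
    """Strip leading and trailing punctuation from a token."""
    return re.sub(r'^\W+|\W+$', '', token)

def wrap_keep_width(text, width=79, taboo=()):
    taboo = set(taboo)  # Words that can't be first on a line.
    lines, current = [], []
    for token in text.split():
        if current and len(' '.join(current + [token])) > width:
            new_line = [token]
            # Pull words down from the end of the current line until the
            # new line no longer starts with a taboo word.
            while _bare(new_line[0]) in taboo and len(current) > 1:
                new_line.insert(0, current.pop())
            lines.append(' '.join(current))
            current = new_line
        else:
            current.append(token)
    if current:
        lines.append(' '.join(current))
    return '\n'.join(lines)

print(wrap_keep_width("aaa bbb ccc ddd", width=7, taboo=["ccc"]))
```

The trade-off is that lines ending just before a taboo word come out shorter than they otherwise would, in exchange for staying within the width in the common case.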