Preserve paragraph marks while capitalizing - RegEx

import re, codecs

# Inputfile holds the text to process (the code that reads it is not shown).
p = re.compile(r'((?<=[\.\?!]\s)(\w+)|(^\w+))')
def cap(match):
    return(match.group().capitalize())
capitalized_1 = p.sub(cap, Inputfile)

with codecs.open('o.txt', mode="w", encoding="utf_8") as file:
  file.write(capitalized_1)

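I am using a regex to capitalize the letter that follows a ., ? or !, and the code above does that. However, it strips the paragraph marks (page break, pilcrow) and merges everything into one large paragraph.

How can I preserve the paragraph marks and keep the text from clumping together?

Input file: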

on the insert tab, the galleries include items that are designed to coordinate with the overall look of your document. you can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. when you create pictures, charts, or diagrams, they also coordinate with your current document look.

you can easily change the formatting of selected text in the document text by choosing a look for the selected text from the quick styles gallery on the home tab. you can also format text directly by using the other controls on the home tab. most controls offer a choice of using the look from the current theme or using a format that you specify directly.

Current output:

On the insert tab, the galleries include items that are designed to coordinate with the overall look of your document. You can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. When you create pictures, charts, or diagrams, they also coordinate with your current document look. You can easily change the formatting of selected text in the document text by choosing a look for the selected text from the quick styles gallery on the home tab. You can also format text directly by using the other controls on the home tab. most controls offer a choice of using the look from the current theme or using a format that you specify directly.

Expected output:

On the insert tab, the galleries include items that are designed to coordinate with the overall look of your document. You can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. When you create pictures, charts, or diagrams, they also coordinate with your current document look.

You can easily change the formatting of selected text in the document text by choosing a look for the selected text from the quick styles gallery on the home tab. You can also format text directly by using the other controls on the home tab. Most controls offer a choice of using the look from the current theme or using a format that you specify directly.

Edit 1:

import re,codecs
def capitalize(match):
    return ''.join([match.group(1), match.group(2).capitalize()])

with codecs.open('i.txt', encoding='utf-8') as f:
    text = f.read()
    
pattern = re.compile(r'(^|[.?!]\s+)(\w+)?')

print(pattern.sub(capitalize, text))

When I read the text from a file, following the approach in answer 1, it throws an error:

return ''.join([match.group(1), match.group(2).capitalize()])
AttributeError: 'NoneType' object has no attribute 'capitalize'

You can do it like this:

import re


def capitalize(match):
    # group(2) is None when no word follows the punctuation (e.g. at the end of the text)
    return ''.join([match.group(1), (match.group(2) or '').capitalize()])

text = """on the insert tab, the galleries include items that are designed to coordinate with the overall look of your document. you can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. when you create pictures, charts, or diagrams, they also coordinate with your current document look.

you can easily change the formatting of selected text in the document text by choosing a look for the selected text from the quick styles gallery on the home tab. you can also format text directly by using the other controls on the home tab. most controls offer a choice of using the look from the current theme or using a format that you specify directly."""

pattern = re.compile(r'(^|[.?!]\s+)(\w+)?')

print(pattern.sub(capitalize, text))

Output

On the insert tab, the galleries include items that are designed to coordinate with the overall look of your document. You can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. When you create pictures, charts, or diagrams, they also coordinate with your current document look.

You can easily change the formatting of selected text in the document text by choosing a look for the selected text from the quick styles gallery on the home tab. You can also format text directly by using the other controls on the home tab. Most controls offer a choice of using the look from the current theme or using a format that you specify directly.

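Notes

  • (^|[.?!]\s+) captures either the start of the string (^) or one of ., ? or ! followed by one or more whitespace characters (space, tab, newline, and so on). Taken together, this group matches the start of a sentence. Because the newlines between paragraphs are part of this group, they are carried over into the output unchanged.
  • (\w+)? optionally captures one or more word characters. Because of the trailing ?, the group is None when no word follows the punctuation, for example when the text ends with a period and a newline.
  • The capitalize function then keeps whatever the first group matched and capitalizes the second group (the word).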
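
The AttributeError in the question most likely comes from the optional second group: text read from a file usually ends with a period followed by a newline, so the pattern matches there with no word after it and group 2 is None. Below is a minimal sketch of one way around that, assuming the word can simply be made mandatory in the pattern. The names capitalize_sentences and capitalize_file are hypothetical helpers introduced for illustration; the file names i.txt and o.txt are the ones used in the question.

```python
import re

# Group 1: start of the text, or sentence-ending punctuation plus the
# whitespace after it (including the blank lines between paragraphs).
# Group 2: the word to capitalize. It is required here (no trailing "?"),
# so a match always has a word and group 2 is never None.
PATTERN = re.compile(r'(^|[.?!]\s+)(\w+)')


def capitalize_sentences(text):
    """Capitalize the first word of each sentence; whitespace is left as-is."""
    def repl(match):
        return match.group(1) + match.group(2).capitalize()
    return PATTERN.sub(repl, text)


def capitalize_file(src, dst):
    """Hypothetical helper: read src, capitalize sentences, write to dst."""
    with open(src, encoding='utf-8') as f:
        text = f.read()
    with open(dst, mode='w', encoding='utf-8') as f:
        f.write(capitalize_sentences(text))


sample = "first one. second one.\n\nnew paragraph! last one.\n"
print(repr(capitalize_sentences(sample)))
# 'First one. Second one.\n\nNew paragraph! Last one.\n'
```

With the files from the question, the call would be capitalize_file('i.txt', 'o.txt'). Note that str.capitalize() also lowercases the rest of the word, which is harmless for all-lowercase input like the sample but would change a word such as "iPhone".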