Best way to replace sentences/paragraphs with a string in Python

Question

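How can I replace every sentence and paragraph in a text file with a <string> marker?

I want to keep the spacing, tabs, and lists in the text document unchanged:

Example input: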

Clause 1:

  a) detail 1. some more about detail 1. Here is more information about this paragraph right here. There is more information that we think sometimes.

  b) detail 2. some more about detail 2. and some more..

Example output:

<string>

  a) <string>

  b) <string>

Answer 1

Use the re module:

>>> import re
>>> text = 'Aaaaaaaaaaaaaaa,     to replace!\n to replace?\n\thelll34234ooooo'
>>> re.sub(r'(\w+)', '<string>', text)

It outputs:

'<string>,     <string> <string>!\n <string> <string>?\n\t<string>'

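re.sub means: replace every match of (\w+) in text, that is, every run of word characters, with <string>. Punctuation and whitespace do not match \w, so they are left untouched.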
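Note that \w+ replaces text word by word, while the question asks for one <string> per paragraph. A minimal sketch of a line-based variant follows, under two assumptions that are not stated in the question: each paragraph sits on a single line, and list markers look like "a)" or "1.". The names LINE and mask_lines are made up for this illustration.

```python
import re

# Assumption: one paragraph per line; list markers look like "a)" or "1.".
LINE = re.compile(
    r'^([ \t]*(?:(?:\d+\.|[a-z]\))[ \t]+)?)'  # group 1: indentation + optional marker
    r'\S.*$',                                  # the paragraph text to be replaced
    re.MULTILINE,
)

def mask_lines(text, token='<string>'):
    """Keep each line's indentation and list marker, replace the rest with token."""
    return LINE.sub(lambda m: m.group(1) + token, text)

sample = (
    'Clause 1:\n'
    '\n'
    '  a) detail 1. some more about detail 1.\n'
    '\n'
    '  b) detail 2. and some more..\n'
)
masked = mask_lines(sample)
print(masked)
```

Blank lines never match because the pattern requires at least one non-whitespace character, so paragraph spacing is preserved as-is.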

For a file:

main.py:

import re

with open('main.py', 'r') as input:
    text = input.read()
    print(text, '\n\n----------------\n')
    print(re.sub(r'(\w+)', '<string>', text))

Output:

import re

with open('main.py', 'r') as input:
    text = input.read()
    print(text, '\n\n----------------\n')
    print(re.sub(r'(\w+)', '<string>', text)) 

----------------

<string> <string>

<string> <string>('<string>.<string>', '<string>') <string> <string>:
    <string> = <string>.<string>()
    <string>(<string>, '\<string>\<string>----------------\<string>')
    <string>(<string>.<string>(<string>'(\<string>+)', '<<string>>', <string>))
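The script above reads its own source and prints the result. If the goal is to write the replaced text to a separate file, something along these lines might work. This is only a sketch: the helper name replace_words_in_file and the file names input.txt and output.txt are hypothetical, and a temporary directory is used so the example can run anywhere.

```python
import re
import tempfile
from pathlib import Path

def replace_words_in_file(src, dst, token='<string>'):
    """Read src, replace every run of word characters with token, write to dst."""
    text = Path(src).read_text(encoding='utf-8')
    Path(dst).write_text(re.sub(r'\w+', token, text), encoding='utf-8')

# Demo in a throwaway directory (file names are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'input.txt'
    dst = Path(tmp) / 'output.txt'
    src.write_text('to replace!\n\tkeep tabs\n', encoding='utf-8')
    replace_words_in_file(src, dst)
    result = dst.read_text(encoding='utf-8')

print(repr(result))
```

Reading the whole file at once keeps the code short; for very large files a line-by-line loop would use less memory.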

Answer 2

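I don't know whether this is the best way, but it is fairly simple and easy to modify. It handles the example in your question, as well as most of the examples in your comments.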

import sys, re

text = sys.stdin.read()

# A pattern expressing the parts of the input that we want to preserve:
keeper_pattern = r'''(?x)  # verbose format

    (   # We put parens around the whole pattern
        # (and use ?: for subgroups)
        # so that when we use it as the splitter-pattern for re.split(),
        # the result contains one string for each occurrence of the pattern
        # (in addition to the usual between-splitter strings).

                    # The main thing we want to keep is paragraph-separators,
                    # and the 'lead' of the line that follows a para-sep:
                    #
        \n{2,}      # two or more newlines, followed by
        \x20*       # optional indentation (zero or more spaces), followed by
        (?:         # an optional item-marker, which is
          (?:         #   either
            \d+ \.    #       digits followed by a dot,
            |         #   or
            [a-z] \)  #       a letter followed by a right-paren,
          )           #   followed by
          \x20+       #   one or more spaces.
        )?

        |
                    # The other thing we want to keep is
                    # item-markers within paragraphs:
                    #
        \( i+ \)    # a lower-case Roman numeral between parens
                    # (generalize as necessary)
    )
'''

for (i, chunk) in enumerate(re.split(keeper_pattern, text)):

    # In the result of re.split(),
    # the splitters (keepers) will be in the odd positions.
    is_keeper = (i % 2 == 1)

    if is_keeper:
        if chunk.startswith('\n'):
            # paragraph-separator etc
            replacement = chunk
        else:
            # within-para item-marker
            replacement = ' ' + chunk + ' '
    else:
        if chunk == '':
            # (happens if two keepers are adjacent)
            replacement = ''
        else:
            # everything else
            replacement = '<string>'

    sys.stdout.write(replacement)
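To try the same split-and-keep idea without piping a file through stdin, the logic can be wrapped in a function. The following is a condensed sketch of the approach above, not a replacement for it; the names KEEPER and mask are invented here, and the pattern is the same one as above with the comments stripped.

```python
import re

# Same keeper pattern as above, condensed. The single capturing group makes
# re.split() return the separators (keepers) at the odd indices.
KEEPER = re.compile(r'''(?x)
    (
        \n{2,} \x20* (?: (?: \d+ \. | [a-z] \) ) \x20+ )?   # para-sep + optional marker
        |
        \( i+ \)                                            # inline marker like (ii)
    )
''')

def mask(text, token='<string>'):
    out = []
    for i, chunk in enumerate(KEEPER.split(text)):
        if i % 2 == 1:
            # keeper: paragraph separators pass through, inline markers get padded
            out.append(chunk if chunk.startswith('\n') else ' ' + chunk + ' ')
        elif chunk:
            # any non-empty text between keepers collapses to one token
            out.append(token)
    return ''.join(out)

sample = 'Clause 1:\n\n  a) detail 1. some more.\n\n  b) detail 2. and some more..'
print(mask(sample))
```

Running it on the shortened sample prints the layout asked for in the question: the clause heading and each list item become a single <string>, while blank lines, indentation, and the a) / b) markers survive.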
