Delete based on presence

Question

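I'm trying to analyze an article to determine whether a specific substring appears in it.

If "Bill" is present, I want to delete the substring's parent sentence from the article, as well as every sentence that follows the first deleted sentence.

If "Bill" is not present, the article should be left unchanged.

Sample text: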

stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, Star Fox in the way you can rotate your craft to fit through narrow gaps. 

This is Bill, signing off. Thank you for reading. And see you tomorrow!"""

Expected result when the target substring is "Bill":

stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, but does that hindsight extend to this thoroughly literally-named racing tie-in? Star Fox in the way you can rotate your craft to fit through narrow gaps.
"""

This is the current code:

if "Bill" not in stringy[-200:]:
    print(stringy)

text = stringy.rsplit("Bill")[0]

text = text.split('.')[:-1]

text = '.'.join(text) + '.'

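It currently doesn't work when "Bill" also appears outside the last 200 characters: the text gets cut off at the very first instance of "Bill" (the opening sentence, "This is Bill Everest here"). How can I change this code so that it only selects "Bill"s in the last 200 characters?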
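The behaviour described above comes from calling `str.rsplit` without a `maxsplit` argument: it then splits on every occurrence, so element `[0]` is whatever precedes the first "Bill". A minimal sketch that shows the difference, using a made-up sample string (not the article from the question):

```python
# Hypothetical sample with "Bill" at both the start and the end
sample = "Bill opens. Middle part. Bill closes. Bye."

# No maxsplit: split on every "Bill", so [0] is the text before the FIRST one
all_parts = sample.rsplit("Bill")

# maxsplit=1: split only once, starting from the right (the LAST "Bill")
last_only = sample.rsplit("Bill", 1)

print(all_parts[0])
print(last_only[0])
```

Passing `maxsplit=1` only limits the split to the last occurrence; on its own it does not enforce the 200-character window.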

Answer 1

Here is how you can do it using re:

import re

stringy = """..."""
target = "Bill"

l = re.findall(r'([A-Z][^\.!?]*[\.!?])',stringy)

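# Walk the sentences from last to first (including index 0), so the earliest
# sentence whose "Bill" lies within roughly the last 200 characters wins
for i in range(len(l)-1,-1,-1):
    if target in l[i] and sum([len(a) for a in l[i:]])-sum([len(a) for a in l[i].split(target)[:-1]]) < 200:
        # Keep only the sentences before the matching one
        stringy = ' '.join(l[:i])

print(stringy)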

Answer 2

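Here's another approach that goes through each sentence using a regex. We keep a running count of the characters seen so far, and once we are inside the last 200 characters we check the sentence for 'Bill'. If it is found, we drop everything from that sentence onward.

Hopefully the code is readable enough.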

import re

def remove_bill(stringy):
    sentences = re.findall(r'([A-Z][^\.!?]*[\.!?]\s*\n*)', stringy)
    total = len(stringy)
    count = 0
    for index, line in enumerate(sentences):
        #Check each index of 'Bill' in line
        for pos in (m.start() for m in re.finditer('Bill', line)):
            if count + pos >= total - 200:
                stringy = ''.join(sentences[:index])
                return stringy
        count += len(line)
    return stringy

stringy = remove_bill(stringy)
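Both answers rebuild the text by joining the matched sentences, so any characters the pattern skips (for example a closing quote after "!") are dropped or miscounted. As a possible variation, here is a minimal sketch that works with match offsets and slices the original string instead. The function name `cut_from_late_target` and the `window` parameter are made up for illustration; the sentence pattern is the same simplifying assumption as above (a sentence starts with a capital letter and ends at ".", "!" or "?").

```python
import re

def cut_from_late_target(text, target="Bill", window=200):
    """Drop the first sentence that contains `target` within the last
    `window` characters of `text`, along with everything after it."""
    cutoff = max(len(text) - window, 0)
    # Assumed sentence shape: capital letter, then anything up to . ! or ?
    for sentence in re.finditer(r'[A-Z][^.!?]*[.!?]', text):
        # Look for the target only inside this sentence's span
        pos = text.find(target, sentence.start(), sentence.end())
        while pos != -1:
            if pos >= cutoff:
                # Slice the original text, so skipped characters are preserved
                return text[:sentence.start()]
            pos = text.find(target, pos + 1, sentence.end())
    # No late occurrence found: leave the text untouched
    return text
```

Called as `stringy = cut_from_late_target(stringy)`, this leaves an early "Bill" alone and cuts at the first sentence whose "Bill" starts inside the window. The result keeps any whitespace that preceded the removed sentence; add `.rstrip()` if that is unwanted.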

python

algorithm

substring

python-3.x

python-re