Replace all consecutive repeated letters ignoring specific words

Question

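I have seen many suggestions to use re (regular expressions) or .join in Python to remove consecutive repeated letters from a sentence, but I want to make an exception for special words.

For example:

I want this sentence: sentence = 'hello, join this meeting heere using thiis lllink'

to become: 'hello, join this meeting here using this link'

given that I have this list of words to keep, which should be skipped by the repeated-letter check: keepWord = ['Hello','meeting']

The two scripts I found useful are:

Using .join: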

import itertools

sentence = ''.join(c[0] for c in itertools.groupby(sentence))

Using regex:

import re

sentence = re.compile(r'(.)\1{1,}').sub(r'\1', sentence)

I have a solution, but I think there is a more compact and efficient one. My current solution is:

import itertools

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']

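new_sentence = ''

for word in sentence.split():
    # split() leaves punctuation attached, so 'hello,' does not match 'hello' in keepWord
    if word not in keepWord:
        # Collapse each run of identical characters to a single character
        new_word = ''.join(c[0] for c in itertools.groupby(word))
        new_sentence = new_sentence + " " + new_word
    else:
        new_sentence = new_sentence + " " + word

# Drop the leading space added on the first iteration
new_sentence = new_sentence.lstrip()

Any suggestions?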

Answer 1

You can match whole words from the keepWord list and only replace sequences of two or more identical letters in other contexts:

import re
sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']
new_sentence = re.sub(fr"\b(?:{'|'.join(keepWord)})\b|([^\W\d_])\1+", lambda x: x.group(1) or x.group(), sentence)
print(new_sentence)
# => hello, join this meeting here using this link

See the Python demo.

The regex looks like

\b(?:hello|meeting)\b|([^\W\d_])\1+

See the regex demo. If Group 1 matched, its value is returned; otherwise, the full match (a word to keep) is put back.
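To make the two branches of the alternation visible, here is a small sketch that lists every match together with its Group 1 value. The names keep_words, pattern and trace are introduced only for this illustration and are not part of the answer's code.

```python
import re

keep_words = ['hello', 'meeting']
# Same shape as the pattern above: keep words first, then a letter followed by repeats of itself
pattern = re.compile(r"\b(?:" + "|".join(keep_words) + r")\b|([^\W\d_])\1+")

sentence = 'hello, join this meeting heere using thiis lllink'

# Collect (full match, Group 1) for every match in the sentence
trace = [(m.group(), m.group(1)) for m in pattern.finditer(sentence)]

for full, letter in trace:
    # Group 1 is None when the keep-word branch matched
    action = f"collapse to {letter!r}" if letter else "keep as is"
    print(f"{full!r}: {action}")
```

Words without repeated letters, such as join or this, never match at all and are therefore left alone by re.sub.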

Pattern details

\b(?:hello|meeting)\b - hello or meeting enclosed in word boundaries
| - or
([^\W\d_]) - Group 1: any Unicode letter
\1+ - one or more backreferences to the Group 1 value
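If the keep list comes from user input, it may be worth escaping each word before building the alternation. The following is a minimal sketch under that assumption; the function name collapse_repeats and its ignore_case parameter are hypothetical and only illustrate one way to package the pattern described above.

```python
import re

def collapse_repeats(text, keep_words, ignore_case=False):
    """Collapse runs of a repeated letter, leaving whole-word matches of keep_words untouched."""
    repeat = r"([^\W\d_])\1+"
    if not keep_words:
        # No keep words: only the repeated-letter branch is needed
        return re.sub(repeat, r"\1", text)
    # Escape regex metacharacters; try longer words first so a shorter one cannot shadow them
    alternatives = "|".join(
        re.escape(w) for w in sorted(keep_words, key=len, reverse=True)
    )
    # A scoped inline flag keeps the case-insensitivity limited to the keep-word branch,
    # so the backreference \1 still compares letters case-sensitively
    group_open = "(?i:" if ignore_case else "(?:"
    pattern = re.compile(rf"\b{group_open}{alternatives})\b|{repeat}")
    # Group 1 is set only for the repeated-letter branch; otherwise put the keep word back
    return pattern.sub(lambda m: m.group(1) or m.group(), text)

result = collapse_repeats(
    'hello, join this meeting heere using thiis lllink', ['hello', 'meeting']
)
print(result)
```

Note that \b only behaves as expected when the keep words start and end with word characters, so this sketch assumes plain words in the list.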

Answer 2

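It isn't especially compact, but here is a fairly simple example using regex: the function subst replaces repeated characters with a single character, and re.sub then calls it for each word it finds.

This assumes that, because your example keepWord list (where it is first mentioned) has Hello in title case but the text has hello in lower case, you want a case-insensitive comparison against the list. So it works equally well whether your sentence contains Hello or hello.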

import re

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['Hello','meeting']

keepWord_s = set(word.lower() for word in keepWord)

def subst(match):
    word = match.group(0)
    return word if word.lower() in keepWord_s else re.sub(r'(.)\1+', r'\1', word)

print(re.sub(r'\b.+?\b', subst, sentence))

This gives:

hello, join this meeting here using this link
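As a variation on the same callback idea, here is a sketch that reuses the itertools.groupby one-liner from the question and only hands runs of letters to the callback. The function name dedupe_words is hypothetical. With \b.+?\b the separators between words are also passed to the callback, so something like '...' between two words would be collapsed as well; matching [^\W\d_]+ instead leaves punctuation, digits and whitespace untouched.

```python
import itertools
import re

def dedupe_words(text, keep_words):
    # Case-insensitive lookup, as assumed in this answer
    keep = {w.lower() for w in keep_words}

    def fix(match):
        word = match.group()
        if word.lower() in keep:
            return word
        # groupby yields one (key, group) pair per run of identical characters
        return ''.join(ch for ch, _ in itertools.groupby(word))

    # [^\W\d_]+ matches runs of letters only
    return re.sub(r'[^\W\d_]+', fix, text)

result = dedupe_words(
    'hello, join this meeting heere using thiis lllink', ['Hello', 'meeting']
)
print(result)
```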
