Python Regex - Fast replace of multiple keywords with punctuation and starting with

Question

This is an extension of an earlier question.

I have a Python dictionary built like this:

a = {"animal": [ "dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}

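I want to find a solution that replaces, as fast as possible, all the words listed in the dictionary values with their keys. The solution should scale to large texts. If a word ends with an asterisk, every word in the text that starts with that prefix should be replaced.

So the sentence "I've been bad but I aspire to be a better person, and behave like my dog and cat :)" should be transformed into "XXX bad but I XXX to be a better person, and behave like my animal XXX".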

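I am trying to use trrex for this, thinking it should be the fastest option. Is it? So far I have not succeeded, and I ran into two problems:

handling entries that contain punctuation (such as ":)" and "I've been");
handling entries that overlap, such as "dog" and "dog and cat".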
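
Both problems can be reproduced with plain `re`, without any trie library. The snippet below is only a minimal sketch: the variable names `with_b`, `naive` and `shadowed` are made up for illustration, and the naive patterns stand in for whatever a straightforward union would produce.

```python
import re

text = "I've been bad but I aspire to be a better person, and behave like my dog and cat :)"

# Problem 1: \b needs a word character on one side of the boundary,
# so it can never match next to ":)", which has none.
with_b = re.search(r"\b" + re.escape(":)") + r"\b", text)

# Problem 2: in a plain union the first alternative that matches wins,
# so "dog" shadows the longer "dog and cat".
naive = re.compile(r"\b(?:dog|cat|dog and cat)\b")
shadowed = naive.sub("animal", text)

print(with_b)    # None
print(shadowed)  # ends with "like my animal and animal :)"
```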

Can you help me achieve my goal with a scalable solution?

Answer 1

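You can adapt the trie-based regex solution to suit your needs:

Create another dictionary from a that contains the same keys and the regexes built from the values
If a * char is found, replace it with \w* if you mean any zero or more word chars, or with \S* if you mean any zero or more non-whitespace chars (adjust the def quote(self, char) method accordingly); otherwise, quote the char
Use unambiguous word boundaries, (?<!\w) and (?!\w), or remove them altogether if they interfere with matching non-word entries
The first regex here will look like (?<!\w)(?:cat|dog(?:\ and\ cat)?)(?!\w) and the second will look like (?<!\w)(?::\)|I've\ been|asp\w*)(?!\w)
Replace in a loop.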
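
To see what the wildcard and boundary choices in the steps above change in practice, here is a small sketch that leaves the trie out. `keyword_to_regex` and `bounded` are hypothetical helpers written for this illustration, and the sample string is made up.

```python
import re

def keyword_to_regex(keyword, wildcard=r"\w*"):
    """Escape a keyword; a trailing '*' is turned into `wildcard`."""
    if keyword.endswith("*"):
        return re.escape(keyword[:-1]) + wildcard
    return re.escape(keyword)

def bounded(fragment):
    # Lookarounds instead of \b: they only require that no word character
    # is adjacent, so they also work for entries such as ":)".
    return re.compile(rf"(?<!\w)(?:{fragment})(?!\w)", re.I)

sample = "aspire asp.net :)"

# \w* stops at the first non-word character, \S* runs up to the next whitespace.
word_chars = bounded(keyword_to_regex("asp*")).findall(sample)
non_space = bounded(keyword_to_regex("asp*", wildcard=r"\S*")).findall(sample)
smiley = bounded(keyword_to_regex(":)")).findall(sample)

print(word_chars)  # ['aspire', 'asp']
print(non_space)   # ['aspire', 'asp.net']
print(smiley)      # [':)']
```

Which wildcard is right depends on whether "asp*" is meant to swallow punctuation glued to the word; the answer's code uses \w*.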

See the Python demo:

import re

# Input
text = "I've been bad but I aspire to be a better person, and behave like my dog and cat :)"
a = {"animal": [ "dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}

class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""
    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            # Descend into the child node, creating it on first use.
            ref = ref.setdefault(char, {})
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        if char == '*':
            return r'\w*'
        else:
            return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
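                recurse = self._pattern(data[char])
                if recurse is None:
                    # Leaf node: this char ends a word and has no continuation.
                    if char == '*':
                        # \w* is not valid inside a character class,
                        # so keep the wildcard as a separate alternative.
                        alt.append(self.quote(char))
                    else:
                        cc.append(self.quote(char))
                else:
                    alt.append(self.quote(char) + recurse)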
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

# Creating patterns
a2 = {}
for k,v in a.items():
    trie = Trie()
    for w in v:
        trie.add(w)
    a2[k] = re.compile(fr"(?<!\w){trie.pattern()}(?!\w)", re.I)

for k,r in a2.items():
    text = r.sub(k, text)
    
print(text)
# => XXX bad but I XXX to be a better person, and behave like my animal XXX
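
If looping over one pattern per key turns out to be the bottleneck on large texts, a possible variation is to do everything in a single pass. The sketch below is an assumption on my side rather than part of the demo: it swaps the trie for a plain union sorted longest-first (so it gives up the trie's prefix sharing), and `to_fragment` and `build_replacer` are hypothetical names. It uses one capturing group per keyword and `Match.lastindex` to find the key to substitute.

```python
import re

a = {"animal": ["dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}

def to_fragment(keyword):
    # A trailing '*' means "any word starting with this prefix".
    if keyword.endswith("*"):
        return re.escape(keyword[:-1]) + r"\w*"
    return re.escape(keyword)

def build_replacer(mapping):
    # One (keyword, key) pair per entry, longest keyword first,
    # so that "dog and cat" is tried before "dog".
    pairs = sorted(
        ((kw, key) for key, kws in mapping.items() for kw in kws),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    groups = [f"({to_fragment(kw)})" for kw, _ in pairs]
    keys = [key for _, key in pairs]
    pattern = re.compile(r"(?<!\w)(?:" + "|".join(groups) + r")(?!\w)", re.I)
    # lastindex is the number of the capturing group that matched,
    # which maps straight back to the key of that keyword.
    return lambda text: pattern.sub(lambda m: keys[m.lastindex - 1], text)

replace_all = build_replacer(a)
result = replace_all("I've been bad but I aspire to be a better person, and behave like my dog and cat :)")
print(result)
# => XXX bad but I XXX to be a better person, and behave like my animal XXX
```

A side effect of the single pass is that text inserted by one replacement is never rescanned, so a key can no longer be matched by another key's pattern.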

python

regex

string

full-text-search

replace