Efficient way to replace strings from a vocabulary - Python

Question

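I have a vocabulary of phrases, and I want to replace every occurrence of those phrases in other files with an underscore-joined version. For example, I have the following vocabulary:

United States, New York

and I want to turn the following text: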

"I work for New York but I don't even live at the United States"

into this:

"I work for New_York but I don't even live at the United_States"

This is how I currently do it:

import os

def _check_files_and_write_phrases(docs, worker_num):
    print("worker ", worker_num," started!")
    for i, file in enumerate(docs):
        file_path = DOCS_FOLDER + file
        with open(file_path) as f:
            text = f.read()
            for phrase in phrases:
                text = text.replace(phrase, phrase.replace(' ','_'))
            new_doc = PHRASES_DOCS_FOLDER + file[:-4] + '_phrases.txt'
            with open(new_doc, 'w') as nf:
                nf.write(text)

    print("job done on worker ", worker_num)


docs = os.listdir(DOCS_FOLDER)

import threading

threads = []
for i in range(1, 11):
    print(i)
    start = int((len(docs)/10) * (i - 1))
    end = int((len(docs)/10) * (i))
    print(start,end)
    if i != 10:
        t = threading.Thread(target=_check_files_and_write_phrases, args=(docs[start:end], i, ))
    else:
        t = threading.Thread(target=_check_files_and_write_phrases, args=(docs[start:], i, ))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

print("all workers finished!")

But it is far too slow! I thought threads would do the job, but I was wrong...

Is there a more efficient way to do this?

Answer 1

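Try changing the for loop so that it only replaces phrases that actually occur in the text. The phrases contain spaces, so test each one as a substring; intersecting with text.split() would only ever find single words:

for phrase in (p for p in phrases if p in text):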
...

Try it both with and without threads.
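As a rough illustration of this suggestion, here is a minimal sketch that skips phrases which do not occur in the text. The helper name replace_present_phrases and the extra phrase "Los Angeles" are made up for the example; the sample sentence is the one from the question. Note that text.split() yields single words, so a multi-word phrase such as "New York" can never appear in a set intersection with it, which is why the sketch uses a substring test instead.

```python
def replace_present_phrases(text, phrases):
    """Join the words of every phrase that occurs in text with underscores."""
    for phrase in phrases:
        # Substring test: phrases that never occur are skipped outright.
        if phrase in text:
            text = text.replace(phrase, phrase.replace(" ", "_"))
    return text


# Hypothetical vocabulary; "Los Angeles" is absent from the sample text.
vocabulary = ["United States", "New York", "Los Angeles"]
sample = "I work for New York but I don't even live at the United States"

result = replace_present_phrases(sample, vocabulary)
print(result)
```

Whether this is actually faster than calling replace() unconditionally depends on the data, so it is worth timing on real documents.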

Answer 2

All the phrases can be replaced with a single re.sub() call, and the pattern can be precompiled to speed things up further:

import re

phrases = {"United States":"United_States", "New York":"New_York"}
re_replace = re.compile(r'\b({})\b'.format('|'.join(re.escape(phrase) for phrase in phrases.keys())))

def _check_files_and_write_phrases(docs, worker_num):
    print("worker {} started!".format(worker_num))

    for i, filename in enumerate(docs):
        file_path = DOCS_FOLDER + filename

        with open(file_path) as f:
            text = f.read()
            text = re_replace.sub(lambda x: phrases[x.group(1)], text)
            new_doc = PHRASES_DOCS_FOLDER + filename[:-4] + '_phrases.txt'

            with open(new_doc, 'w') as nf:
                nf.write(text)

    print("job done on worker ", worker_num)

This first builds a regular expression from the phrases dictionary, which searches as follows:

\b(United\ States|New\ York)\b
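One detail to watch when building such an alternation: the regex engine tries the alternatives from left to right, so if one phrase is a prefix of another, the shorter one can win. The sketch below assumes a hypothetical vocabulary containing both "New York" and "New York City" (not part of the original question) and sorts the phrases longest first before joining them.

```python
import re

# Hypothetical vocabulary in which one phrase is a prefix of another.
phrases = {"New York": "New_York", "New York City": "New_York_City"}

# Longest phrases first, so "New York City" is tried before "New York".
ordered = sorted(phrases, key=len, reverse=True)
pattern = re.compile(r"\b({})\b".format("|".join(re.escape(p) for p in ordered)))

text = "New York City is in New York"
result = pattern.sub(lambda m: phrases[m.group(1)], text)
print(result)
```

If no phrase in the vocabulary is a prefix of another, the ordering makes no difference and the sort can be dropped.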

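The sub() call then uses the phrases dictionary to look up the replacement for each matched phrase. It takes two arguments: the replacement and the original text. The replacement can be a fixed string or, as in this case, a function. The function takes a single argument, the match object, and returns the replacement text. A lambda is used for this; it simply looks up the matched text in the phrases dictionary.

Instead of a dictionary lookup, replace() could be called on the matched text here, but looking up a precomputed replacement should be faster. The \b anchors restrict replacements to word boundaries, so, for example, MYNew York would be skipped. If needed, adding flags=re.I to re.compile() makes the search case-insensitive.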
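The word-boundary and case-insensitive behaviour can be checked with a small sketch. One caveat: with flags=re.I the matched text may differ in case from the dictionary keys, so a direct phrases[...] lookup could raise KeyError. The version below works around that with a lower-cased lookup table; the names lookup and replace_match are made up for this example, and the approach assumes plain ASCII phrases.

```python
import re

phrases = {"United States": "United_States", "New York": "New_York"}

# Lower-cased keys so that a match in any letter case can be looked up.
lookup = {key.lower(): value for key, value in phrases.items()}

pattern = re.compile(
    r"\b({})\b".format("|".join(re.escape(p) for p in phrases)),
    flags=re.I,
)


def replace_match(match):
    # Normalise the matched text before the dictionary lookup.
    # The replacement always uses the casing stored in the dictionary.
    return lookup[match.group(1).lower()]


text = "MYNew York, new york and NEW YORK"
result = pattern.sub(replace_match, text)
print(result)
```

To preserve the original casing of each match instead, the function could return match.group(1).replace(" ", "_"), at the cost of one extra string operation per match.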

