Efficient way to replace strings from a vocabulary - Python
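I have a vocabulary of phrases, and I want to replace the matching words in another file with those phrases. For example, I have the following vocabulary:
United States,
New York
and I want to turn this text:
"I work for New York but I don't even live at the United States"
into this:
"I work for New_York but I don't even live at the United_States"
This is how I am doing it at the moment: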
import os

def _check_files_and_write_phrases(docs, worker_num):
    print("worker ", worker_num," started!")
    for i, file in enumerate(docs):
        file_path = DOCS_FOLDER + file
        with open(file_path) as f:
            text = f.read()
        for phrase in phrases:
            text = text.replace(phrase, phrase.replace(' ','_'))
        new_doc = PHRASES_DOCS_FOLDER + file[:-4] + '_phrases.txt'
        with open(new_doc, 'w') as nf:
            nf.write(text)
    print("job done on worker ", worker_num)

docs = os.listdir(DOCS_FOLDER)

import threading
threads = []
for i in range(1, 11):
    print(i)
    start = int((len(docs)/10) * (i - 1))
    end = int((len(docs)/10) * (i))
    print(start,end)
    if i != 10:
        t = threading.Thread(target=_check_files_and_write_phrases, args=(docs[start:end], i, ))
    else:
        t = threading.Thread(target=_check_files_and_write_phrases, args=(docs[start:], i, ))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
print("all workers finished!")
But it is far too slow! I thought threads would do the job, but I was wrong...
Is there a more efficient way to do this?
Try changing the for loop so that it only replaces phrases that are present in the text:
for phrase in set(phrases).intersection(text.split()):
    ...
Try it both with and without threads.
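One caveat with the loop above, as far as I can tell: text.split() yields single words, so a multi-word phrase such as "New York" can never be a member of the intersection. Below is a minimal sketch comparing that test with a plain substring test. The variable names present_by_split and present_by_substring are made up for illustration and are not from the original post.

```python
phrases = ["United States", "New York"]
text = "I work for New York but I don't even live at the United States"

# split() produces single words, so no multi-word phrase can be a member
present_by_split = set(phrases).intersection(text.split())

# a substring test does find the multi-word phrases
present_by_substring = {p for p in phrases if p in text}

print(len(present_by_split))         # 0
print(sorted(present_by_substring))  # ['New York', 'United States']
```

Whether pre-filtering like this actually saves time depends on the data, since str.replace() already has to scan the text once per phrase.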
You can replace all of the phrases with a single re.sub() call, and the pattern can be precompiled to speed things up further:
import re

phrases = {"United States":"United_States", "New York":"New_York"}
re_replace = re.compile(r'\b({})\b'.format('|'.join(re.escape(phrase) for phrase in phrases.keys())))

def _check_files_and_write_phrases(docs, worker_num):
    print("worker {} started!".format(worker_num))
    for i, filename in enumerate(docs):
        file_path = DOCS_FOLDER + filename
        with open(file_path) as f:
            text = f.read()
        text = re_replace.sub(lambda x: phrases[x.group(1)], text)
        new_doc = PHRASES_DOCS_FOLDER + filename[:-4] + '_phrases.txt'
        with open(new_doc, 'w') as nf:
            nf.write(text)
    print("job done on worker ", worker_num)
This first builds a regular expression from the phrases dictionary, which searches as follows:
\b(United\ States|New\ York)\b
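re.sub() then uses the phrases dictionary to look up the replacement for each matched phrase. Called on the compiled pattern, sub() takes two arguments here: the replacement and the original text. The replacement can be either a fixed string or, as in this case, a function. The function takes a single argument, a match object, and returns the replacement text. A lambda is used for this; it simply looks up the matched text in the phrases dictionary.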
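To make the function-as-replacement step concrete, here is a small self-contained sketch that runs the same pattern on the sentence from the question. The named function lookup stands in for the lambda and is only an illustrative name.

```python
import re

phrases = {"United States": "United_States", "New York": "New_York"}
pattern = re.compile(
    r'\b({})\b'.format('|'.join(re.escape(p) for p in phrases))
)

def lookup(match):
    # match.group(1) is the phrase exactly as it appeared in the text;
    # the dictionary maps it to its precomputed replacement
    return phrases[match.group(1)]

text = "I work for New York but I don't even live at the United States"
result = pattern.sub(lookup, text)
print(result)
# I work for New_York but I don't even live at the United_States
```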
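Instead of the dictionary lookup you could call replace() on the matched text here, but precomputed replacement text should be faster. The \b is added so that replacements only happen on word boundaries, so MYNew York, for example, would be skipped. If needed, adding flags=re.I to re.compile() makes the search case-insensitive.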
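One thing to watch for with flags=re.I: a match such as "new york" is not a key in phrases, so the dictionary lookup would raise KeyError. The sketch below sidesteps that by building the replacement from the matched text itself. It also sorts the alternatives longest first so that a longer phrase wins over its own prefix. The phrase "New York City" and the function name underscore are assumptions added for illustration and do not appear in the original post.

```python
import re

phrases = ["New York", "New York City", "United States"]

# Longest first, so "New York City" is tried before its prefix "New York"
alternatives = sorted(phrases, key=len, reverse=True)
pattern = re.compile(
    r'\b({})\b'.format('|'.join(re.escape(p) for p in alternatives)),
    flags=re.I,
)

def underscore(match):
    # Use the matched text rather than a dictionary key, so that
    # differently cased matches such as "new york" still work
    return match.group(1).replace(' ', '_')

text = "MYNew York, new york city and the UNITED STATES"
result = pattern.sub(underscore, text)
print(result)
# MYNew York, new_york_city and the UNITED_STATES
```

"MYNew York" is left alone because there is no word boundary between "MY" and "New". The original casing of each match is preserved in the output.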