Replace words with word-substitutions from another file

The words in my text file (mytext.txt) need to be replaced with other words given in another text file (replace.txt).

cat mytext.txt
this is here. and it should be there. 
me is this will become you is that.

cat replace.txt
this that
here there
me you

The following code does not work as expected.

with open('mytext.txt', 'r') as myf:
    with open('replace.txt' , 'r') as myr:
        for line in myf.readlines():
            for l2 in myr.readlines():
                original, replace = l2.split()
                print line.replace(original, replace)

Expected output:

that is there. and it should be there. 
you is that will become you is that.

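You print the line after one replacement, then print it again after the next replacement. You want to print the line only once, after all the replacements have been made.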

str.replace(old, new[, count])
Return a copy of the string...

You are discarding the copy every time because you never save it in a variable. In other words, replace() does not change line.
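
A minimal sketch of that point, using an in-memory string in place of a line read from mytext.txt (the variable name `unchanged` is only for illustration):

```python
# str.replace() returns a new string; the original object is left untouched.
line = "this is here."

line.replace("this", "that")          # the returned copy is thrown away
unchanged = line                      # line still holds the original text

line = line.replace("this", "that")   # rebind the name to keep the result

print(unchanged)  # this is here.
print(line)       # that is here.
```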

Next, the word there contains the substring here (which gets replaced by there), so the result ends up as tthere.
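
To see the substring problem in isolation, here is a small sketch on an in-memory copy of the first line of mytext.txt. It compares plain str.replace() with a word-boundary regular expression; the names `naive` and `whole_word` are made up for the example.

```python
import re

line = "this is here. and it should be there."

# Plain substring replacement also hits the "here" inside "there".
naive = line.replace("here", "there")

# \b anchors the match at word boundaries, so only the whole word "here" matches.
whole_word = re.sub(r"\bhere\b", "there", line)

print(naive)       # this is there. and it should be tthere.
print(whole_word)  # this is there. and it should be there.
```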

You can fix those problems like this:

import re

with open('replace.txt' , 'r') as f:
    repl_dict = {}

    for line in f:
        key, val = line.split()
        repl_dict[key] = val


with open('mytext.txt', 'r') as f:
    for line in f:
        for key, val in repl_dict.items():
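            # re.escape() keeps keys containing regex metacharacters literal;
            # \b restricts each match to a whole word.
            line = re.sub(r"\b" + re.escape(key) + r"\b", val, line)
        print(line.rstrip())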

--output:--
that is there. and it should be there. 
you is that will become you is that.

Or, like this:

import re

#Create a dict that returns the key itself
# if the key is not found in the dict:
class ReplacementDict(dict):
    def __missing__(self, key):
        self[key] = key
        return key

#Create a replacement dict:
with open('replace.txt') as f:
    repl_dict = ReplacementDict()

    for line in f:
        key, val = line.split()
        repl_dict[key] = val

#Create the necessary inputs for re.sub():
def repl_func(match_obj):
    return repl_dict[match_obj.group(0)]

pattern = r"""
    \w+   #Match a 'word' character, one or more times
"""

regex = re.compile(pattern, flags=re.X)

#Replace the words in each line with the 
#entries in the replacement dict:
with open('mytext.txt') as f:
    for line in f:
        line = regex.sub(repl_func, line)
        print(line.rstrip())

With replace.txt like this:

this that
here there
me you
there dog

...the output is:

that is there. and it should be dog.
you is that will become you is that.
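
The difference between this single-pass approach and one re.sub() call per key can be sketched as follows. This is an illustration only: the dict and the line are in-memory copies of the files shown above, and the sequential result relies on dicts preserving insertion order (Python 3.7+).

```python
import re

# In-memory stand-ins for replace.txt (with the extra "there dog" entry)
# and for the first line of mytext.txt.
repl = {"this": "that", "here": "there", "me": "you", "there": "dog"}
line = "this is here. and it should be there."

# Sequential passes: the "there" produced from "here" is later turned into "dog".
sequential = line
for key, val in repl.items():
    sequential = re.sub(r"\b" + re.escape(key) + r"\b", val, sequential)

# Single pass: every original word is looked up exactly once.
single_pass = re.sub(r"\w+", lambda m: repl.get(m.group(0), m.group(0)), line)

print(sequential)   # that is dog. and it should be dog.
print(single_pass)  # that is there. and it should be dog.
```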

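The following will solve your problem. The problem with your code is that you are printing after every replacement.

The optimal solution would be: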

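# Build the replacement table from replace.txt.
replacement = {}
with open("replace.txt") as myr:
    for i in myr:
        original, replace = i.split()
        replacement[original] = replace

# Replace every whitespace-separated token that has an entry in the table.
# Note: str.replace() matches substrings, and a token with attached
# punctuation (e.g. "here.") is not looked up as "here".
with open("mytext.txt") as myf:
    for i in myf:
        for j in i.split():
            if j in replacement:
                i = i.replace(j, replacement[j])
        print(i.rstrip("\n"))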

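It looks like you want your inner loop to read the contents of 'replace.txt' for every line of 'mytext.txt'. That is very inefficient, and it won't actually work as written: once you have read all the lines of 'replace.txt', the file pointer is left at the end of the file, so when you try to process the second line of 'mytext.txt' there won't be any lines left to read in 'replace.txt'.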

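You could send the myr file pointer back to the start of the file with myr.seek(0), but as I said, that is not very efficient. A better strategy is to read 'replace.txt' into an appropriate data structure and then use that data to do the replacements on each line of 'mytext.txt'.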
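
The file-pointer behaviour can be sketched without touching the disk. Here io.StringIO stands in for the open 'replace.txt' file (an assumption for illustration; a real file object opened in text mode behaves the same way for these calls).

```python
import io

# Stand-in for: myr = open('replace.txt')
myr = io.StringIO("this that\nhere there\nme you\n")

first_read = myr.readlines()    # reads everything; the pointer is now at the end
second_read = myr.readlines()   # nothing left to read
myr.seek(0)                     # rewind to the start of the file
third_read = myr.readlines()    # the lines are available again

print(len(first_read), len(second_read), len(third_read))  # 3 0 3
```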

A good data structure for this is a dict. E.g.,

replacements = {'this': 'that', 'here': 'there', 'me': 'you'}

Can you figure out how to build such a dict from 'replace.txt'?
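
One possible way to do it, as a sketch: io.StringIO stands in for open('replace.txt') so the example is self-contained, and it assumes each non-blank line holds exactly two whitespace-separated words.

```python
import io

# Stand-in for: replace_file = open('replace.txt')
replace_file = io.StringIO("this that\nhere there\nme you\n")

replacements = {}
for line in replace_file:
    if not line.strip():
        continue                    # skip blank lines
    original, new = line.split()    # raises ValueError if a line is malformed
    replacements[original] = new

print(replacements)  # {'this': 'that', 'here': 'there', 'me': 'you'}
```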

I see that gman and 7stud have covered the issue of saving the results of your replacements so that they accumulate, so I won't bother discussing that. :)

You could use re.sub:

>>> import re
>>> with open('mytext.txt') as f1, open('replace.txt') as f2:
...     my_text = f1.read()
...     for x in f2:
...         x = x.strip().split()
...         my_text = re.sub(r"\b%s\b" % re.escape(x[0]), x[1], my_text)
...     print(my_text)
... 
that is there. and it should be there. 
you is that will become you is that.

\b%s\b defines the word boundaries.
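
A short sketch of how \b behaves, on an in-memory copy of the second line of mytext.txt. It also shows why the pattern needs the r prefix: in an ordinary Python string literal, "\b" is a backspace character rather than the regex word-boundary assertion. The variable names are illustrative.

```python
import re

text = "me is this will become you is that."

# Raw string: \b is the word-boundary assertion, so the "me" in "become" is skipped.
raw_matches = re.findall(r"\bme\b", text)

# Ordinary string: "\b" is a backspace character, so nothing matches.
plain_matches = re.findall("\bme\b", text)

# Punctuation counts as a boundary, so the "that" before the full stop still matches.
replaced = re.sub(r"\bthat\b", "THAT", text)

print(raw_matches)    # ['me']
print(plain_matches)  # []
print(replaced)       # me is this will become you is THAT.
```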

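Edit: I stand corrected; the OP is asking for word-for-word replacement rather than simple string replacement ('become' -> 'become' rather than 'becoyou'). I guess a dict version could look like this, using the regex split method found in a comment on the accepted answer to Splitting a string into words and punctuation:
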
import re

def clean_split(string_input):
    """ 
    Split a string into its component tokens and return as list
    Treat spaces and punctuations, including in-word apostrophes as separate tokens

    >>> clean_split("it's a good day today!")
    ["it", "'", "s", " ", "a", " ", "good", " ", "day", " ", "today", "!"]
    """
    return re.findall(r"[\w]+|[^\w]", string_input)

with open('replace.txt' , 'r') as myr:
    replacements = dict(tuple(line.split()) for line in myr)

with open('mytext.txt', 'r') as myf:
    for line in myf:
        # Each line's newline is itself a token, so suppress print's own newline.
        print(''.join(replacements.get(word, word) for word in clean_split(line)), end='')

I can't reason very well about re efficiency; if anyone points out glaring inefficiencies I would be most grateful.

Edit 2: OK, I was inserting spaces between words and punctuation; that is now fixed by treating blank spaces as tokens and doing ''.join() instead of ' '.join().
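
The effect of that change can be sketched as follows. The tokenizer below uses r"\w+|\W", which is assumed to be equivalent to the r"[\w]+|[^\w]" pattern in clean_split above; the input string is just a fragment of the sample text.

```python
import re

def clean_split(string_input):
    # Runs of word characters, or single non-word characters (spaces included).
    return re.findall(r"\w+|\W", string_input)

tokens = clean_split("you is that.")
spaced = ' '.join(tokens)   # adds a separator around the space tokens as well
joined = ''.join(tokens)    # reproduces the original spacing exactly

print(tokens)        # ['you', ' ', 'is', ' ', 'that', '.']
print(repr(spaced))  # 'you   is   that .'
print(repr(joined))  # 'you is that.'
```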

As an alternative, we can use string's Template to achieve this. It works, but it is very ugly and inefficient:

from string import Template

with open('replace.txt', 'r') as myr:
    # read the replacement first and build a dictionary from it
    d = {str(k): v for k,v in [line.strip().split(" ") for line in myr]}

d
{'here': 'there', 'me': 'you', 'this': 'that'}

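with open('mytext.txt', 'r') as myf:
    for line in myf:
        # Prefix every word with '$' so Template treats it as a placeholder.
        # Literal '$' characters are masked first and dropped at the end.
        masked = line.strip().replace('$', '_____')
        template = Template('$' + ' $'.join(masked.split(' ')))
        print(template.safe_substitute(**d).replace('$', '').replace('_____', ''))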

Result:

that is there. and it should be there.
you is that will become you is that.
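
For readers unfamiliar with Template, here is a small sketch of the behaviour this approach relies on: safe_substitute() fills in the placeholders it knows and leaves unknown ones in place, which is why the leftover '$' characters have to be stripped afterwards. The template string is a hand-written fragment, not output from the code above.

```python
from string import Template

d = {'here': 'there', 'me': 'you', 'this': 'that'}

# Known placeholders are substituted; "$is" has no entry in d and is left
# as-is instead of raising KeyError as substitute() would.
result = Template('$this $is $here.').safe_substitute(d)

print(result)  # that $is there.
```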