Replace several words in a text with Python

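I use the code below to strip all HTML tags from a file and convert it to plain text. In addition, I have to convert XML/HTML character entities to ASCII characters. Here I have 21 lines, each of which reads through the whole text. That means that if I want to convert a huge file, I have to spend a lot of resources on it.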


# -*- coding: utf-8 -*-
import re

# This file contains HTML.
file = open('input-file.html', 'r')
temp = file.read()

# Replace some XML/HTML character entities with plain characters.
temp = temp.replace('&lsquo;', "'")
temp = temp.replace('&rsquo;', "'")
temp = temp.replace('&ldquo;', '"')
temp = temp.replace('&rdquo;', '"')
temp = temp.replace('&sbquo;', ',')
temp = temp.replace('&prime;', "'")
temp = temp.replace('&Prime;', '"')
temp = temp.replace('&laquo;', '«')
temp = temp.replace('&raquo;', '»')
temp = temp.replace('&lsaquo;', '‹')
temp = temp.replace('&rsaquo;', '›')
temp = temp.replace('&amp;', '&')
temp = temp.replace('&ndash;', ' – ')
temp = temp.replace('&mdash;', ' — ')
temp = temp.replace('&reg;', '®')
temp = temp.replace('&copy;', '©')
temp = temp.replace('&trade;', '™')
temp = temp.replace('&para;', '¶')
temp = temp.replace('&bull;', '•')
temp = temp.replace('&middot;', '·')

# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)

# Write the result to a new file.
file = open("output-file.txt", "w")
file.write(result)
file.close()

You can use string.translate():

from string import maketrans   # Python 2 only: maketrans lives in the string module.

# Example values only; intab and outtab must have the same length.
intab = "aeiou"    # characters to be replaced
outtab = "12345"   # replacement characters, matched position by position
trantab = maketrans(intab, outtab)  # helper that builds the translation table

text = "this is string example....wow!!!"  # your string
print text.translate(trantab)  # th3s 3s str3ng 2x1mpl2....w4w!!!

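Note that in Python 3, str.translate can be much slower than in Python 2, especially when you translate only a few characters. This is because it has to handle Unicode characters, so it uses a dictionary to perform the translation instead of indexing into a string.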
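To make the dictionary point concrete, here is a minimal Python 3 sketch. It assumes the text already contains the decoded Unicode characters rather than the entities, and the names quote_table and to_ascii_quotes are made up for illustration. str.maketrans returns a plain dict keyed by code points, and when it is given a dict, one character may map to a replacement string of any length.

```python
# Python 3: the translation table is a plain dict keyed by Unicode code points.
table = str.maketrans("abc", "123")
print(table)  # {97: 49, 98: 50, 99: 51}

# A dict argument maps single characters to replacement strings.
# Illustrative subset: curly quotes to their ASCII counterparts.
quote_table = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})

def to_ascii_quotes(text):
    """Replace curly quotes with ASCII quotes in one pass over the text."""
    return text.translate(quote_table)
```

For example, to_ascii_quotes("\u201chi\u201d") gives '"hi"'.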

My first instinct was string.translate() in combination with string.maketrans(). This makes only one pass instead of several. Each call to str.replace() makes its own pass over the whole string, and you want to avoid that.


from string import ascii_lowercase, maketrans, translate

from_str = ascii_lowercase
to_str = from_str[-1]+from_str[0:-1]
foo = 'the quick brown fox jumps over the lazy dog.'
bar = translate(foo, maketrans(from_str, to_str))
print bar # sgd pthbj aqnvm enw itlor nudq sgd kzyx cnf.
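For anyone on Python 3, where `from string import maketrans, translate` no longer works, here is a rough port of the same shift using the built-in str.maketrans and str.translate. The names SHIFT_TABLE and shift_back are my own, not part of any library.

```python
from string import ascii_lowercase

# Map each lowercase letter to the one before it in the alphabet (a wraps to z).
from_str = ascii_lowercase
to_str = from_str[-1] + from_str[:-1]
SHIFT_TABLE = str.maketrans(from_str, to_str)

def shift_back(text):
    """Translate the whole string in one pass using the prebuilt table."""
    return text.translate(SHIFT_TABLE)

print(shift_back('the quick brown fox jumps over the lazy dog.'))
# sgd pthbj aqnvm enw itlor nudq sgd kzyx cnf.
```

Building the table once and reusing it keeps the per-call cost down to the single pass made by translate.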

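The problem with string.translate() and string.maketrans() is that I have to map each character to exactly one other character. For example:

import string
print string.maketrans("abc", "123")

However, I need to map an HTML/XML entity such as &lsquo; to the ASCII single quote ('). That means I would have to use the following code, which fails with the error below:

print string.maketrans("&lsquo;", "'")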


ValueError: maketrans arguments must have same length
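Since maketrans can only map single characters, one way around this error is a single regular-expression pass with a dictionary lookup. The sketch below is written for Python 3 and covers only a small, illustrative subset of the entities from the question. ENTITY_TO_TEXT, ENTITY_PATTERN and replace_entities are names I made up.

```python
import re

# Illustrative subset of the entity table from the question; extend as needed.
ENTITY_TO_TEXT = {
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&ldquo;": '"',
    "&rdquo;": '"',
    "&amp;": "&",
}

# One alternation that matches any key; re.escape guards special characters.
ENTITY_PATTERN = re.compile("|".join(re.escape(e) for e in ENTITY_TO_TEXT))

def replace_entities(text):
    """Replace every known entity in a single scan of the text."""
    return ENTITY_PATTERN.sub(lambda m: ENTITY_TO_TEXT[m.group(0)], text)
```

Because the text is scanned only once, the cost does not grow with one extra full pass per entity, and a replacement is never re-examined: "&amp;lsquo;" becomes "&lsquo;" and stays that way.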

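If I use HTMLParser instead, it converts all HTML/XML entities to the corresponding characters without the problem above. I also added an encode('utf-8') to get rid of the following error: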

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 246: ordinal not in range(128)

# -*- coding: utf-8 -*-
import re
from HTMLParser import HTMLParser

# This file contains HTML.
file = open('input-file.txt', 'r')
temp = file.read()

# Convert all XML/HTML entities to the characters they stand for.
temp = HTMLParser().unescape(temp)

# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)

# Encode the text as UTF-8 to avoid the UnicodeEncodeError above.
result = result.encode('utf-8')

# Write the result to a new file.
file = open("output-file.txt", "w")
file.write(result)
file.close()
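On Python 3 the same idea is available as html.unescape (Python 3.4 and later). Below is a minimal sketch, assuming the input file is UTF-8. html_to_text and convert_file are hypothetical names. Unlike the code above, it strips the tags before unescaping, so that escaped markup such as &lt;b&gt; survives as literal text, and it adds re.DOTALL so a tag may span several lines.

```python
import html
import re

# Same non-greedy tag pattern as above; DOTALL lets a tag span several lines.
TAG_PATTERN = re.compile(r"<.*?>", re.DOTALL)

def html_to_text(markup):
    """Strip tags, then turn entities such as &ldquo; into characters."""
    return html.unescape(TAG_PATTERN.sub("", markup))

def convert_file(src_path, dst_path, encoding="utf-8"):
    """Read src_path, convert it, and write the plain text to dst_path."""
    with open(src_path, encoding=encoding) as src:
        text = html_to_text(src.read())
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(text)
```

Opening the output file with an explicit encoding takes the place of the manual encode('utf-8') step, and the with blocks close both files automatically.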