Replace several words in a text with Python

Question

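I use the code below to strip all HTML tags from a file and convert it to plain text. In addition, I have to convert XML/HTML character entities to ASCII characters. To do that, I have 21 lines that each read through the entire text, which means that converting a huge file costs a lot of resources.

Do you have any ideas for making the code faster and more efficient while using fewer resources?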

# -*- coding: utf-8 -*-
import re

# This file contains HTML.
file = open('input-file.html', 'r')
temp = file.read()

# Replace Some XML/HTML characters to ASCII ones.
temp = temp.replace ('&lsquo;',"""'""")
temp = temp.replace ('&rsquo;',"""'""")
temp = temp.replace ('&ldquo;',"""\"""")
temp = temp.replace ('&rdquo;',"""\"""")
temp = temp.replace ('&sbquo;',""",""")
temp = temp.replace ('&prime;',"""'""")
temp = temp.replace ('&Prime;',"""\"""")
temp = temp.replace ('&laquo;',"""«""")
temp = temp.replace ('&raquo;',"""»""")
temp = temp.replace ('&lsaquo;',"""‹""")
temp = temp.replace ('&rsaquo;',"""›""")
temp = temp.replace ('&amp;',"""&""")
temp = temp.replace ('&ndash;',""" – """)
temp = temp.replace ('&mdash;',""" — """)
temp = temp.replace ('&reg;',"""®""")
temp = temp.replace ('&copy;',"""©""")
temp = temp.replace ('&trade;',"""™""")
temp = temp.replace ('&para;',"""¶""")
temp = temp.replace ('&bull;',"""•""")
temp = temp.replace ('&middot;',"""·""")

# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)
print(result)

# Write the result to a new file.
file = open("output-file.txt", "w")
file.write(result)
file.close()

Answer 1

You can use string.translate():

# Python 2 only: in Python 3 the helper is the built-in str.maketrans().
from string import maketrans

# maketrans() builds a translation table. Both strings must have the same
# length: the character at position i of intab maps to position i of outtab.
intab = "aeiou"   # characters to be replaced
outtab = "12345"  # their replacements
trantab = maketrans(intab, outtab)

text = "this is string example....wow!!!"  # your string
print text.translate(trantab)

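Note that in Python 3, str.translate is much slower than in Python 2, especially when you translate only a few characters. This is because it has to handle Unicode characters, so it uses a dict to perform the translation rather than indexing into a string.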
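To make the dict-based behaviour concrete, here is a minimal Python 3 sketch. It assumes the text has already been decoded to Unicode, so the curly quotes appear as single characters rather than as entities such as &lsquo;. The names QUOTE_TABLE and straighten_quotes are made up for this illustration.

```python
# Python 3 sketch: str.maketrans() accepts a dict whose keys are single
# characters and whose values may be strings of any length.
QUOTE_TABLE = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": " - ",  # en dash, mapped to a multi-character replacement
})


def straighten_quotes(text):
    """Apply every mapping in QUOTE_TABLE in one pass over text."""
    return text.translate(QUOTE_TABLE)


print(straighten_quotes("\u201cIt\u2019s fine\u201d"))
```

The keys still have to be single characters, so this does not help with multi-character entity names by itself.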

Answer 2

My first instinct is string.translate() in combination with string.maketrans(). This makes only one pass instead of several. Each call to str.replace() makes its own pass over the whole string, and that is what you want to avoid.

An example:

from string import ascii_lowercase, maketrans, translate

from_str = ascii_lowercase
to_str = from_str[-1]+from_str[0:-1]
foo = 'the quick brown fox jumps over the lazy dog.'
bar = translate(foo, maketrans(from_str, to_str))
print bar # sgd pthbj aqnvm enw itlor nudq sgd kzyx cnf.
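The single-pass idea can also be applied to multi-character tokens such as &lsquo;, which a character translation table cannot express. The following is only a sketch of one possible approach, assuming Python 3: a compiled regular expression that matches any known entity, plus a dict lookup for the replacement. ENTITY_MAP is a hypothetical subset of the table in the question, and replace_entities is a name invented here.

```python
import re

# Hypothetical subset of the entity table from the question.
ENTITY_MAP = {
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&ldquo;": '"',
    "&rdquo;": '"',
    "&amp;": "&",
    "&ndash;": " \u2013 ",
    "&mdash;": " \u2014 ",
}

# One alternation that matches any key; re.escape() guards against
# regex metacharacters in the keys.
ENTITY_RE = re.compile("|".join(re.escape(key) for key in ENTITY_MAP))


def replace_entities(text):
    """Replace every known entity in a single scan of text."""
    return ENTITY_RE.sub(lambda match: ENTITY_MAP[match.group(0)], text)


print(replace_entities("&ldquo;Tom &amp; Jerry&rdquo;"))
```

Whether this actually beats a chain of str.replace() calls depends on the input and the number of entities, so it is worth timing both on a representative file.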

Answer 3

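The problem with string.translate() and string.maketrans() is that they can only map one character to another single character. For example:

print string.maketrans("abc","123")

However, I need to map HTML/XML entities such as &lsquo; to the ASCII single quote ('). That means I would have to use the following code:

print string.maketrans("&lsquo;","'")

It fails with the following error: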

ValueError: maketrans arguments must have same length

However, if I use HTMLParser, it converts all HTML/XML entities to the corresponding characters without the above problem. I also added an encode('utf-8') to fix the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 246: ordinal not in range(128)

# -*- coding: utf-8 -*-
# Python 2 code: the HTMLParser module was renamed html.parser in Python 3.
import re
from HTMLParser import HTMLParser

# This file contains HTML.
with open('input-file.txt', 'r') as infile:
    temp = infile.read()

# Convert all XML/HTML entities to the characters they stand for.
temp = HTMLParser().unescape(temp)

# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)

# Encode the text as UTF-8 to avoid the UnicodeEncodeError shown above.
result = result.encode('utf-8')
print(result)

# Write the result to a new file.
with open("output-file.txt", "w") as outfile:
    outfile.write(result)
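For readers on Python 3, where HTMLParser.unescape() is no longer available (it was removed in Python 3.9), a rough equivalent can be sketched with the standard-library function html.unescape(). This is an illustration, not the code from the answer: the function names are invented, the files are assumed to be UTF-8, and tags are stripped before the entities are decoded so that text written as &lt;b&gt; is not mistaken for a tag, which is the reverse of the order used above.

```python
import html
import re

# Same non-greedy tag pattern as in the question, compiled once. It is a
# rough heuristic, not a real HTML parser, and does not match across newlines.
TAG_RE = re.compile(r"<.*?>")


def html_to_text(markup):
    """Strip tags, then decode entities such as &lsquo; or &amp;."""
    without_tags = TAG_RE.sub("", markup)
    return html.unescape(without_tags)


def convert_file(src_path, dst_path):
    """Read src_path as UTF-8 HTML and write the plain text to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        text = html_to_text(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)


print(html_to_text("<p>&ldquo;Hi&rdquo; &amp; bye</p>"))
```

Opening the output file with an explicit encoding takes the place of the manual encode('utf-8') step.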

python

unicode

performance

processing-efficiency