Problems converting HTML to ASCII and back, using Python (html2text, textile)

Question

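I'm trying to convert HTML text to ASCII, process it in that form, and then convert it back to HTML.

So far, while testing the basic structure of the script, I've run into the problem that textile does not convert everything back into readable HTML.

I think this is caused by the indented output, which textile has trouble reading, but this is where I'm stuck.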

import html2text
import textile

# HTML -> plain text (html2text emits Markdown)
h = html2text.html2text('<p><strong>This is a test:</strong></p><ul><li>This text will be converted to ascii</li><li>and then&nbsp;<strong>translated</strong></li><li>and lastly converted back to HTML</li></ul>')
print(h)

print('------------Converting Back to HTML-----------------------------')

# plain text -> HTML (textile expects Textile markup)
html = textile.textile(h)
print(html)

This is the output I get:

**This is a test:**

  * This text will be converted to ascii
  * and then  **translated**
  * and lastly converted back to HTML


------------Converting Back to HTML-----------------------------
    <p><b>This is a test:</b></p>

  * This text will be converted to ascii
  * and then  <b>translated</b>
  * and lastly converted back to <span class="caps">HTML</span>

I should add that in the future I will be using HTML data from an Excel sheet.

Answer 1

There are two ways to do this.

The first way:

def html_encode(html):
    return html.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;')

The second way:

def html_decode(s):
    htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;')
        )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s

Usage:

examplehtml = "<html><head></head></html>"
examplehtml2 = "&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;/html&gt;"

print(html_encode(examplehtml))   # escapes the markup characters
print(html_decode(examplehtml2))  # turns the entities back into markup
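
As a side note, the standard library's html module covers the same ground as the two hand-written functions above. The sketch below is only an illustration, and the variable names are made up for the example. One difference worth knowing: html.escape writes a single quote as &#x27; rather than &#39;, while html.unescape accepts both forms.

```python
import html

raw = "<html><head></head></html>"

# Escape &, < and >; quote=True (the default) also escapes " and '
escaped = html.escape(raw, quote=True)
print(escaped)

# Convert the entities back into the original characters
restored = html.unescape(escaped)
print(restored)
```
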

Answer 2

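One important thing to note is that html2text converts HTML to Markdown, not Textile, so getting a correct result is partly a coincidence. I would suggest looking for a converter that understands the markup language you are actually using. Pandoc can convert to and from almost any format.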
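
A minimal sketch of the Pandoc route, assuming the pandoc command-line tool is installed and on PATH. The helper names build_pandoc_command and convert_with_pandoc are hypothetical and not part of any library; the only Pandoc features relied on are the --from and --to options and the html and textile format names.

```python
import shutil
import subprocess


def build_pandoc_command(source_format="html", target_format="textile"):
    """Return the argument list for a pandoc call that reads stdin and writes stdout."""
    return ["pandoc", "--from", source_format, "--to", target_format]


def convert_with_pandoc(text, source_format="html", target_format="textile"):
    """Pipe text through pandoc; raises if pandoc is missing or exits with an error."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        build_pandoc_command(source_format, target_format),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    sample = "<p><strong>This is a test:</strong></p><ul><li>one</li><li>two</li></ul>"
    if shutil.which("pandoc"):
        print(convert_with_pandoc(sample))
    else:
        print("pandoc is not installed; nothing converted")
```

Swapping the two format arguments (textile to html) gives the return trip, so both directions go through one tool that knows the markup it is reading.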

That said, you are right that the indentation is what breaks the list, and it can be fixed with a simple text replacement on h:

html = textile.textile(h.replace("\n  *", "\n*"))
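
The replace call above only matches a bullet indented by exactly two spaces and preceded by a newline. A slightly more tolerant variant, offered as a sketch rather than a definitive fix, uses a regular expression to strip any leading spaces or tabs in front of a bullet. The function name markdown_bullets_to_textile is made up for this example. It flattens nested lists, since Textile marks nesting with repeated asterisks rather than indentation.

```python
import re


def markdown_bullets_to_textile(text):
    """Move indented '*' bullets to column 0 so they start a Textile list item."""
    # (?m) makes ^ match at every line start; the pattern requires whitespace
    # after the asterisk, so indented '**bold**' text is left alone.
    return re.sub(r"(?m)^[ \t]+\*[ \t]+", "* ", text)


h = (
    "**This is a test:**\n\n"
    "  * This text will be converted to ascii\n"
    "  * and then  **translated**\n"
    "  * and lastly converted back to HTML\n\n"
)
print(markdown_bullets_to_textile(h))
```

The cleaned string can then be passed on as textile.textile(markdown_bullets_to_textile(h)).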
