How Can I Output Non-Escaped Element Tag In XML?

Question

I have inherited a Python script. My problem is that the paragraph variable now contains text that includes anchor tags. For example:

This is text with a <a href="http://somewebsite.com">Link</a> in it.

However, I need to convert the anchor tags to the apxh namespace, so the line above should look like this:

This is text with a <apxh:a href="http://somewebsite.com">Link</apxh:a> in it.

The problem is that it is currently output like this instead:

This is text with a &lt;apxh:a href=\"http://somewebsite.com;\"&gt;Link Text;&lt;/apxh:a&gt; in it.

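My guess is that when I run the for loop over paragraph, I need to work out how to find all the anchor tags and their text and do something like etree.Element("{%s}a" % nm["apxh"], nsmap=nm), but I am not quite sure how.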

Here is the current script:

def get_news_feed(request):
    articles = models.Article.objects.all().filter(distributable = True)

    nm = {
            None: "http://www.w3.org/2005/Atom",
            "ap": "http://ap.org/schemas/03/2005/aptypes",
            "apcm": "http://ap.org/schemas/03/2005/apcm",
            "apnm": "http://ap.org/schemas/03/2005/apnm",
            "apxh": "http://www.w3.org/1999/xhtml",
            }

    doc = etree.Element("{%s}feed" % nm[None], nsmap=nm)

    for article in articles:
        entry = etree.Element("{%s}entry" % nm[None], nsmap=nm)
        content = etree.Element("{%s}content" % nm[None], nsmap=nm)
        content.set("type", "xhtml")

        div = etree.Element("{%s}div" % nm["apxh"], nsmap=nm)
        for paragraph in article.body.replace("&amp;", "&").split("\n"):
            par = etree.Element("{%s}p" % nm["apxh"], nsmap=nm)
            par.text = paragraph            
            par.text = paragraph.replace("<a", "<apxh:a")            
            par.text = par.text.replace("</a", "</apxh:a")  
            par.text = cleanup_entities(par.text)
            div.append(par)
        content.append(div)
        entry.append(content)

        doc.append(entry)

    output = etree.tostring(doc, encoding="UTF-8", xml_declaration=True, pretty_print=True)
    return HttpResponse(output, mimetype="application/xhtml+xml")

The output should look like this:

<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns:ap="http://ap.org/schemas/03/2005/aptypes" xmlns:apxh="http://www.w3.org/1999/xhtml" xmlns:apnm="http://ap.org/schemas/03/2005/apnm" xmlns:apcm="http://ap.org/schemas/03/2005/apcm" xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="xhtml">
      <apxh:div>
        <apxh:p>This is some text</apxh:p>
        <apxh:p>This is text with a <apxh:a href="http://somewebsite.com">Link</apxh:a> in it.</apxh:p>
        <apxh:p>Theater</apxh:p>
      </apxh:div>
    </content>
  </entry>
</feed>

This is what the current output looks like:

<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns:ap="http://ap.org/schemas/03/2005/aptypes" xmlns:apxh="http://www.w3.org/1999/xhtml" xmlns:apnm="http://ap.org/schemas/03/2005/apnm" xmlns:apcm="http://ap.org/schemas/03/2005/apcm" xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="xhtml">
      <apxh:div>
        <apxh:p>This is some text</apxh:p>
        <apxh:p>This is text with a &lt;apxh:a href=\"http://somewebsite.com;\"&gt;Link Text;&lt;/apxh:a&gt; in it.</apxh:p>
        <apxh:p>Theater</apxh:p>
      </apxh:div>
    </content>
  </entry>
</feed>

Answer 1

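Don't inject your content as literal text. Instead, parse it into DOM content, wrapping it in a root element whose default namespace is the same one that your nsmap maps to apxh: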

import lxml.etree as etree
text='This is text with a <a href="http://somewebsite.com">Link</a> in it.'
text_el = etree.fromstring('<root xmlns="http://www.w3.org/1999/xhtml">' + text + '</root>')

...then put the contents of text_el into your par.

Doing so might look like the following:

par = etree.Element('{http://www.w3.org/1999/xhtml}p', nsmap=nm)
par.text = text_el.text
for child_el in text_el[:]:
  par.append(child_el)

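Because the above uses the nsmap nm, turning the tree back into a string will honor the namespace prefixes it contains, so the apxh prefix is used for the content that was parsed in the default namespace (which we set with xmlns= on the artificial root).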
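Putting the two snippets above together, a minimal sketch might look like the following. The helper name paragraph_to_element is hypothetical, nm is trimmed to the two namespaces that matter here, and the sketch assumes each paragraph is a well-formed XML fragment:

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Trimmed copy of the question's namespace map (assumption: only these two matter here).
nm = {
    None: "http://www.w3.org/2005/Atom",
    "apxh": XHTML_NS,
}


def paragraph_to_element(text, nsmap):
    """Parse one paragraph of markup into an XHTML p element (hypothetical helper)."""
    # The temporary root puts unprefixed tags such as <a> into the XHTML namespace.
    wrapper = etree.fromstring('<root xmlns="%s">%s</root>' % (XHTML_NS, text))
    par = etree.Element("{%s}p" % XHTML_NS, nsmap=nsmap)
    par.text = wrapper.text       # text before the first child element
    for child in wrapper[:]:      # iterate over a copy, since append() moves nodes
        par.append(child)         # lxml moves the child's tail text along with it
    return par


par = paragraph_to_element(
    'This is text with a <a href="http://somewebsite.com">Link</a> in it.', nm)
serialized = etree.tostring(par)
print(serialized.decode("utf-8"))
```

In the script's loop, div.append(paragraph_to_element(paragraph, nm)) would then take the place of the par.text assignments. Note that a bare & is not well-formed XML, so the .replace("&amp;", "&") call in the loop would likely have to go before paragraphs are parsed this way.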

In the discussion in the comments, it came out that some of your production data looks like this:

u'John Doe: 360-555-4546; <a href=\"mailto:john.doe@website.com;\">John.mailto:john.doe@website.com</a> twitter.com/JohnDoe'

etree.fromstring() will throw an exception when given this input, because the backslashes mean it is not valid XML (or valid XHTML).
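The failure is easy to reproduce in isolation. This is a small sketch with a made-up paragraph shaped like the production data (the string and the parsed_ok flag are illustrative only):

```python
from lxml import etree

# Made-up paragraph: the attribute quotes are preceded by literal backslashes.
bad = 'Contact: <a href=\\"mailto:john.doe@website.com\\">John</a>'

try:
    etree.fromstring("<root>" + bad + "</root>")
    parsed_ok = True
except etree.XMLSyntaxError as exc:
    # The parser expects a quote right after href= and finds a backslash instead.
    parsed_ok = False
    print("not well-formed:", exc)
```

Catching etree.XMLSyntaxError per paragraph is one way to log or skip bad input rather than failing the whole feed; whether that is acceptable depends on your data.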

If you are very sure that \" will never occur in valid input, you might consider:

text_el = etree.fromstring(
  '<root xmlns="http://www.w3.org/1999/xhtml">' +
  text.replace('\\"', '"') +  # '\\"' is a literal backslash followed by a quote
  '</root>')
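One detail worth checking when stripping the backslashes: in Python source, '\"' is just an escape sequence for a double quote, so the search pattern has to spell the backslash explicitly ('\\"' or r'\"'). A short sketch with an illustrative string shows the difference:

```python
# Raw string, so the backslashes are really part of the data.
raw = r'<a href=\"http://somewebsite.com\">Link</a>'

# '\"' is the same string as '"', so this replaces quotes with quotes: no change.
unchanged = raw.replace('\"', '"')

# '\\"' is a literal backslash followed by a double quote.
cleaned = raw.replace('\\"', '"')

print(unchanged == raw)
print(cleaned)
```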

python

xml

html-entities

python-2.5