使用 Python 转义 XML 中的符号和其他特殊字符

Question

我必须解析大量 XML 文件并将其写入文本文件。但是，某些 XML 文件包含 special/illegal 个字符。我尝试过使用不同的方法，例如 Escaping strings for use in XML 但我无法让它工作。我知道解决此问题的一种方法是编辑 XML 文件本身，但有数千个文件。

import os
from xml.etree import ElementTree
from xml.sax.saxutils import escape

fileNum = 0;
saveFile = open('NewYork_1.txt','w')

for path, dirs, files in os.walk("NewYork_1"):
   for f in files:
      fileName = os.path.join(path, f)
      with open(fileName,'r', encoding='utf-8') as myFile:
        # print(myFile.read())
        if "&" in myFile:
            myFile = myFile.replace("&", "&#38;") #This does not work.

此外，一些 XML 文件中也有 unicode。有表情包。

<Thread>
   <ThreadID></ThreadID>
   <Title></Title>
   <InitPost>
      <UserID></UserID>
      <Date></Date>
      <icontent></icontent>
   </InitPost>
   <Post>
      <UserID></UserID>
      <Date></Date>
      <rcontent></rcontent>
  </Post>
</Thread>

Answer 1

试试这个：

import os
from xml.etree import ElementTree
from xml.sax.saxutils import escape

fileNum = 0;
saveFile = open('NewYork_1.txt','w')
saveFile.close()
for path, dirs, files in os.walk("NewYork_1"):
   for f in files:
       fileName = os.path.join(path, f)
       with open(fileName,'a', encoding='utf-8') as myFile:
           myFile=myFile.read()
           if "&" in myFile:
                myFile = myFile.replace("&", "&#38;")

我个人会生成一个文件列表来读取和遍历该列表，而不是使用 os.walk（如果您从以前的函数或单独的脚本中获取列表，您总是可以创建一个文本文件，每个 txt 文件在单独的行上，并遍历行而不是从变量获取以节省 RAM），但每个 txt 文件都是他自己的。

不过正如我所说，我会放弃替换特殊字符的整个想法并使用 bs4 打开文件，搜索您要查找的元素，然后从那里获取。

import bs4
list_of_USER_IDs=[]
with open(fileName,'r', encoding='utf-8') as myFile:
    a=myFile.read()
    b=bs4.beautifulSoup(a)
    for elem in b.findAll('USERID'):
        list_of_USER_IDs.append(elem)

returns USERID 标签之间的数据，但它适用于您想要从中获取数据的任何标签。不用直接解析xml。它与 HTML 并没有太大区别，而 beautifulSoup 就是为此而生的，那么为什么要重新发明轮子呢？

Answer 2

您可以将文件作为字符串读入。

file_str = open("xml_file.xml", "r+")

然后使用xml.etree.ElementTree的fromString函数

tree = xml.etree.ElementTree.fromString(file_str)

然后用它做点什么。

root = tree.getRoot()

这样您就不必担心为每个表情符号和无效符号编写代码。如果要将表情符号和特殊的 utf-8 字符写入 xml，则必须将此行添加到顶部。

 <?xml version="1.0" encoding="utf-8" ?>

Answer 3

还有一件事...

您可以使用 BeautifulSoup 来查找 xml 文件中您要从中提取数据的部分。会根据需要错开处理，而不是将整个文件加载到一个变量中，然后逐个元素地迭代。

例如：

如果您的 xml 文件包含如下标签：

<pos reviews>
<review_id>001<review_id>
<reviewer_name>John Stamos (From full house)</reviewer_name>
<review_text>This is a great item</review_text>

您可以 beautifulsoup.findAll（任何标签）并获取该数据。不过，我不知道你具体在做什么。觉得值得推荐。

使用 Python 转义 XML 中的符号和其他特殊字符

Escaping ampersand and other special characters in XML using Python

python

xml

elementtree

xml-parsing

python-3.x