Arabic text showing only as character entities in lxml output

Question

My XML file, S005_179-205M-2 formated.xml:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"   xml:base="http://example.org" xml:id="example_v1" >
    <teiHeader>
    <fileDesc>
            <titleStmt>
        <title>test</title>
    </titleStmt>
            <publicationStmt>
        <p>test</p>
    </publicationStmt>
            <sourceDesc>
            <p>test</p>
        </sourceDesc>
    </fileDesc>
    </teiHeader>
    <text xml:lang="ar">
        <body>
<div type="chapter" n="5" xml:lang="ar">

<div type="section" n="5.179">
<head type="30">الْقَوْلُ فِي تَأْوِيلِ قَوْلِهِ : <quote type="quran" n="5:74">أَفَلا يَتُوبُونَ إِلَى اللَّهِ وَيَسْتَغْفِرُونَهُ وَاللَّهُ غَفُورٌ رَحِيمٌ </quote></head>
<p n="nothadith" ana="adyan kalam yes">يقول تعالى ذكره : أفلا يرجع هذان الفريقان <name
                            role="organization">الكافران</name> ، القائل أحدهما : <quote
                            type="quran" n="5:72">إِنَّ اللَّهَ هُوَ <name role="person">الْمَسِيحُ
                                ابْنُ مَرْيَمَ</name>
                        </quote> ، والآخر القائل : <quote type="quran" n="5:73">إِنَّ اللَّهَ
                            ثَالِثُ ثَلاثَةٍ </quote> ، عما قالا من ذلك ، و ينيبان مما قالا ونطقا به
                        من كفرهما ، ويسألان ربهما المغفرة مما قالا : <quote type="quran" n="5:74"
                            >وَاللَّهُ غَفُورٌ </quote> ، لذنوب التائبين من خلقه ، المنيبين إلى <pb
                            type="turki" n="8:582"/> طاعته بعد معصيتهم ، <quote type="quran"
                            n="5:34">رَحِيمٌ </quote> بهم في قبوله توبتَهم ، ومراجعتَهم إلى ما يحب
                        مما يكره ، فيصفح بذلك من فعلهم عما سلف من إجرامهم قبل ذلك . </p>
</div>

</div>

        </body>
    </text>
</TEI>

I read the file with:

from lxml import etree

tree = etree.parse('S005_179-205M-2 formated.xml')

and print the tree with:

root = tree.getroot()
print(etree.tostring(root))

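In the output, the Arabic text appears only as character entities.

It should print in Arabic. I checked, and the parser does not appear to be reading the text as Arabic. How can I make sure the parser parses it as Unicode?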

Answer 1

The code below parses the XML and extracts some information from it:

import xml.etree.ElementTree as ET


xml = '''<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:base="http://example.org" xml:id="example_v1">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>test</title>
         </titleStmt>
         <publicationStmt>
            <p>test</p>
         </publicationStmt>
         <sourceDesc>
            <p>test</p>
         </sourceDesc>
      </fileDesc>
   </teiHeader>
   <text xml:lang="ar">
      <body>
         <div type="chapter" n="5" xml:lang="ar">
            <div type="section" n="5.179">
               <head type="30">
                  الْقَوْلُ فِي تَأْوِيلِ قَوْلِهِ :
                  <quote type="quran" n="5:74">أَفَلا يَتُوبُونَ إِلَى اللَّهِ وَيَسْتَغْفِرُونَهُ وَاللَّهُ غَفُورٌ رَحِيمٌ</quote>
               </head>
               <p n="nothadith" ana="adyan kalam yes">
                  يقول تعالى ذكره : أفلا يرجع هذان الفريقان
                  <name role="organization">الكافران</name>
                  ، القائل أحدهما :
                  <quote type="quran" n="5:72">
                     إِنَّ اللَّهَ هُوَ
                     <name role="person">الْمَسِيحُ
                                ابْنُ مَرْيَمَ</name>
                  </quote>
                  ، والآخر القائل :
                  <quote type="quran" n="5:73">إِنَّ اللَّهَ
                            ثَالِثُ ثَلاثَةٍ</quote>
                  ، عما قالا من ذلك ، و ينيبان مما قالا ونطقا به
                        من كفرهما ، ويسألان ربهما المغفرة مما قالا :
                  <quote type="quran" n="5:74">وَاللَّهُ غَفُورٌ</quote>
                  ، لذنوب التائبين من خلقه ، المنيبين إلى
                  <pb type="turki" n="8:582" />
                  طاعته بعد معصيتهم ،
                  <quote type="quran" n="5:34">رَحِيمٌ</quote>
                  بهم في قبوله توبتَهم ، ومراجعتَهم إلى ما يحب
                        مما يكره ، فيصفح بذلك من فعلهم عما سلف من إجرامهم قبل ذلك .
               </p>
            </div>
         </div>
      </body>
   </text>
</TEI>'''

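# Parse the XML string; element text is held as Unicode str from here on.
root = ET.fromstring(xml)
# Find every TEI <quote>, at any depth, using the namespace-qualified tag.
for idx, quote in enumerate(root.findall('.//{http://www.tei-c.org/ns/1.0}quote'), 1):
    # .text is only the text before the first child element (it may be None).
    print(f'{idx}): {(quote.text or "").strip()}')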

Output:

1): أَفَلا يَتُوبُونَ إِلَى اللَّهِ وَيَسْتَغْفِرُونَهُ وَاللَّهُ غَفُورٌ رَحِيمٌ
2): إِنَّ اللَّهَ هُوَ
3): إِنَّ اللَّهَ
                            ثَالِثُ ثَلاثَةٍ
4): وَاللَّهُ غَفُورٌ
5): رَحِيمٌ
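
Note that quote.text stops at the first child element, which is why item 2 above ends before the nested <name>. If the full text of each quote is wanted, something like the following may help. It is only a sketch: the shortened sample document and the helper name full_text are made up for illustration, and collapsing whitespace is an assumption about what the output should look like.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Shortened, hypothetical sample modelled on the question's file.
sample = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    '<quote type="quran" n="5:72">إِنَّ اللَّهَ هُوَ '
    '<name role="person">الْمَسِيحُ\n      ابْنُ مَرْيَمَ</name> </quote>'
    '</p>'
)


def full_text(element):
    """Join the text of an element and all its descendants,
    collapsing every run of whitespace into a single space."""
    return " ".join("".join(element.itertext()).split())


root = ET.fromstring(sample)
quotes = [full_text(q) for q in root.findall(".//tei:quote", TEI_NS)]
for idx, text in enumerate(quotes, 1):
    print(f"{idx}): {text}")
```

itertext() yields the element's own text, the text of each descendant, and the tails in document order, so the nested name is included and the line break inside it disappears after the whitespace is collapsed.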

Answer 2

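Your parser is already parsing the text as Unicode; it is tostring that is not writing Unicode. By default it serializes to ASCII bytes, so every non-ASCII character is escaped as a numeric character reference.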

Use etree.tostring(root, encoding="unicode") to get a str, or etree.tostring(root, encoding="utf-8") to get UTF-8 encoded bytes.
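
To see the difference side by side, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies; lxml's etree.tostring accepts the same encoding argument, as described above. The one-line sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document standing in for the question's file.
root = ET.fromstring("<p>غَفُورٌ رَحِيمٌ</p>")

# The parser already holds the Arabic as a normal Python str.
print(root.text)

# Default: ASCII bytes, so non-ASCII characters become &#NNNN; references.
as_ascii = ET.tostring(root)

# encoding="unicode" returns a str with the characters written literally.
as_text = ET.tostring(root, encoding="unicode")

# encoding="utf-8" returns bytes; decode them before printing.
as_utf8 = ET.tostring(root, encoding="utf-8")

print(as_ascii)
print(as_text)
print(as_utf8.decode("utf-8"))
```

The first serialized form is the one shown in the question: the data is intact, only escaped. Whether the Arabic then renders correctly on screen still depends on the terminal or editor being able to display UTF-8.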

python

xml

lxml