xml.etree.ElementTree: Why is the text under a tag not being picked up?

I wrote this script to download the web page in question (the URL is in the code) into an XML file:

from urllib.request import urlopen

url = 'http://mahmi.org/api/peptides/sourceProteins/241282699'

# Download the response and save it as UTF-8 text; the file is closed automatically
with open("import.xml", "w", encoding="utf-8") as xml:
    xml.write(urlopen(url).read().decode('utf-8'))

When I open the file 'import.xml', I can see that the data is there; the file starts like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><sourceProteins><sourceProtein><protein><id>2232238</id><sequence>MLLTNFQNFASLHAVPVAQIRAMEACPLPTEPIRCVIRELDVSKLTPDQLTQLNEVIDGYNKDLAFMIEELHKRANRRYCHGKNFIKWRGLLRAAHAVVHAALPPGMQKTHLLSKGGLQGKMWKTALEDACSTMDRYWRSIQVAVYCELRNKEFYSKLNDAEKYYVGCLLNNTGYLFFDMLDGKTPKPALPNKLKGKLSDPRNLCRKVRATVRRHSRRLPRHGVDRSCSLTTECYSVTQDSQGNQTISVITNTRGKRLLIPVKGKGRIGRTIKIVRDNGKFYLHIPLKTPVVPFEHIPRAPLAAGKATLHCTALDMGYTEVFTDDAGNFYGTELGKTLDAIGRKLDEVYRERNRWHARYRNEKDDKKKLNILRFNLGRKKLDAFETRARARVVCLVNKAINDIMAMRPADVYLIERFGQQFNFAGLSKKTRRKLSGWIRGTIEERFFFKASIHGAKAVYVPASYSSRRCPVCGYVHKTNRNGD</sequence><name>T2D-154A_GL0135792</name></protein><uniprotData><uniprotId>O66401</uniprotId><uniprotOrganism>Aquifex aeolicus (strain VF5)</uniprotOrganism><uniprotProtein>YZ05_AQUAE Putative...

Now I want to read that file back in and, for example, print all the text under the uniprotData tag.

I wrote this code:

import xml.etree.ElementTree as ET
fileopen = open('import.xml').read()
root = ET.fromstring(fileopen)
for x in root.iter('uniprotData'):
    print(x.text)

But the output is 'None'. Can someone explain why this happens?

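As suggested in the comments, you need to iterate once more, this time over x itself. x is of type xml.etree.ElementTree.Element, and its .text attribute holds only the text that appears directly after its opening tag, before the first child element. <uniprotData> contains nothing but child elements, so x.text is None; the text you are after belongs to those children.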
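To make the difference concrete, here is a minimal sketch that parses a trimmed-down stand-in for the response instead of the real file. The `sample` string and the names `container` and `leaf` are assumptions for illustration; only the tag names and values are taken from the question's excerpt.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for import.xml (values copied from the question's excerpt)
sample = (
    "<sourceProteins><sourceProtein>"
    "<uniprotData>"
    "<uniprotId>O66401</uniprotId>"
    "<uniprotOrganism>Aquifex aeolicus (strain VF5)</uniprotOrganism>"
    "</uniprotData>"
    "</sourceProtein></sourceProteins>"
)

root = ET.fromstring(sample)
container = root.find(".//uniprotData")  # element that only wraps other elements
leaf = container.find("uniprotId")       # element that actually holds text

container_text = container.text  # no text between <uniprotData> and its first child
leaf_text = leaf.text            # text sits directly inside <uniprotId>

print(repr(container_text))
print(repr(leaf_text))
```

The wrapper element reports no text of its own, while the child element does.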

Code:

import xml.etree.ElementTree as ET

fileopen = open('import.xml').read()
root = ET.fromstring(fileopen)
for x in root.iter('uniprotData'):
    print(type(x))
    # <class 'xml.etree.ElementTree.Element'>

    for child in x:
        print("{:20} -> {}".format(child.tag,child.text))
        # uniprotId            -> O66401
        # uniprotOrganism      -> Aquifex aeolicus (strain VF5)
        # uniprotProtein       -> YZ05_AQUAE Putative transposase aq_aa05
        # uniprotGene          -> aq_aa05
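If you only want the text and do not need the tag names, two other ElementTree methods may be worth a look: itertext(), which yields the text of an element and all of its descendants in document order, and findtext(), which returns the text of the first matching child. The following is a sketch under the assumption that uniprotData contains just the four children shown in the output above; the `sample` string is a hypothetical stand-in for the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for import.xml, built from the four values printed above
sample = (
    "<sourceProteins><sourceProtein>"
    "<uniprotData>"
    "<uniprotId>O66401</uniprotId>"
    "<uniprotOrganism>Aquifex aeolicus (strain VF5)</uniprotOrganism>"
    "<uniprotProtein>YZ05_AQUAE Putative transposase aq_aa05</uniprotProtein>"
    "<uniprotGene>aq_aa05</uniprotGene>"
    "</uniprotData>"
    "</sourceProtein></sourceProteins>"
)

root = ET.fromstring(sample)
for data in root.iter("uniprotData"):
    # Collect every piece of text nested anywhere under <uniprotData>
    texts = list(data.itertext())
    # Look up one specific child by tag; fall back to "" if it is missing
    organism = data.findtext("uniprotOrganism", default="")
    print(texts)
    print(organism)
```

itertext() is convenient when the nesting depth is unknown, while findtext() is handy when you know exactly which field you need.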