xml.etree.ElementTree: Why is the text under a tag not being picked up?
I wrote this script to scrape the web page in question (the URL is in the code) into an XML file:
from urllib.request import urlopen

# Download the XML response and save it to import.xml
url = 'http://mahmi.org/api/peptides/sourceProteins/241282699'
with open("import.xml", "w", encoding="utf-8") as xml:
    xml.write(urlopen(url).read().decode('utf-8'))
When I open the file 'import.xml', I can see that the data is there; the beginning of the file looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><sourceProteins><sourceProtein><protein><id>2232238</id><sequence>MLLTNFQNFASLHAVPVAQIRAMEACPLPTEPIRCVIRELDVSKLTPDQLTQLNEVIDGYNKDLAFMIEELHKRANRRYCHGKNFIKWRGLLRAAHAVVHAALPPGMQKTHLLSKGGLQGKMWKTALEDACSTMDRYWRSIQVAVYCELRNKEFYSKLNDAEKYYVGCLLNNTGYLFFDMLDGKTPKPALPNKLKGKLSDPRNLCRKVRATVRRHSRRLPRHGVDRSCSLTTECYSVTQDSQGNQTISVITNTRGKRLLIPVKGKGRIGRTIKIVRDNGKFYLHIPLKTPVVPFEHIPRAPLAAGKATLHCTALDMGYTEVFTDDAGNFYGTELGKTLDAIGRKLDEVYRERNRWHARYRNEKDDKKKLNILRFNLGRKKLDAFETRARARVVCLVNKAINDIMAMRPADVYLIERFGQQFNFAGLSKKTRRKLSGWIRGTIEERFFFKASIHGAKAVYVPASYSSRRCPVCGYVHKTNRNGD</sequence><name>T2D-154A_GL0135792</name></protein><uniprotData><uniprotId>O66401</uniprotId><uniprotOrganism>Aquifex aeolicus (strain VF5)</uniprotOrganism><uniprotProtein>YZ05_AQUAE Putative...
Now I want to read that file back in and, for example, print all the text under the uniprotData tag. I wrote this code:
import xml.etree.ElementTree as ET
fileopen = open('import.xml').read()
root = ET.fromstring(fileopen)
for x in root.iter('uniprotData'):
    print(x.text)
But the output is 'None'.
Can someone explain why this happens?
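As suggested in the comments, you need to iterate over x once more. x is an xml.etree.ElementTree.Element, and its .text attribute only holds the text that sits directly between the opening <uniprotData> tag and its first child. There is no such text here, so x.text is None; the values you are after are the text of the child elements.
Code: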
import xml.etree.ElementTree as ET

fileopen = open('import.xml').read()
root = ET.fromstring(fileopen)
for x in root.iter('uniprotData'):
    print(type(x))
    # <class 'xml.etree.ElementTree.Element'>
    # Each child element carries its own text
    for child in x:
        print("{:20} -> {}".format(child.tag, child.text))
    # uniprotId            -> O66401
    # uniprotOrganism      -> Aquifex aeolicus (strain VF5)
    # uniprotProtein       -> YZ05_AQUAE Putative transposase aq_aa05
    # uniprotGene          -> aq_aa05
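If you only want the text and do not care about the tag names, Element.itertext() is another option. Below is a minimal sketch, not tested against the real API response: SAMPLE is a hypothetical, trimmed stand-in for import.xml that I made up from the first two uniprotData fields shown in the question, so the snippet runs without downloading anything.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for import.xml (assumption: same nesting
# as the file in the question, with only two of the uniprotData children).
SAMPLE = (
    "<sourceProteins><sourceProtein>"
    "<uniprotData>"
    "<uniprotId>O66401</uniprotId>"
    "<uniprotOrganism>Aquifex aeolicus (strain VF5)</uniprotOrganism>"
    "</uniprotData>"
    "</sourceProtein></sourceProteins>"
)

root = ET.fromstring(SAMPLE)
node = root.find(".//uniprotData")

# No text sits directly inside <uniprotData>, so .text is None
direct_text = node.text

# itertext() walks the element and all of its descendants in document order
all_text = list(node.itertext())

# Mapping of child tag to child text, if you need to look values up by name
by_tag = {child.tag: child.text for child in node}

print(direct_text)
print(all_text)
print(by_tag["uniprotId"])
```

Note that itertext() also yields any whitespace between tags, so with a pretty-printed file you would probably want to strip and filter the pieces.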