Two tags with the same name but different locations in XML
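I want to turn an XML file into a JSON file in Python. I am currently trying to extract the information from the XML file so I can put it into a dictionary or a DataFrame.
This is the XML file: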
<?xml version="1.0" encoding="UTF-8"?>
<Terms>
<Term>
<Title>.177 (4.5mm) Airgun</Title>
<Description>The standard airgun calibre for international target shooting.</Description>
<RelatedTerms>
<Term>
<Title>Shooting sport equipment</Title>
<Relationship>Narrower Term</Relationship>
</Term>
</RelatedTerms>
</Term>
<Term>
<Title>.22</Title>
<Description>A rimfire calibre, much used in target shooting and often synonymous with the term smallbore.</Description>
<RelatedTerms>
<Term>
<Title>Shooting sport equipment</Title>
<Relationship>Narrower Term</Relationship>
</Term>
</RelatedTerms>
</Term>
<Term>
<Title>.22 Long Rifle</Title>
<Description>The standard .22 rimfire cartridge for target rifle and pistol use.</Description>
<RelatedTerms>
<Term>
<Title>Shooting sport equipment</Title>
<Relationship>Narrower Term</Relationship>
</Term>
</RelatedTerms>
</Term>
<Term>
<Title>.22 Short</Title>
<Description>Used as a target shooting round for timed fire pistol competitions.</Description>
<RelatedTerms>
<Term>
<Title>Shooting sport equipment</Title>
<Relationship>Narrower Term</Relationship>
</Term>
</RelatedTerms>
</Term>
</Terms>
When I look up the Title tag, I get every Title tag in the document. However, I want to separate the top-level Title tags from the Title tags nested inside RelatedTerms.
import json
from bs4 import BeautifulSoup

xml_file = open('xml.xml', encoding='UTF-8')
soup = BeautifulSoup(xml_file, 'lxml-xml', from_encoding='UTF-8')
# Only the Term elements directly under the root Terms element
Terms = soup.select('Terms > Term')
jsonObj = {"thesaurus": []}
for term in Terms:
    termDetail = {
        "Description": term.find('Description').text,
        "Title": term.find('Title').text
    }
    # Term elements nested inside this term's RelatedTerms
    RelatedTerms = term.select('RelatedTerms > Term')
    if RelatedTerms:
        termDetail["RelatedTerms"] = []
        for rterm in RelatedTerms:
            termDetail["RelatedTerms"].append({
                "Title": rterm.find('Title').text,
                "Relationship": rterm.find('Relationship').text
            })
    jsonObj["thesaurus"].append(termDetail)
print(json.dumps(jsonObj))
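OK, I have updated the code above and it mostly works. However, the line "Title": rterm.find('Title').text
raises the error
AttributeError: 'NoneType' object has no attribute 'text'
I don't know why, because there is text inside that tag.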
I would use parsel to extract your data. Your titles sit at two levels, under Terms and under RelatedTerms, so adjust your code accordingly:
from parsel import Selector

data = """[your code above here]"""
selector = Selector(data)

# Extract the titles directly under Terms > Term:
title_in_terms = selector.xpath(".//terms/term/title/text()").getall()
title_in_terms
['.177 (4.5mm) Airgun', '.22', '.22 Long Rifle', '.22 Short']

# Extract the titles nested inside RelatedTerms:
title_in_relationship_terms = selector.xpath(".//relatedterms/term/title/text()").getall()
title_in_relationship_terms
['Shooting sport equipment',
 'Shooting sport equipment',
 'Shooting sport equipment',
 'Shooting sport equipment']
I put together a working solution that only uses the packages you already have in your code. It looks like this:
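from bs4 import BeautifulSoup as bs
import lxml  # not called directly; the 'lxml-xml' parser needs it installed

xml_file = open('xml.xml', encoding='UTF-8')
soup = bs(xml_file, 'lxml-xml', from_encoding='UTF-8')
# First Term in the document (a top-level term)
term = soup.find_all('Term')[0]
# The first Title in document order is the term's own title
main_title = term.find_all('Title')[0]
# Restrict the second search to the RelatedTerms subtree
related_terms = term.find_all('RelatedTerms')[0]
embedded_title = related_terms.find_all('Title')[0]
print(main_title.string)
print(embedded_title.string)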
Output:
.177 (4.5mm) Airgun
Shooting sport equipment
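This code assumes that every tag has at least one of the specified child tags; indexing with [0] raises an IndexError otherwise. So if your XML file does not come with that guarantee, you have to check whether the list of tags you get back is empty.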
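The empty-list check mentioned above could look roughly like this. This is only a sketch: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages, the sample XML is a made-up variant of the question's data in which the second Term has no RelatedTerms, and first_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up sample: the second Term has no Description and no RelatedTerms.
xml_text = """<Terms>
  <Term>
    <Title>.22</Title>
    <Description>A rimfire calibre.</Description>
    <RelatedTerms>
      <Term>
        <Title>Shooting sport equipment</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
  <Term>
    <Title>.22 Short</Title>
  </Term>
</Terms>"""

root = ET.fromstring(xml_text)


def first_text(element, path, default=None):
    """Return the text of the first element matching path, or default."""
    matches = element.findall(path)
    if not matches:  # empty list: the tag is missing, so do not index into it
        return default
    return matches[0].text


rows = []
# "Term" matches only the direct children of the root Terms element.
for term in root.findall("Term"):
    rows.append({
        "main_title": first_text(term, "Title"),
        "embedded_title": first_text(term, "RelatedTerms/Term/Title"),
    })

print(rows)
```

For the second Term the embedded title comes back as None instead of raising, which is the behaviour you want when the file does not guarantee every child tag.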
Using only BeautifulSoup, and with xml_text holding the XML text from the question, this script:
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(xml_text, 'xml')
data = []
# Pair each top-level Title with the Description of the same Term
for title, description in zip(soup.select('Terms > Term > Title'), soup.select('Terms > Term > Description')):
    data.append({'Title': title.get_text(strip=True),
                 'Description': description.get_text(strip=True),
                 # (title, relationship) tuples for the terms nested in RelatedTerms
                 'Related Terms': [(rel_title.get_text(strip=True), rel.get_text(strip=True)) for rel_title, rel in zip(
                     title.find_parent('Term').select('RelatedTerms > Term > Title'),
                     title.find_parent('Term').select('RelatedTerms > Term > Relationship'))]})
df = pd.DataFrame(data)
print(df)
creates this pandas DataFrame:
Title Description Related Terms
0 .177 (4.5mm) Airgun The standard airgun calibre for international ... [(Shooting sport equipment, Narrower Term)]
1 .22 A rimfire calibre, much used in target shootin... [(Shooting sport equipment, Narrower Term)]
2 .22 Long Rifle The standard .22 rimfire cartridge for target ... [(Shooting sport equipment, Narrower Term)]
3 .22 Short Used as a target shooting round for timed fire... [(Shooting sport equipment, Narrower Term)]
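Since the question ultimately asks for JSON rather than a DataFrame, here is a minimal sketch of the same top-level versus nested split that builds the nested dictionary and serialises it. It swaps in the standard library's xml.etree.ElementTree rather than BeautifulSoup, uses a shortened copy of the question's XML, and term_to_dict is a hypothetical helper name. The "thesaurus" key is taken from the question's code.

```python
import json
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (two of the four terms).
xml_text = """<Terms>
  <Term>
    <Title>.177 (4.5mm) Airgun</Title>
    <Description>The standard airgun calibre for international target shooting.</Description>
    <RelatedTerms>
      <Term>
        <Title>Shooting sport equipment</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
  <Term>
    <Title>.22 Short</Title>
    <Description>Used as a target shooting round for timed fire pistol competitions.</Description>
    <RelatedTerms>
      <Term>
        <Title>Shooting sport equipment</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
</Terms>"""

root = ET.fromstring(xml_text)


def term_to_dict(term):
    """Convert one top-level Term element into a plain dictionary."""
    # findtext with a bare tag name looks at direct children only,
    # so the nested Title inside RelatedTerms is never picked up here.
    detail = {
        "Title": term.findtext("Title"),
        "Description": term.findtext("Description"),
    }
    related = [
        {
            "Title": rel.findtext("Title"),
            "Relationship": rel.findtext("Relationship"),
        }
        for rel in term.findall("RelatedTerms/Term")
    ]
    if related:  # only add the key when there is at least one related term
        detail["RelatedTerms"] = related
    return detail


thesaurus = {"thesaurus": [term_to_dict(t) for t in root.findall("Term")]}
json_text = json.dumps(thesaurus, indent=2)
print(json_text)
```

Because findtext returns None for a missing tag instead of raising, a Term without a Description would simply serialise as null.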