HTML parser on an XML page: getting a dictionary of data
I have an XML page with information like this:
<currency xmlns:xxsi>
<Observation>
<Currency_name>U.S. dollar </Currency_name>
<Observation_ISO4217>USD</Observation_ISO4217>
<Observation_date>2015-03-09</Observation_date>
<Observation_data>1.2598</Observation_data>
<Observation_data_reciprocal>0.7938</Observation_data_reciprocal>
</Observation>
<Observation>
<Currency_name>U.S. dollar </Currency_name>
<Observation_ISO4217>USD</Observation_ISO4217>
<Observation_date>2015-03-11</Observation_date>
<Observation_data>1.2764</Observation_data>
<Observation_data_reciprocal>0.7835</Observation_data_reciprocal>
</Observation>
<Observation>
<Currency_name>Argentine peso</Currency_name>
<Observation_ISO4217>ARS</Observation_ISO4217>
<Observation_date>2015-03-09</Observation_date>
<Observation_data>0.1438</Observation_data>
<Observation_data_reciprocal>6.9541</Observation_data_reciprocal>
</Observation>
<Observation>
<Currency_name>Argentine peso</Currency_name>
<Observation_ISO4217>ARS</Observation_ISO4217>
<Observation_date>2015-03-10</Observation_date>
<Observation_data>0.1440</Observation_data>
<Observation_data_reciprocal>6.9444</Observation_data_reciprocal>
</Observation>
</currency>
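I want a way to process the data so that I can pull information out of it, for example to compare two dates for the same currency, or to compare the currencies of two different countries. The problem I am running into is putting that information into a dictionary as a good way to store it.
I am currently using the code below, but it does not work properly because there are multiple entries for the same country. The actual page has five (5) entries for each country (57 in total).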
from html.parser import HTMLParser

class myHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.country = []
        self.data = []
        self.dic = {}
        self.nameFlag = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, hence 'currency_name'
        if tag == 'currency_name':
            self.nameFlag = True
        else:
            self.nameFlag = False

    def handle_endtag(self, tag):
        pass

    def handle_data(self, data):
        # Each repeated currency name overwrites its earlier entry with an empty list
        if data.strip() != '' and self.nameFlag == True:
            self.dic[data.strip()] = []
Can anyone help me find a good way to store the data for multiple countries?
Assuming there are no nested elements in your markup language, you can start with a simple parser like this:
from html.parser import HTMLParser
from pprint import pprint

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.content = []          # one dict per <Observation>
        self.observation = False   # True while inside an <Observation>
        self.element = None        # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so compare against 'observation'
        if tag == 'observation':
            self.content.append({})
            self.observation = True
        elif self.observation:
            self.element = tag
            self.content[-1][self.element] = ""

    def handle_endtag(self, tag):
        if tag == 'observation':
            self.observation = False
        # Stop collecting text once any element closes
        self.element = None

    def handle_data(self, data):
        if self.element:
            self.content[-1][self.element] += data

with open("data.someml", "rt") as infile:
    parser = MyHTMLParser()
    parser.feed(infile.read())
    pprint(parser.content)
Given your input file, this produces:
[{'currency_name': 'U.S. dollar ',
'observation_data': '1.2598',
'observation_data_reciprocal': '0.7938',
'observation_date': '2015-03-09',
'observation_iso4217': 'USD'},
{'currency_name': 'U.S. dollar ',
'observation_data': '1.2764',
'observation_data_reciprocal': '0.7835',
'observation_date': '2015-03-11',
'observation_iso4217': 'USD'},
{'currency_name': 'Argentine peso',
'observation_data': '0.1438',
'observation_data_reciprocal': '6.9541',
'observation_date': '2015-03-09',
'observation_iso4217': 'ARS'},
{'currency_name': 'Argentine peso',
'observation_data': '0.1440',
'observation_data_reciprocal': '6.9444',
'observation_date': '2015-03-10',
'observation_iso4217': 'ARS'}]
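The key idea here is to create a new record (as a dictionary) each time an observation start tag is encountered. Given the assumption explained above, any other start tag introduces a data field.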
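To get from that flat list of records to something that supports the comparisons the question asks about, one option is to index the records by currency code and then by date. The following is only a sketch: the helper name `index_by_currency` is made up for illustration, the sample records are copied from the question's data, and it assumes each currency has at most one observation per date.

```python
from collections import defaultdict

# Sample records in the shape produced by the parser above (lowercased keys).
# The values are copied from the question's data.
records = [
    {'currency_name': 'U.S. dollar ', 'observation_iso4217': 'USD',
     'observation_date': '2015-03-09', 'observation_data': '1.2598'},
    {'currency_name': 'U.S. dollar ', 'observation_iso4217': 'USD',
     'observation_date': '2015-03-11', 'observation_data': '1.2764'},
    {'currency_name': 'Argentine peso', 'observation_iso4217': 'ARS',
     'observation_date': '2015-03-09', 'observation_data': '0.1438'},
    {'currency_name': 'Argentine peso', 'observation_iso4217': 'ARS',
     'observation_date': '2015-03-10', 'observation_data': '0.1440'},
]

def index_by_currency(records):
    """Hypothetical helper: return {iso_code: {date: rate}} from flat records."""
    rates = defaultdict(dict)
    for rec in records:
        code = rec['observation_iso4217'].strip()
        date = rec['observation_date'].strip()
        # Assumption: one observation per (currency, date); later ones overwrite.
        rates[code][date] = float(rec['observation_data'])
    return dict(rates)

rates = index_by_currency(records)

# Compare two dates for the same currency.
usd_change = rates['USD']['2015-03-11'] - rates['USD']['2015-03-09']
print(round(usd_change, 4))
```

Keying on the ISO 4217 code rather than the currency name sidesteps the trailing space in 'U.S. dollar '.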
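If you don't care how the XML gets parsed, I suggest you use Martin Blech's xmltodict module.
Since your file lacks a single document element, you need to coax it into cooperating with something like this: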
import xmltodict

with open('input.txt') as f:
    data = f.read()

# Wrap the content in a single root element so it parses as one document
d = xmltodict.parse("<root>" + data + "</root>")
d = d['root']
You can then access the XML structure with something like this:
print(d['Observation'][0]['Currency_name']) # U.S. dollar
print(d['Observation'][0]['Observation_date']) # 2015-03-09
Or, iterate over all the observations:
for obs in d['Observation']:
    print(obs['Currency_name'])
    print(obs['Observation_date'])
    print(obs['Observation_data'])
    print('---')
Output:
U.S. dollar
2015-03-09
1.2598
---
U.S. dollar
2015-03-11
1.2764
---
Argentine peso
2015-03-09
0.1438
---
Argentine peso
2015-03-10
0.1440
---
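The same list-of-dictionaries shape also covers the other case from the question, comparing two different currencies on a common date. This is a rough sketch rather than part of xmltodict: `rate_on` is a hypothetical helper, and `observations` is a hand-written stand-in for `d['Observation']` using values from the question, so the snippet runs without the input file.

```python
# Stand-in for d['Observation']: xmltodict keeps the original tag casing.
observations = [
    {'Currency_name': 'U.S. dollar', 'Observation_ISO4217': 'USD',
     'Observation_date': '2015-03-09', 'Observation_data': '1.2598'},
    {'Currency_name': 'U.S. dollar', 'Observation_ISO4217': 'USD',
     'Observation_date': '2015-03-11', 'Observation_data': '1.2764'},
    {'Currency_name': 'Argentine peso', 'Observation_ISO4217': 'ARS',
     'Observation_date': '2015-03-09', 'Observation_data': '0.1438'},
    {'Currency_name': 'Argentine peso', 'Observation_ISO4217': 'ARS',
     'Observation_date': '2015-03-10', 'Observation_data': '0.1440'},
]

def rate_on(observations, iso_code, date):
    """Hypothetical helper: rate for a currency on a date, or None if absent."""
    for obs in observations:
        if (obs['Observation_ISO4217'] == iso_code
                and obs['Observation_date'] == date):
            return float(obs['Observation_data'])
    return None

usd = rate_on(observations, 'USD', '2015-03-09')
ars = rate_on(observations, 'ARS', '2015-03-09')

# Not every currency has an observation on every date, so check for None first.
if usd is not None and ars is not None:
    print(f"USD/ARS ratio on 2015-03-09: {usd / ars:.4f}")
```

Returning None for a missing date matters with this data, since the sample has USD on 2015-03-11 but ARS on 2015-03-10.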