Parsing XML with an unequal number of tags to make lists of equal length (openpyxl and Beautiful Soup)
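I have an XML file that contains book records with tags such as author, publication date, and label. I want to parse this file to build three lists: one with the book titles, one with the authors, and one with the labels. Later I will write these lists to Excel columns using openpyxl.
The problem is that some book records have no <label> tag. The usual parsing approach with Beautiful Soup produces the first two lists with the same length, but the label list comes out shorter.
I have three questions:
1- How can I create all three lists with equal length (with an empty entry for books that have no label)?
2- The label list looks like this: ['Energy;Green Buildings;High Performance Buildings', 'Computing', 'Computing;Design;Green Buildings', ...]. I created 15 additional columns whose headers are the label names I have, such as "Computing" and "Design". If a book carries a particular label, for example if the book titled "Architecture" in row 5 has the "Design" label, I need an X mark or a colored cell in the cell (row 5, column "Design").
3- Is there an easier way to do all of this (parsing the XML file and writing to Excel efficiently)?
Here is a snapshot of the XML file and the code I wrote (both the XML file and the Python file can be downloaded from here: http://www.ranialabib.com/#!python/icfwa)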
<?xml version="1.0" encoding="UTF-8"?>
<xml>
<records>
<record>
<database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database>
<ref-type name="Book">1</ref-type>
<contributors>
<authors>
<author>AIA Research Corporation</author>
</authors>
</contributors>
<titles>
<title>Regional guidelines for building passive energy conserving homes</title>
</titles>
<periodical/>
<keywords/>
<dates>
<year>1978</year>
</dates>
<publisher>Dept. of Housing and Urban Development, Office of Policy Development and Research : for sale by the Supt. of Docs., U.S. Govt. Print. Off.</publisher>
<urls/>
<label>Energy;Green Buildings;High Performance Buildings</label>
</record>
<record>
<database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database>
<ref-type name="Book">1</ref-type>
<contributors>
<authors>
<author>Akinci, Burcu</author>
<author>Ph, D</author>
</authors>
</contributors>
<titles>
<title>Computing in Civil Engineering</title>
</titles>
<periodical/>
<pages>692-699</pages>
<keywords/>
<dates>
<year>2007</year>
</dates>
<publisher>American Society of Civil Engineers</publisher>
<isbn>9780784409374</isbn>
<electronic-resource-num>ISBN 978-0-7844-1302-9</electronic-resource-num>
<urls>
<web-urls>
<url>http://books.google.com/books?id=QigBgc-qgdoC</url>
</web-urls>
</urls>
</record>
import xml.etree.ElementTree as ET

fhand = open('My_Collection.xml')
data = fhand.read()

Title = list()
Year = list()
Label = list()

tree = ET.fromstring(data)
# Each findall() searches the whole document, so a record without
# a <label> contributes nothing to `labels`.
titles = tree.findall('.//title')
years = tree.findall('.//year')
labels = tree.findall('.//label')

for t in titles:
    Title.append(str(t.text))
print('Titles: %d' % len(Title))
print(Title)

for y in years:
    Year.append(str(y.text))
print('years: %d' % len(Year))
print(Year)

for l in labels:
    Label.append(str(l.text))
print('Labels: %d' % len(Label))
print(Label)

from openpyxl import Workbook
wb = Workbook()
ws = wb.active
# zip() stops at the shortest list, so the shorter Label list
# ends up misaligned with the titles and years.
for row in zip(Title, Year, Label):
    ws.append(row)
wb.save("Test2.xlsx")
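This is the code I wrote based on Charlie's suggestion; it does not work. I get the error message "TypeError: 'NoneType' object is not iterable", and I am not sure what the problem is.
Also, how can I get the text of all three tags (title, year, label) for each record into one list, and how easy is it to write that many lists (200 lists for 200 books) to Excel using openpyxl?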
import xml.etree.ElementTree as ET

fhand = open('My_Collection.xml')
data = fhand.read()
Label_lst=list()
for record in tree.find("records/record") :
    label = record.find("label")
    for l in label:
        if label is not None: label = label_lst.append(label.text)
        else:
            label = label_lst.append(' ')
print(label_lst)
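If you want to preserve the record structure, you should parse record by record rather than just building lists of attributes. You can loop over the records and extract the relevant fields, for example:
# findall() returns every <record>; find() returns only the first match,
# and iterating over that single element walks its children instead.
for record in parsed_xml.findall("records/record"):
    label = record.find("label")
    if label is not None:
        label = label.text
You can then write the rows directly to Excel without having to zip the columns.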
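As an illustration of the record-by-record approach, here is a minimal Python 3 sketch. The helper names `record_rows` and `write_rows`, the `SAMPLE` string (a trimmed copy of the two records from the question), and the choice of an empty string for a missing tag are all assumptions made for this example, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the two records from the question (assumed sample data).
SAMPLE = """<xml><records>
<record>
  <contributors><authors><author>AIA Research Corporation</author></authors></contributors>
  <titles><title>Regional guidelines for building passive energy conserving homes</title></titles>
  <dates><year>1978</year></dates>
  <label>Energy;Green Buildings;High Performance Buildings</label>
</record>
<record>
  <contributors><authors><author>Akinci, Burcu</author><author>Ph, D</author></authors></contributors>
  <titles><title>Computing in Civil Engineering</title></titles>
  <dates><year>2007</year></dates>
</record>
</records></xml>"""


def record_rows(xml_text):
    """Return one [title, year, authors, label] row per <record>."""
    root = ET.fromstring(xml_text)  # root is the <xml> element
    rows = []
    for record in root.findall("records/record"):
        authors = [a.text or "" for a in record.findall(".//author")]
        rows.append([
            record.findtext(".//title", default=""),
            record.findtext(".//year", default=""),
            "; ".join(authors),
            record.findtext("label", default=""),  # "" when <label> is missing
        ])
    return rows


def write_rows(rows, path):
    """Write a header plus one row per record to an .xlsx file (needs openpyxl)."""
    from openpyxl import Workbook
    wb = Workbook()
    ws = wb.active
    ws.append(["Title", "Year", "Author", "Label"])
    for row in rows:
        ws.append(row)
    wb.save(path)


rows = record_rows(SAMPLE)
```

Because every record yields exactly one row, a missing <label> becomes an empty cell and the columns stay aligned. Calling `write_rows(rows, "Test2.xlsx")` would then write the sheet, assuming openpyxl is installed.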
I just figured it out. I am still using columns, though.
from openpyxl import Workbook
import xml.etree.ElementTree as ET

# Parse the file once; the root element is <xml>.
tree = ET.parse('My_Collection.xml')
root = tree.getroot()

# The first entry of each list is the column header.
title_list = ['Title']
year_list = ['Year']
author_list = ['Author']
label_list = ['Label']

# The root's children are <records>; each of their children is one <record>.
for records in root:
    for record in records:
        title = record.find('.//title')
        year = record.find('.//year')
        author = record.find('.//author')  # first author only
        label = record.find('label')
        # 'N' marks a missing tag, so all four lists stay the same length.
        title_list.append('N' if title is None else title.text)
        year_list.append('N' if year is None else year.text)
        author_list.append('N' if author is None else author.text)
        label_list.append('N' if label is None else label.text)

print(title_list)
print(len(title_list))
print(year_list)
print(len(year_list))
print(author_list)
print(len(author_list))
print(label_list)
print(len(label_list))

wb = Workbook()
ws = wb.active
for row in zip(title_list, year_list, author_list, label_list):
    ws.append(row)
wb.save("Test3.xlsx")
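The second question (an X mark in a column for each label a book carries) is not covered by the code in this thread. One possible approach, sketched here in Python 3, is to split each label string on ";" and build a row of marks per book. The names `label_matrix` and `write_with_marks` are hypothetical, and deriving the column headers from the data rather than from a fixed list of 15 names is an assumption of this sketch.

```python
def label_matrix(labels, sep=";"):
    """Return (sorted tag names, one row of 'X'/'' marks per label string)."""
    # Split every label string into its individual tags, dropping blanks.
    split = [[t.strip() for t in s.split(sep) if t.strip()] for s in labels]
    # Column headers: every distinct tag seen in the data, in sorted order.
    tags = sorted({t for parts in split for t in parts})
    marks = [["X" if tag in parts else "" for tag in tags] for parts in split]
    return tags, marks


def write_with_marks(titles, labels, path):
    """Write Title, Label and one mark column per tag (needs openpyxl)."""
    from openpyxl import Workbook
    tags, marks = label_matrix(labels)
    wb = Workbook()
    ws = wb.active
    ws.append(["Title", "Label"] + tags)
    for title, label, mark_row in zip(titles, labels, marks):
        ws.append([title, label] + mark_row)
    wb.save(path)


# Label strings as in the question; "" stands for a book without a label.
labels = [
    "Energy;Green Buildings;High Performance Buildings",
    "",
    "Computing;Design;Green Buildings",
]
tags, marks = label_matrix(labels)
```

A book without a label gets a row of empty cells, so the mark columns line up with the title column. If the headers must follow a fixed order, the sorted set could be replaced by an explicit list of tag names.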