Parsing XML with an unequal number of tags to make lists of equal length: openpyxl and Beautiful Soup

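I have an XML file containing book records with tags such as author, publication date, and label. I want to parse this file into three lists: one with the book titles, one with the authors, and one with the labels. Later I will write these lists to Excel columns using openpyxl. The problem is that some book records have no label. The usual parsing technique with Beautiful Soup produces the first two lists with the same length, but the label list comes out shorter.

I have three questions:

1- How can I create all three lists with equal length (with an empty entry for books that have no label)?

2- The label list looks like this: ['Energy;Green Buildings;High Performance Buildings', 'Computing', 'Computing;Design;Green Buildings', ...]. I created another 15 columns whose headers are the label names I have, such as "Computing" and "Design". If a book has a particular label, for example if the book in row 5 titled "Architecture" has the "Design" label, I need an X mark or a colored cell in the cell (row '5', column 'Design').

3- Is there an easier way to do this (parse the XML file and write to Excel efficiently)?

Here is a snapshot of the XML file and the code I wrote (both the XML file and the Python file can be downloaded from here: http://www.ranialabib.com/#!python/icfwa).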

<?xml version="1.0" encoding="UTF-8"?>
<xml>
    <records>
        <record>
            <database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database>
            <ref-type name="Book">1</ref-type>
            <contributors>
                <authors>
                    <author>AIA Research Corporation</author>
                </authors>
            </contributors>
            <titles>
                <title>Regional guidelines for building passive energy conserving homes</title>
            </titles>
            <periodical/>
            <keywords/>
            <dates>
                <year>1978</year>
            </dates>
            <publisher>Dept. of Housing and Urban Development, Office of Policy Development and Research : for sale by the Supt. of Docs., U.S. Govt. Print. Off.</publisher>
            <urls/>
            <label>Energy;Green Buildings;High Performance Buildings</label>
        </record>
        <record>
            <database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database>
            <ref-type name="Book">1</ref-type>
            <contributors>
                <authors>
                    <author>Akinci, Burcu</author>
                    <author>Ph, D</author>
                </authors>
            </contributors>
            <titles>
                <title>Computing in Civil Engineering</title>
            </titles>
            <periodical/>
            <pages>692-699</pages>
            <keywords/>
            <dates>
                <year>2007</year>
            </dates>
            <publisher>American Society of Civil Engineers</publisher>
            <isbn>9780784409374</isbn>
            <electronic-resource-num>ISBN 978-0-7844-1302-9</electronic-resource-num>
            <urls>
                <web-urls>
                    <url>http://books.google.com/books?id=QigBgc-qgdoC</url>
                </web-urls>
            </urls>
        </record>
        <!-- remaining records omitted -->
    </records>
</xml>



import xml.etree.ElementTree as ET
from openpyxl import Workbook

# Read the whole file and parse it from the string.
with open('My_Collection.xml') as fhand:
    data = fhand.read()

Title = list()
Year = list()
Label = list()

tree = ET.fromstring(data)
# Each findall() searches the whole tree, so a record without a <label>
# contributes nothing to `labels`.
titles = tree.findall('.//title')
years = tree.findall('.//year')
labels = tree.findall('.//label')

for t in titles:
    Title.append(str(t.text))
print('Titles:', len(Title))
print(Title)

for y in years:
    Year.append(str(y.text))
print('Years:', len(Year))
print(Year)

for l in labels:
    Label.append(str(l.text))
print('Labels:', len(Label))
print(Label)

wb = Workbook()
ws = wb.active

# zip() stops at the shortest list, so the rows end at len(Label).
for row in zip(Title, Year, Label):
    ws.append(row)

wb.save("Test2.xlsx")

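This is the code I wrote based on Charlie's suggestion, and it does not work. I get the error "TypeError: 'NoneType' object is not iterable", and I am not sure what the problem is. Also, how do I get the text of all three tags (title, year, label) for each record into one list, and how easy is it to write that many lists (200 lists for 200 books) to Excel using openpyxl?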

import xml.etree.ElementTree as ET
fhand = open('My_Collection.xml')
data = fhand.read()
tree = ET.fromstring(data)
label_lst = list()
for record in tree.find("records/record"):
    label = record.find("label")

# The TypeError is raised on the next line.
for l in label:
    if label is not None:
        label = label_lst.append(label.text)
    else:
        label = label_lst.append(' ')
print(label_lst)
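
The reported error is consistent with how ElementTree behaves: `find` returns only the first matching element (or `None` when nothing matches), so looping over its result walks the child tags of a single record instead of the records themselves. A small sketch of the difference, written in Python 3 syntax and using a made-up two-record sample rather than the real file:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the file above; the second record has no <label>.
SAMPLE = """<xml><records>
<record><titles><title>Book A</title></titles><label>Energy;Design</label></record>
<record><titles><title>Book B</title></titles></record>
</records></xml>"""

root = ET.fromstring(SAMPLE)

# find() returns the first matching element only; iterating it yields its children.
first_record = root.find("records/record")
child_tags = [child.tag for child in first_record]

# findall() returns a list with one element per matching record.
records = root.findall("records/record")

# find() on a record without <label> returns None, which cannot be iterated.
missing = records[1].find("label")
error_text = ""
try:
    for _ in missing:
        pass
except TypeError as exc:
    error_text = str(exc)

print(child_tags)
print(len(records))
print(error_text)
```

Under these assumptions, `findall("records/record")` is the call that yields one item per book.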

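If you want to preserve the record structure, you should parse record by record rather than just building lists of attributes. You can loop over the records and extract the relevant fields, for example:

for record in parsed_xml.findall("records/record"):
    label = record.find("label")
    if label is not None:
        label = label.text

You can then write the rows directly to Excel without zipping columns.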
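
A minimal sketch of that record-by-record approach, in Python 3 syntax. The sample XML and the helper name `record_to_row` are made up for illustration; the tag paths follow the snapshot in the question. `findtext` returns the given default when a tag is absent, so every row has the same three fields:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the snapshot; the second record has no <label>.
SAMPLE = """<xml><records>
<record>
  <titles><title>Regional guidelines</title></titles>
  <dates><year>1978</year></dates>
  <label>Energy;Green Buildings</label>
</record>
<record>
  <titles><title>Computing in Civil Engineering</title></titles>
  <dates><year>2007</year></dates>
</record>
</records></xml>"""


def record_to_row(record):
    """Return [title, year, label] for one <record>; missing tags become ''."""
    title = record.findtext(".//title", default="")
    year = record.findtext(".//year", default="")
    label = record.findtext("label", default="")
    return [title, year, label]


root = ET.fromstring(SAMPLE)
# One row per record, so titles, years, and labels can never drift apart.
rows = [record_to_row(record) for record in root.findall("records/record")]

for row in rows:
    print(row)
```

Each row could then be passed to `ws.append(row)`, as in the question's openpyxl code, so 200 books simply become 200 appended rows.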

I just figured it out. I am still using columns, though.

from openpyxl import Workbook
import xml.etree.ElementTree as ET

# Parse the file once; the root element is <xml> and its child is <records>.
tree = ET.parse('My_Collection.xml')
root = tree.getroot()

# Each list starts with its column header.
title_list = ['Title']
year_list = ['Year']
author_list = ['Author']
label_list = ['Label']


def text_or_default(record, path, default='N'):
    # Return the text of the first match, or a placeholder if the tag is missing.
    element = record.find(path)
    return default if element is None else element.text


# Visit every <record> once so all four lists stay the same length.
for records in root:
    for record in records:
        title_list.append(text_or_default(record, './/title'))
        year_list.append(text_or_default(record, './/year'))
        author_list.append(text_or_default(record, './/author'))
        label_list.append(text_or_default(record, 'label'))

for column in (title_list, year_list, author_list, label_list):
    print(column)
    print(len(column))

# for item in label_list:  (loop body not written yet)

wb = Workbook()
ws = wb.active

# zip() turns the four columns back into rows; the header row comes first.
for row in zip(title_list, year_list, author_list, label_list):
    ws.append(row)

wb.save("Test3.xlsx")
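
The second question (one column per label name, with an X where a book has that label) is not covered by the code above. One possible sketch, in Python 3 syntax: split each label string on ';', collect the distinct names as column headers, and mark matches. The sample data and the helper names `split_labels` and `build_label_table` are assumptions for illustration, not part of the original code.

```python
# Hypothetical (title, label string) pairs; '' means the book has no label.
books = [
    ("Regional guidelines", "Energy;Green Buildings;High Performance Buildings"),
    ("Computing in Civil Engineering", ""),
    ("Architecture", "Computing;Design;Green Buildings"),
]


def split_labels(label_text):
    """Split 'A;B;C' into a set of label names, ignoring blanks."""
    return {part.strip() for part in label_text.split(";") if part.strip()}


def build_label_table(books):
    """Return a header row plus one row per book with 'X' under each label it has."""
    label_sets = [split_labels(label_text) for _, label_text in books]
    # Sorted so the column order is the same on every run.
    all_labels = sorted(set().union(*label_sets))
    table = [["Title"] + all_labels]
    for (title, _), labels in zip(books, label_sets):
        marks = ["X" if name in labels else "" for name in all_labels]
        table.append([title] + marks)
    return table


table = build_label_table(books)
for row in table:
    print(row)
```

Each row of `table` could then be written with `ws.append(row)` as in the code above. A book with no label gets a row of empty cells, so the sheet stays rectangular.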