Problem parsing some elements with xml.etree.ElementTree

I hope you are doing well. I'm having some difficulties with my parser. My dataset looks like this:

<?xml version="1.0"?>

<bugrepository name="AspectJ">
  <bug id="28974" opendate="2003-1-3 10:28:00" fixdate="2003-1-14 14:30:00">
    <buginformation>
      <summary>"Compiler error when introducing a ""final"" field"</summary>
      <description>The aspecs the problem...</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/AjcMemberMaker.java</file>
    </fixedFiles>
  </bug>

  <bug id="28919" opendate="2002-12-30 16:40:00" fixdate="2003-1-14 15:06:00">
    <buginformation>
      <summary>waever tries to weave into native methods ...</summary>
      <description>If youat org.aspectj.ajdt.internal.core.burce</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/LazyMethodGen.java</file>
    </fixedFiles>
  </bug>
  
  <bug id="29186" opendate="2003-1-8 21:22:00" fixdate="2003-1-14 16:43:00">
    <buginformation>
      <summary>ajc -emacssym chokes on pointcut that includes an intertype method</summary>
      <description>This ;void Foo.ajc$before$Foo</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/Lint.java</file>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/Shadow.java</file>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/BcelWeaver.java</file>
    </fixedFiles>
  </bug>
  
  <bug id="29769" opendate="2003-1-19 11:42:00" fixdate="2003-1-24 21:17:00">
    <buginformation>
      <summary>Ajde does not support new AspectJ 1.1 compiler options</summary>
      <description>The org.aspectj.ajpiler. This enhancement is needed byort.</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/ajde/testdata/examples/figures-coverage/figures/Figure.java</file>
      <file>org.aspectj/modules/ajde/testsrc/org/aspectj/ajde/AjdeTests.java</file>
      <file>org.aspectj/modules/ajde/testsrc/org/aspectj/ajde/ui/StructureViewManagerTest.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/ajc/BuildArgParser.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/core/builder/AjBuildConfig.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/testsrc/org/aspectj/ajdt/ajc/BuildArgParserTestCase.java</file>
    </fixedFiles>
  </bug>
  <bug id="29959" opendate="2003-1-22 7:10:00" fixdate="2003-2-13 16:00:00">
    <buginformation>
      <summary>super call in intertype method declaration body causes VerifyError</summary>
      <description>AspectJ Compiler 1.1 showstopper</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/compiler/ast/InterTypeConstructorDeclaration.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/ast/SuperFixerVisitor.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/lookup/InterTypeMethodBinding.java</file>
      <file>org.aspectj/modules/tests/bugs/SuperToIntro.java</file>
    </fixedFiles>
  </bug>
</bugrepository>

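I would like to extract some elements of the dataset so that I can use them in a pandas DataFrame.

The first problem is getting all child elements of a tag as a list.

At the moment my code either retrieves only the first element and ignores the others, or retrieves all the elements but not in a structured form, as you can see in these pictures. Here, only empty lists ([]) without content:

Code: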

import pandas as pd 
from xml.etree.ElementTree import parse

document = parse('dataset.xml')
summary = []
description = []
fixedfile = []

for item in document.iterfind('bug'):
    summary.append(item.findtext('buginformation/summary'))
    description.append(item.findtext('buginformation/description'))
    fixedfile.append(item.findall('fixedFiles/file'))
    
#df = pd.DataFrame({'summary':summary, 'description':description, 'fixed_files':fixedfile})
df = pd.DataFrame({'fixed_files': fixedfile})
df

Here, only the first element is retrieved:

Code:

import pandas as pd 
from xml.etree.ElementTree import parse

document = parse('dataset.xml')
summary = []
description = []
fixedfile = []

for item in document.iterfind('bug'):
    summary.append(item.findtext('buginformation/summary'))
    description.append(item.findtext('buginformation/description'))
    fixedfile.append(item.findtext('fixedFiles/file'))
    
#df = pd.DataFrame({'summary':summary, 'description':description, 'fixed_files':fixedfile})
df = pd.DataFrame({'fixed_files': fixedfile})
df

I found a solution here that fits my case. It works, but not the way I want (a list of lists, one list per bug): it loads every element individually.

Code:

import xml.etree.ElementTree as ET
import pandas as pd 

xmldoc = ET.parse('dataset.xml')
root = xmldoc.getroot()
summary = []
description = []
fixedfile = []

for bug in xmldoc.iter(tag='bug'): 
    
    #for item in document.iterfind('bug'):
    #summary.append(item.findtext('buginformation/summary'))
    #description.append(item.findtext('buginformation/description'))
    
    for file in bug.iterfind('./fixedFiles/file'):
    
           fixedfile.append([file.text])
        
fixedfile
#df = pd.DataFrame({'summary':summary, 'description':description, 'fixed_files':fixedfile})
df = pd.DataFrame({'fixed_files': fixedfile})
df

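When I want to add the other DataFrame columns (summary, description), I get the following error message: ValueError: All arrays must be of the same length

The second problem is being able to select, for example, all the tags that have 2 or 3 child elements.

Regards,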

To keep the files in a list that stays associated with the description and summary, add them to a new list for each bug.

Try:

import pandas as pd
from xml.etree.ElementTree import parse

document = parse('dataset.xml')
summary = []
description = []
fixedfile = []

for item in document.iterfind('bug'):
    summary.append(item.findtext('buginformation/summary'))
    description.append(item.findtext('buginformation/description'))
    fixedfile.append([elt.text for elt in item.findall('fixedFiles/file')])

df = pd.DataFrame({'summary': summary,
                   'description': description,
                   'fixed_files': fixedfile})
df
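
As a rough illustration of why the ValueError in the question goes away, the sketch below compares the flat list (one entry per file) with the nested list (one entry per bug). The two-bug `SAMPLE` string and the variable names are made up for the example; they are not part of the original dataset.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Hypothetical two-bug sample, shortened from the dataset in the question.
SAMPLE = """<bugrepository name="AspectJ">
  <bug id="1">
    <buginformation><summary>first</summary></buginformation>
    <fixedFiles><file>A.java</file></fixedFiles>
  </bug>
  <bug id="2">
    <buginformation><summary>second</summary></buginformation>
    <fixedFiles><file>B.java</file><file>C.java</file></fixedFiles>
  </bug>
</bugrepository>"""

root = ET.fromstring(SAMPLE)
bugs = root.findall('bug')

summaries = [bug.findtext('buginformation/summary') for bug in bugs]

# One entry per file: this list is longer than the list of summaries.
flat_files = [[f.text] for bug in bugs for f in bug.iterfind('fixedFiles/file')]

# One entry per bug, each entry holding that bug's files.
nested_files = [[f.text for f in bug.iterfind('fixedFiles/file')] for bug in bugs]

print(len(summaries), len(flat_files), len(nested_files))

# Columns of different lengths cannot be combined into one DataFrame.
try:
    pd.DataFrame({'summary': summaries, 'fixed_files': flat_files})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

# With one list per bug, every column has the same length.
df_ok = pd.DataFrame({'summary': summaries, 'fixed_files': nested_files})
```

The nested version has exactly one row per bug, so it lines up with the summary and description columns.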

For the second part, this keeps only the bugs that have two or more files.

newdf = df[df.fixed_files.str.len() >= 2]

If you want the bugs that have exactly 2 or 3 files, then:

newdf = df[(df.fixed_files.str.len() == 2) | (df.fixed_files.str.len() == 3)]
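
The same filters can also be written by counting the files once and reusing the counts. This is only a sketch: the small DataFrame below is a hypothetical stand-in for the one built from the XML, and `n_files` is a name chosen for the example.

```python
import pandas as pd

# Hypothetical frame with one list of files per bug.
df = pd.DataFrame({
    'summary': ['a', 'b', 'c', 'd'],
    'fixed_files': [['A.java'],
                    ['B.java', 'C.java'],
                    ['D.java', 'E.java', 'F.java'],
                    ['G.java', 'H.java', 'I.java', 'J.java']],
})

# Count the files in each row once, then reuse the counts for each filter.
n_files = df['fixed_files'].map(len)

two_or_three = df[n_files.isin([2, 3])]   # exactly 2 or exactly 3 files
at_least_two = df[n_files >= 2]           # 2 or more files
```

Keeping the counts in a separate Series avoids recomputing the lengths for every condition.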

The following collects the data. The idea is to find all bug elements and iterate over them. For each bug, find the required child elements.

import xml.etree.ElementTree as ET
import pandas as pd

xml = '''<?xml version="1.0"?>

<bugrepository name="AspectJ">
  <bug id="28974" opendate="2003-1-3 10:28:00" fixdate="2003-1-14 14:30:00">
    <buginformation>
      <summary>"Compiler error when introducing a ""final"" field"</summary>
      <description>The aspecs the problem...</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/AjcMemberMaker.java</file>
    </fixedFiles>
  </bug>

  <bug id="28919" opendate="2002-12-30 16:40:00" fixdate="2003-1-14 15:06:00">
    <buginformation>
      <summary>waever tries to weave into native methods ...</summary>
      <description>If youat org.aspectj.ajdt.internal.core.burce</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/LazyMethodGen.java</file>
    </fixedFiles>
  </bug>
  
  <bug id="29186" opendate="2003-1-8 21:22:00" fixdate="2003-1-14 16:43:00">
    <buginformation>
      <summary>ajc -emacssym chokes on pointcut that includes an intertype method</summary>
      <description>This ;void Foo.ajc$before$Foo</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/Lint.java</file>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/Shadow.java</file>
      <file>org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/BcelWeaver.java</file>
    </fixedFiles>
  </bug>
  
  <bug id="29769" opendate="2003-1-19 11:42:00" fixdate="2003-1-24 21:17:00">
    <buginformation>
      <summary>Ajde does not support new AspectJ 1.1 compiler options</summary>
      <description>The org.aspectj.ajpiler. This enhancement is needed byort.</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/ajde/testdata/examples/figures-coverage/figures/Figure.java</file>
      <file>org.aspectj/modules/ajde/testsrc/org/aspectj/ajde/AjdeTests.java</file>
      <file>org.aspectj/modules/ajde/testsrc/org/aspectj/ajde/ui/StructureViewManagerTest.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/ajc/BuildArgParser.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/core/builder/AjBuildConfig.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/testsrc/org/aspectj/ajdt/ajc/BuildArgParserTestCase.java</file>
    </fixedFiles>
  </bug>
  <bug id="29959" opendate="2003-1-22 7:10:00" fixdate="2003-2-13 16:00:00">
    <buginformation>
      <summary>super call in intertype method declaration body causes VerifyError</summary>
      <description>AspectJ Compiler 1.1 showstopper</description>
    </buginformation>
    <fixedFiles>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/compiler/ast/InterTypeConstructorDeclaration.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/ast/SuperFixerVisitor.java</file>
      <file>org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/lookup/InterTypeMethodBinding.java</file>
      <file>org.aspectj/modules/tests/bugs/SuperToIntro.java</file>
    </fixedFiles>
  </bug>
  </bugrepository>'''

data = []
root = ET.fromstring(xml)
for bug in root.findall('.//bug'):
    bug_info = bug.find('buginformation')
    fixed_files = bug.find('fixedFiles')
    # Read summary and description from their own elements; collect the text of every <file> child.
    entry = {'summary': bug_info.find('summary').text,
             'description': bug_info.find('description').text,
             'fixedFiles': [x.text for x in fixed_files]}
    data.append(entry)
for entry in data:
    print(entry)
df = pd.DataFrame(data)

Output

{'summary': '"Compiler error when introducing a ""final"" field"', 'description': 'The aspecs the problem...', 'fixedFiles': ['org.aspectj/modules/weaver/src/org/aspectj/weaver/AjcMemberMaker.java']}
{'summary': 'waever tries to weave into native methods ...', 'description': 'If youat org.aspectj.ajdt.internal.core.burce', 'fixedFiles': ['org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/LazyMethodGen.java']}
{'summary': 'ajc -emacssym chokes on pointcut that includes an intertype method', 'description': 'This ;void Foo.ajc$before$Foo', 'fixedFiles': ['org.aspectj/modules/weaver/src/org/aspectj/weaver/Lint.java', 'org.aspectj/modules/weaver/src/org/aspectj/weaver/Shadow.java', 'org.aspectj/modules/weaver/src/org/aspectj/weaver/bcel/BcelWeaver.java']}
{'summary': 'Ajde does not support new AspectJ 1.1 compiler options', 'description': 'The org.aspectj.ajpiler. This enhancement is needed byort.', 'fixedFiles': ['org.aspectj/modules/ajde/testdata/examples/figures-coverage/figures/Figure.java', 'org.aspectj/modules/ajde/testsrc/org/aspectj/ajde/AjdeTests.java', 'org.aspectj/modules/ajde/testsrc/org/aspectj/ajde/ui/StructureViewManagerTest.java', 'org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/ajc/BuildArgParser.java', 'org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/core/builder/AjBuildConfig.java', 'org.aspectj/modules/org.aspectj.ajdt.core/testsrc/org/aspectj/ajdt/ajc/BuildArgParserTestCase.java']}
{'summary': 'super call in intertype method declaration body causes VerifyError', 'description': 'AspectJ Compiler 1.1 showstopper', 'fixedFiles': ['org.aspectj/modules/org.aspectj.ajdt.core/src/org/compiler/ast/InterTypeConstructorDeclaration.java', 'org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/ast/SuperFixerVisitor.java', 'org.aspectj/modules/org.aspectj.ajdt.core/src/org/aspectj/ajdt/internal/compiler/lookup/InterTypeMethodBinding.java', 'org.aspectj/modules/tests/bugs/SuperToIntro.java']}
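
If some bugs in the real file might lack a child element, the per-bug lookup could be made more tolerant. The sketch below is one possible variant, not the code above: `bug_to_entry` and the shortened `SAMPLE` string are hypothetical names introduced for illustration, and it assumes an empty string is an acceptable placeholder for missing text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second bug has no <description> and no <fixedFiles>.
SAMPLE = """<bugrepository>
  <bug id="1">
    <buginformation>
      <summary>s1</summary>
      <description>d1</description>
    </buginformation>
    <fixedFiles><file>A.java</file><file>B.java</file></fixedFiles>
  </bug>
  <bug id="2">
    <buginformation><summary>s2</summary></buginformation>
  </bug>
</bugrepository>"""


def bug_to_entry(bug):
    """Turn one <bug> element into a dict, tolerating missing children."""
    return {
        'id': bug.get('id'),
        # findtext returns the default instead of failing when the path is absent.
        'summary': bug.findtext('buginformation/summary', default=''),
        'description': bug.findtext('buginformation/description', default=''),
        # findall returns an empty list when there is no matching <file>.
        'fixedFiles': [f.text for f in bug.findall('fixedFiles/file')],
    }


entries = [bug_to_entry(bug) for bug in ET.fromstring(SAMPLE).findall('bug')]
for entry in entries:
    print(entry)
```

Passing `entries` to `pd.DataFrame` would give one row per bug, as in the code above.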