How to get the correct text element in an XML file

Question

I have an XML child element that looks like this:

<RollCallVote.Description.Text>
Agence européenne des médicaments - European Medicines Agency - Europäische Arzneimittel-Agentur -
<a href="#reds:iPlRp/A-9-2021-0216" data-rel="reds" redmap-uri="/reds:iPlRp/A-9-2021-0216">A9-0216/2021</a>
- Nicolás González Casares - Accord provisoire - Am 156
</RollCallVote.Description.Text>

I am trying to get two pieces of it: the text of the redmap-uri link, A9-0216/2021, and the text that follows it, Nicolás González Casares - Accord provisoire - Am 156, ideally in two pandas dataframe columns.

Unfortunately, with

for adescription in avote.iter('RollCallVote.Description.Text'):
        description = adescription.get('a')

the resulting dataframe column only gives me the redmap-uri, not its text. Changing it to description = adescription.get('a').text does not work either, because I get the error AttributeError: 'str' object has no attribute 'text'

If I use

for adescription in avote.iter('RollCallVote.Description.Text'):
        description = adescription.text

only the beginning of the text, i.e. Agence européenne des médicaments - European Medicines Agency - Europäische Arzneimittel-Agentur -, ends up in the result, and nothing else.

Can anyone help with this?

Answer 1

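If you just want to extract some text from a column of well-formatted XML (always the same structure), you can also try to solve it with a regular expression instead of an XML parser: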

import pandas as pd

df = pd.DataFrame({"text": """<RollCallVote.Description.Text>
Agence européenne des médicaments - European Medicines Agency - Europäische Arzneimittel-Agentur -
<a href="#reds:iPlRp/A-9-2021-0216" data-rel="reds" redmap-uri="/reds:iPlRp/A-9-2021-0216">A9-0216/2021</a>
- Nicolás González Casares - Accord provisoire - Am 156
</RollCallVote.Description.Text>
"""}, index=[0])

df["text"].str.extract(r'(.*?)\n<a .*>(.*?)<\/a>\n(.*?)\n<\/RollCallVote\.Description\.Text>', expand=True)

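I simply took the text and replaced the interesting parts with (.*?) capturing groups. Here . matches any character except a newline and * means any number of repetitions, so .* matches any run of characters on a line. The ? after .* makes the match non-greedy, so the group captures as little as possible; you could leave it out for this input, but I wanted to make sure a group does not capture everything. The parentheses () form a capturing group, and .str.extract turns each group into a column. I also added the line breaks (\n) and escaped the dots with a backslash (\.), which tells the regex that the dot is a literal dot rather than any character.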
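To see what each capturing group picks up, the same pattern can be tried on a single string with the standard re module. This is only a sketch: the names sample, pattern, prefix, link_text and trailing_text are illustrative and not part of the original answer, and it assumes the input always has exactly this line layout.

```python
import re

# Sample taken from the question; all variable names here are illustrative.
sample = (
    "<RollCallVote.Description.Text>\n"
    "Agence européenne des médicaments - European Medicines Agency - "
    "Europäische Arzneimittel-Agentur -\n"
    '<a href="#reds:iPlRp/A-9-2021-0216" data-rel="reds" '
    'redmap-uri="/reds:iPlRp/A-9-2021-0216">A9-0216/2021</a>\n'
    "- Nicolás González Casares - Accord provisoire - Am 156\n"
    "</RollCallVote.Description.Text>\n"
)

# Three non-greedy groups: the line before <a>, the link text, the line after </a>.
# The dots in the closing tag name are escaped so they match literal dots.
pattern = re.compile(
    r"(.*?)\n<a .*>(.*?)</a>\n(.*?)\n</RollCallVote\.Description\.Text>"
)

match = pattern.search(sample)
prefix, link_text, trailing_text = match.groups()

print(link_text)
print(trailing_text)
```

With .str.extract, these three groups become columns 0, 1 and 2 of the resulting dataframe. Because . does not match a newline, each group stays within one line, so the pattern stops matching if the layout of the line breaks changes.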

Answer 2

With ElementTree, you can extract the text of interest as follows:

import xml.etree.ElementTree as ET
from collections import defaultdict
import pandas as pd

content = '''<root>
<RollCallVote.Description.Text>
Agence européenne des médicaments - European Medicines Agency - Europäische Arzneimittel-Agentur -
<a href="#reds:iPlRp/A-9-2021-0216" data-rel="reds" redmap-uri="/reds:iPlRp/A-9-2021-0216">A9-0216/2021</a>
- Nicolás González Casares - Accord provisoire - Am 156
</RollCallVote.Description.Text>
<RollCallVote.Description.Text>
Agence européenne des médicaments - European Medicines Agency - Europäische Arzneimittel-Agentur -
<a href="#reds:iPlRp/A-9-2021-0216" data-rel="reds" redmap-uri="/reds:iPlRp/A-9-2021-0216">A9-0217/2021</a>
- Nicolás González Casares - Accord provisoire - Am 157
</RollCallVote.Description.Text>
<RollCallVote.Description.Text>
Agence européenne des médicaments - European Medicines Agency - Europäische Arzneimittel-Agentur -
<a href="#reds:iPlRp/A-9-2021-0216" data-rel="reds" redmap-uri="/reds:iPlRp/A-9-2021-0216">A9-0218/2021</a>
- Nicolás González Casares - Accord provisoire - Am 158
</RollCallVote.Description.Text>
</root>'''

# create a dict to store temporary data
dct = defaultdict(list)

# parse XML
root = ET.fromstring(content)

# find elements with the given name
elements = root.findall('RollCallVote.Description.Text')

# iterate over elements found
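for element in elements:
    # search for an 'a' child element that has a 'redmap-uri' attribute;
    # if there is only one 'a' child element, the predicate can be omitted
    link = element.find('a[@redmap-uri]')
    # compare with None: an element without children is falsy, so 'if link:' would skip it
    if link is not None:
        dct['link'].append(link.text)
        # tail is the text after </a>; strip the surrounding line breaks
        dct['text'].append((link.tail or '').strip())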

# construct pandas dataframe from a dict
df = pd.DataFrame(dct)

df
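The reason the attempts in the question fall short is how ElementTree splits mixed content: an element's text holds only what comes before its first child, the child's text holds what is inside the child, and the child's tail holds what follows the child's closing tag. A minimal sketch on a shortened, single-line version of the sample (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

# Shortened, single-line version of the element from the question.
snippet = (
    "<RollCallVote.Description.Text>"
    "Agence européenne des médicaments - "
    '<a href="#reds:iPlRp/A-9-2021-0216" '
    'redmap-uri="/reds:iPlRp/A-9-2021-0216">A9-0216/2021</a>'
    " - Nicolás González Casares - Accord provisoire - Am 156"
    "</RollCallVote.Description.Text>"
)

element = ET.fromstring(snippet)
link = element.find("a")

before_link = element.text      # text before the first child element
link_text = link.text           # text inside <a>...</a>
after_link = link.tail          # text after </a>, up to the parent's end tag
uri = link.get("redmap-uri")    # get() reads an attribute value
missing = element.get("a")      # None: 'a' is a child element, not an attribute

print(repr(before_link))
print(repr(link_text))
print(repr(after_link))
```

So get() is for attributes only, and element.text alone can never reach past the first child; the link text and the trailing text have to come from the a element's text and tail.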
