How do I get 2 lists of tag contents from XML, including missing values?

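I have an XML file with a few thousand records, from which I want to extract data.

I want a dataframe with every city next to its library code. If the library code (code="g") is missing, I want NaN or something else that indicates there is no value. For example: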

df = {'Cities': [Berlin, London], 'Codes': [D-Bbbf, NaN]}

Here is a piece of the XML:

<marc:record>
  <marc:controlfield tag="001">39612</marc:controlfield>
  <marc:controlfield tag="003">DE-633</marc:controlfield>
  <marc:controlfield tag="005">20161109000000.0</marc:controlfield>
  <marc:controlfield tag="008">161109n|||||||a|||              a</marc:controlfield>
  <marc:datafield tag="110" ind1="2" ind2=" ">
    <marc:subfield code="a">Bibliothek für Bildungsgeschichtliche Forschung</marc:subfield>
    <marc:subfield code="c">Berlin</marc:subfield>
    <marc:subfield code="g">D-Bbbf</marc:subfield>
  </marc:datafield>
</marc:record><marc:record>
  <marc:controlfield tag="001">30006648</marc:controlfield>
  <marc:controlfield tag="003">DE-633</marc:controlfield>
  <marc:controlfield tag="005">20161109000000.0</marc:controlfield>
  <marc:datafield tag="110" ind1="2" ind2=" ">
    <marc:subfield code="a">The National Archives</marc:subfield>
    <marc:subfield code="c">London</marc:subfield>
  </marc:datafield>
</marc:record> 

This is what I tried:

# Import BeautifulSoup
from bs4 import BeautifulSoup

Data= {'Cities':[],
        'Code':[]}

# Read the XML file
with open('oefen.xml', 'r', encoding="utf8") as f_in:
    soup = BeautifulSoup(f_in.read(), 'html.parser')   
    
for record in soup.find_all(tag="110"):
    find = record.find_all('[code="g"]')

for code in record:
    if find is not None:
            City = record.select_one('[code="c"]') # select city
            Code = record.select_one('[code="g"]') # select code
            Data['Cities'].append(City.get_text(strip=True))
            Data['Code'].append(Code.get_text(strip=True))      
    else:
        print(NaN)
print(Data)

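I don't think the separate lists are necessary; a single list of dicts is easier to work with. While iterating over the records, check whether the element you are looking for exists, and append either its text or None: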

for record in soup.find_all('marc:record'):
    data.append({
        'City' : e.get_text(strip=True) if (e := record.select_one('[code="c"]')) else None, # select city
        'Code' : e.get_text(strip=True) if (e := record.select_one('[code="g"]')) else None  # select code
    })

Example

xml='''
<marc:record>
  <marc:controlfield tag="001">39612</marc:controlfield>
  <marc:controlfield tag="003">DE-633</marc:controlfield>
  <marc:controlfield tag="005">20161109000000.0</marc:controlfield>
  <marc:controlfield tag="008">161109n|||||||a|||              a</marc:controlfield>
  <marc:datafield tag="110" ind1="2" ind2=" ">
    <marc:subfield code="a">Bibliothek für Bildungsgeschichtliche Forschung</marc:subfield>
    <marc:subfield code="c">Berlin</marc:subfield>
    <marc:subfield code="g">D-Bbbf</marc:subfield>
  </marc:datafield>
</marc:record><marc:record>
  <marc:controlfield tag="001">30006648</marc:controlfield>
  <marc:controlfield tag="003">DE-633</marc:controlfield>
  <marc:controlfield tag="005">20161109000000.0</marc:controlfield>
  <marc:datafield tag="110" ind1="2" ind2=" ">
    <marc:subfield code="a">The National Archives</marc:subfield>
    <marc:subfield code="c">London</marc:subfield>
  </marc:datafield>
</marc:record>
'''

# Import BeautifulSoup and pandas (pd is needed for the DataFrame below)
from bs4 import BeautifulSoup
import pandas as pd

data = []

soup = BeautifulSoup(xml,'lxml')
for record in soup.find_all('marc:record'):
    data.append({
        'City' : e.get_text(strip=True) if (e := record.select_one('[code="c"]')) else None, # select city
        'Code' : e.get_text(strip=True) if (e := record.select_one('[code="g"]')) else None  # select code
    })

pd.DataFrame(data)

Output

City Code
Berlin D-Bbbf
London None
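
Since the question asks for two lists with NaN where the code is missing, the list of dicts can be split afterwards. This is only a minimal sketch: it assumes data has the shape shown in the output above, and the names NAN, cities, codes and result are illustrative, not part of the original code.

```python
# Assumed contents of `data` after the loop (copied from the output shown above)
data = [
    {"City": "Berlin", "Code": "D-Bbbf"},
    {"City": "London", "Code": None},
]

NAN = float("nan")  # placeholder for a missing value

# One list per column; None is swapped for NaN
cities = [row["City"] if row["City"] is not None else NAN for row in data]
codes = [row["Code"] if row["Code"] is not None else NAN for row in data]

result = {"Cities": cities, "Codes": codes}
print(result)  # {'Cities': ['Berlin', 'London'], 'Codes': ['D-Bbbf', nan]}
```

A dict of equal-length lists like result can also be passed directly to pd.DataFrame if the two-column layout from the question is preferred.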

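EDIT

If you are not on a recent Python version (the walrus operator := requires Python 3.8 or later), this is an alternative to the walrus-operator check: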

...
data = []

soup = BeautifulSoup(xml,'lxml')
for record in soup.find_all('marc:record'):
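    # select_one() returns None when nothing matches,
    # so calling .get_text() on it raises AttributeError
    try:
        city = record.select_one('[code="c"]').get_text(strip=True)
    except AttributeError:
        city = None
    try:
        code = record.select_one('[code="g"]').get_text(strip=True)
    except AttributeError:
        code = None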
    data.append({
        'City' : city,
        'Code' : code
    })

pd.DataFrame(data)
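
For comparison, the same "text or None" check can be sketched with the standard library's xml.etree.ElementTree instead of BeautifulSoup. This is a different technique from the one above, and it rests on an assumption the excerpt does not confirm: that the full file wraps the records in a root element declaring the marc prefix with the MARC21 slim namespace. The marc:collection wrapper, the shortened sample and the helper name subfield_text are illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: the real file binds the marc prefix to this namespace on its root
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

xml_doc = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
<marc:record>
  <marc:datafield tag="110" ind1="2" ind2=" ">
    <marc:subfield code="c">Berlin</marc:subfield>
    <marc:subfield code="g">D-Bbbf</marc:subfield>
  </marc:datafield>
</marc:record>
<marc:record>
  <marc:datafield tag="110" ind1="2" ind2=" ">
    <marc:subfield code="c">London</marc:subfield>
  </marc:datafield>
</marc:record>
</marc:collection>"""


def subfield_text(record, code):
    """Return the stripped text of the first subfield with this code, or None."""
    elem = record.find(f".//marc:subfield[@code='{code}']", NS)
    if elem is None or not elem.text:
        return None
    return elem.text.strip()


root = ET.fromstring(xml_doc)
rows = [
    {"City": subfield_text(record, "c"), "Code": subfield_text(record, "g")}
    for record in root.findall("marc:record", NS)
]
print(rows)
# [{'City': 'Berlin', 'Code': 'D-Bbbf'}, {'City': 'London', 'Code': None}]
```

If the file has no namespace declaration at all, ElementTree will refuse to parse it, and the BeautifulSoup version above is the more forgiving choice.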