How do I get 2 lists with tag contents from the XML that include missing values?
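I have an XML file with a few thousand records, from which I want to extract:
- City: tag 110, code c (e.g. Berlin)
- Library code: tag 110, code g (e.g. D-Bbbf)
I want a DataFrame with every city next to its library code. If the library code (code="g") is missing, I want NaN or something else that indicates there is no value. For example:
df = {'Cities': [Berlin, London], 'Codes': [D-Bbbf, NaN]}
This is a piece of the XML: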
<marc:record>
<marc:controlfield tag="001">39612</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:controlfield tag="008">161109n|||||||a||| a</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">Bibliothek für Bildungsgeschichtliche Forschung</marc:subfield>
<marc:subfield code="c">Berlin</marc:subfield>
<marc:subfield code="g">D-Bbbf</marc:subfield>
</marc:datafield>
</marc:record><marc:record>
<marc:controlfield tag="001">30006648</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">The National Archives</marc:subfield>
<marc:subfield code="c">London</marc:subfield>
</marc:datafield>
</marc:record>
This is what I tried:
# Import BeautifulSoup
from bs4 import BeautifulSoup

Data = {'Cities': [],
        'Code': []}

# Read the XML file
with open('oefen.xml', 'r', encoding="utf8") as f_in:
    soup = BeautifulSoup(f_in.read(), 'html.parser')

for record in soup.find_all(tag="110"):
    find = record.find_all('[code="g"]')
    for code in record:
        if find is not None:
            City = record.select_one('[code="c"]')  # select city
            Code = record.select_one('[code="g"]')  # select code
            Data['Cities'].append(City.get_text(strip=True))
            Data['Code'].append(Code.get_text(strip=True))
        else:
            print(NaN)
print(Data)
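I don't think the separate lists are necessary; a single list of dicts is easier. While iterating over the records, check whether the element you are looking for exists, and append either its text or None: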
for record in soup.find_all('marc:record'):
    data.append({
        'City': e.get_text(strip=True) if (e := record.select_one('[code="c"]')) else None,  # select city
        'Code': e.get_text(strip=True) if (e := record.select_one('[code="g"]')) else None   # select code
    })
Example
xml='''
<marc:record>
<marc:controlfield tag="001">39612</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:controlfield tag="008">161109n|||||||a||| a</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">Bibliothek für Bildungsgeschichtliche Forschung</marc:subfield>
<marc:subfield code="c">Berlin</marc:subfield>
<marc:subfield code="g">D-Bbbf</marc:subfield>
</marc:datafield>
</marc:record><marc:record>
<marc:controlfield tag="001">30006648</marc:controlfield>
<marc:controlfield tag="003">DE-633</marc:controlfield>
<marc:controlfield tag="005">20161109000000.0</marc:controlfield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="a">The National Archives</marc:subfield>
<marc:subfield code="c">London</marc:subfield>
</marc:datafield>
</marc:record>
'''
# Import BeautifulSoup and pandas
from bs4 import BeautifulSoup
import pandas as pd

data = []
soup = BeautifulSoup(xml, 'lxml')

for record in soup.find_all('marc:record'):
    data.append({
        'City': e.get_text(strip=True) if (e := record.select_one('[code="c"]')) else None,  # select city
        'Code': e.get_text(strip=True) if (e := record.select_one('[code="g"]')) else None   # select code
    })

pd.DataFrame(data)
Output
City | Code |
---|---|
Berlin | D-Bbbf |
London | None |
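If you really need the two separate lists from the question, with NaN instead of None, they can be derived from the list of dicts afterwards. This is only a minimal sketch: the `data` literal below stands in for the result of the loop above, and the names `cities` and `codes` are my own.

```python
import math

# Stand-in for the list of dicts built while iterating over the records
data = [
    {'City': 'Berlin', 'Code': 'D-Bbbf'},
    {'City': 'London', 'Code': None},
]

nan = float('nan')

# One list per column; a missing value (None) becomes NaN
cities = [row['City'] if row['City'] is not None else nan for row in data]
codes = [row['Code'] if row['Code'] is not None else nan for row in data]

print(cities)
print(codes)
```

Both lists keep the record order, so they can be passed to `pd.DataFrame({'Cities': cities, 'Codes': codes})` if you prefer the layout from the question.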
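Edit
If you are not on a recent Python version (the walrus operator requires Python 3.8 or later), this is an alternative that does the check without the walrus operator: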
...
data = []
soup = BeautifulSoup(xml, 'lxml')

for record in soup.find_all('marc:record'):
    # select_one() returns None when nothing matches,
    # so calling .get_text() on it raises AttributeError
    try:
        city = record.select_one('[code="c"]').get_text(strip=True)
    except AttributeError:
        city = None
    try:
        code = record.select_one('[code="g"]').get_text(strip=True)
    except AttributeError:
        code = None
    data.append({
        'City': city,
        'Code': code
    })

pd.DataFrame(data)
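Note that `[code="c"]` matches a subfield with that code in any datafield of the record, while the question asks specifically for tag 110. A possible variant, only a sketch under a few assumptions: `subfield_text` is a hypothetical helper name, the tag 100 datafield in the sample is made up to show the difference, and I use the built-in 'html.parser' here just to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Shortened sample; the tag="100" datafield is invented for illustration
xml = '''
<marc:record>
<marc:datafield tag="100" ind1=" " ind2=" ">
<marc:subfield code="c">Not a city</marc:subfield>
</marc:datafield>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="c">Berlin</marc:subfield>
<marc:subfield code="g">D-Bbbf</marc:subfield>
</marc:datafield>
</marc:record><marc:record>
<marc:datafield tag="110" ind1="2" ind2=" ">
<marc:subfield code="c">London</marc:subfield>
</marc:datafield>
</marc:record>
'''


def subfield_text(record, code, tag='110'):
    """Hypothetical helper: text of the first subfield with the given code
    inside a datafield with the given tag, or None if there is none."""
    element = record.select_one(f'[tag="{tag}"] [code="{code}"]')
    return element.get_text(strip=True) if element is not None else None


soup = BeautifulSoup(xml, 'html.parser')

data = [
    {'City': subfield_text(record, 'c'), 'Code': subfield_text(record, 'g')}
    for record in soup.find_all('marc:record')
]

print(data)
```

The explicit `is not None` check also avoids both the walrus operator and the try/except blocks, so it should work on older Python versions as well.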