How to read an XML file with Spark that contains multiple namespaces?
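I'm using the spark-xml library in Azure Databricks, but I can't find the right options to read this kind of file, which contains multiple namespaces.
So I'm looking for help expressing this through the reader options, or through any other approach.
Here is a stripped-down sample.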
<msg:TrainTrackingMessage xmlns:msg="be:brail:nmbs-it:esb:msg:traintraffic" xmlns:trtf="be:brail:nmbs-it:esb:traintraffic" xmlns:gene="be:brail:nmbs-it:esb:generalelements">
  <gene:Event>
    <gene:EventType>tracking</gene:EventType>
    <gene:EventMessage>TrainTracking</gene:EventMessage>
    <gene:EventTimeStamp>2018-09-27T14:13:15.458439</gene:EventTimeStamp>
  </gene:Event>
  <gene:Train>
    <gene:TrainKey>
      <gene:CirculationType>1</gene:CirculationType>
      <gene:Discriminator>0</gene:Discriminator>
      <gene:DepartureDate>2018-09-27</gene:DepartureDate>
    </gene:TrainKey>
    <gene:TrainNumberEBP>2E0xaZ12</gene:TrainNumberEBP>
    <gene:TrainDetails>
      <gene:TrainGroup>1</gene:TrainGroup>
    </gene:TrainDetails>
  </gene:Train>
  <trtf:TrainTracking>
    <gene:ItineraryPoint>
      <gene:PtcarIdentification>592</gene:PtcarIdentification>
      <gene:OrderNumber>150</gene:OrderNumber>
      <gene:ItineraryPointDetails>
        <gene:OperationCode>=</gene:OperationCode>
        <gene:CommercialStop>2</gene:CommercialStop>
      </gene:ItineraryPointDetails>
      <gene:ItineraryPointTimeInfo>
        <gene:ArrivalTime>14:10:47</gene:ArrivalTime>
        <gene:DepartureTime>14:10:54</gene:DepartureTime>
      </gene:ItineraryPointTimeInfo>
      <gene:ItineraryTechnicalInfo>
        <gene:EngineType>21</gene:EngineType>
        <gene:TractionCode>E</gene:TractionCode>
        <gene:TractionOperator/>
      </gene:ItineraryTechnicalInfo>
    </gene:ItineraryPoint>
    <trtf:GPSPosition>
      <trtf:GPSAltitude>51</trtf:GPSAltitude>
    </trtf:GPSPosition>
    <trtf:Libelle>E2412</trtf:Libelle>
    <trtf:TrackingPointInfo>
      <trtf:TrackingType>2</trtf:TrackingType>
      <trtf:TrackingOrigin>0</trtf:TrackingOrigin>
    </trtf:TrackingPointInfo>
    <trtf:TrackingTimeInfo>
      <trtf:Delay>1639</trtf:Delay>
    </trtf:TrackingTimeInfo>
  </trtf:TrainTracking>
In case anyone is looking for something similar, this did the trick.
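import re
import xml.etree.ElementTree as ET

import pandas as pd

# storage_mount_name is the DBFS mount path holding the XML files (defined elsewhere)
xmlfiles = dbutils.fs.ls(storage_mount_name)
# Map each DBFS URI to its local /dbfs path so ElementTree can open it
paths = [xf.path.replace('dbfs:', '/dbfs') for xf in xmlfiles]

# Get attribute names (for now I took all leaves of the XML structure)
root = ET.parse(paths[0]).getroot()
attributes = [node.tag for node in root.iter() if len(node) == 0]
# Strip the '{namespace-uri}' prefix that ElementTree puts in front of each tag
clean_attribute_names = [re.sub(r'\{.*\}', '', a) for a in attributes]

# Create the DataFrame: one row per file, one column per leaf element
# (writing it out as CSV is not shown here)
df = pd.DataFrame(columns=clean_attribute_names, index=paths)
for afile in paths:
    root = ET.parse(afile).getroot()
    # Assumes every file has the same leaf elements as the first one
    df.loc[afile] = [node.text for node in root.iter() if node.tag in attributes]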
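
For anyone who wants to try the namespace handling without a Databricks mount, here is a minimal sketch of the same idea on an inline string. The shortened `SAMPLE` message and the helper names `local_name` and `leaf_values` are made up for illustration and are not part of the code above. It relies on ElementTree expanding every prefixed tag to `{namespace-uri}LocalName`, so removing the braced part leaves the plain element name.

```python
import re
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the sample message (hypothetical test input)
SAMPLE = """<msg:TrainTrackingMessage
    xmlns:msg="be:brail:nmbs-it:esb:msg:traintraffic"
    xmlns:trtf="be:brail:nmbs-it:esb:traintraffic"
    xmlns:gene="be:brail:nmbs-it:esb:generalelements">
  <gene:Event>
    <gene:EventType>tracking</gene:EventType>
  </gene:Event>
  <trtf:TrainTracking>
    <gene:ItineraryPoint>
      <gene:PtcarIdentification>592</gene:PtcarIdentification>
      <gene:ItineraryTechnicalInfo>
        <gene:TractionOperator/>
      </gene:ItineraryTechnicalInfo>
    </gene:ItineraryPoint>
    <trtf:TrackingTimeInfo>
      <trtf:Delay>1639</trtf:Delay>
    </trtf:TrackingTimeInfo>
  </trtf:TrainTracking>
</msg:TrainTrackingMessage>"""


def local_name(tag):
    """Drop the '{namespace-uri}' prefix that ElementTree adds to a tag."""
    return re.sub(r'^\{[^}]*\}', '', tag)


def leaf_values(xml_text):
    """Return {local tag name: text} for every element without children."""
    root = ET.fromstring(xml_text)
    return {local_name(node.tag): node.text
            for node in root.iter() if len(node) == 0}


row = leaf_values(SAMPLE)
print(row)
```

Empty elements such as `TractionOperator` come back as `None`. Because the result is keyed by local name, two leaves with the same name in different namespaces (or repeated in the document) would overwrite each other, so this only suits messages where leaf names are unique.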