Learn the structure of an XML file in preparation for CSV or RDF conversion

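I would like to convert NCBI's BioSample metadata XML file to CSV, or to RDF/XML as a second choice. To do that, I believe I need to learn more about the structure of the file. I can run basic XQueries in BaseX*, such as listing all <Id> values, but I then fall back on shell tools like sort | uniq -c to count them. I have also heard of XSLT transformations and GRDDL, but I don't think a stylesheet is provided for this XML document, and I don't know how to create or discover one.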

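For example, can I count the occurrences of each <Id> value? Do any <BioSample> elements have more than one primary <Id>? What is the most common db attribute among the primary Ids?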
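One way to approach the last two questions without loading the whole file is to stream it record by record in Python. The following is only a sketch, not something from the original post: it assumes the layout of the sample record shown in this post (BioSampleSet/BioSample/Ids/Id, with is_primary="1" marking a primary Id), and the function name primary_id_stats is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def primary_id_stats(source):
    """Stream BioSample records and summarise their primary <Id> elements.

    Returns two Counters:
      - how many records have 0, 1, 2, ... primary Ids
      - how often each db value occurs on a primary Id
    """
    primaries_per_record = Counter()
    primary_db = Counter()

    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element

    for event, elem in context:
        if event == "end" and elem.tag == "BioSample":
            primaries = [
                i for i in elem.iterfind("Ids/Id") if i.get("is_primary") == "1"
            ]
            primaries_per_record[len(primaries)] += 1
            for i in primaries:
                primary_db[i.get("db", "")] += 1
            # Drop finished records from the root so memory stays bounded.
            root.clear()

    return primaries_per_record, primary_db
```

Any key other than 1 in the first Counter points at records with zero or several primary Ids, and primary_db.most_common(1) gives the most frequent db value. The source argument can be a file name or an open binary file object, for example one returned by gzip.open.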

Here is a query that shows the upper limit of my current XQuery sophistication:

let $sep := '|'
for $bs in doc('biosample_set')/BioSampleSet/BioSample
(: multiple Id elements, potentially with db, is_primary and db_label attributes :)
let $id := $bs/Ids/Id[@is_primary="1"]
(: Description also has Comment/Paragraph elements :)
let $dt := $bs/Description/Title
let $ti := $bs/Description/Organism/@taxonomy_id
let $mm := $bs/Models/Model

return string-join(
       (
         data($id),
         data($dt),
         data($mm),
         data($ti)
       ),
       $sep)

In short, I would appreciate XQuery snippets or any other advice that would help.

Or, as a single upvotable task: count the occurrences of each db attribute value on the <Id> elements and serialize the result as CSV.
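For reference, that task can be sketched in Python with the standard library alone. This is an illustration under assumptions rather than a tested solution for the full file: it assumes every <Id> sits at BioSample/Ids/Id as in the sample record, counts a missing db attribute under the empty string, and uses the made-up names count_id_dbs and write_counts_csv.

```python
import csv
import xml.etree.ElementTree as ET
from collections import Counter


def count_id_dbs(source):
    """Count the db attribute values of all BioSample/Ids/Id elements."""
    counts = Counter()
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # keep a handle on the root element

    for event, elem in context:
        if event == "end" and elem.tag == "BioSample":
            for id_elem in elem.iterfind("Ids/Id"):
                counts[id_elem.get("db", "")] += 1
            root.clear()  # discard the finished record to bound memory use
    return counts


def write_counts_csv(counts, out):
    """Write a db,count table to a text file object, most frequent db first."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["db", "count"])
    for db, n in counts.most_common():
        writer.writerow([db, n])
```

A possible call, assuming the compressed download is in the working directory, is count_id_dbs(gzip.open("biosample_set.xml.gz", "rb")) followed by write_counts_csv(counts, sys.stdout). Reading the gzip stream directly avoids unpacking the 46 GB file first.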

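I have seen some American and European efforts to convert BioSample metadata documents to RDF, but they do not appear to be up to date, maintained, or well funded (even though they come from well-regarded teams).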

*I have also used eXist-db, Python's xml.etree.ElementTree, and the related lxml methods, but I ran into problems loading or processing this 46 GB (unpacked) file.
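Loading problems of this kind usually come from building the whole tree in memory. A streaming pass that discards each record once it has been seen can still profile the structure, that is, list which element and attribute paths occur and how often. The sketch below is an assumption-laden illustration (the name profile_paths and the limit parameter are invented here), not a measured solution for the full file.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def profile_paths(source, limit=None):
    """Count element paths and attribute paths while streaming the file.

    Keys look like 'BioSampleSet/BioSample/Ids/Id' for elements and
    'BioSampleSet/BioSample/Ids/Id/@db' for attributes. If limit is given,
    stop after that many direct children of the root have been read.
    """
    counts = Counter()
    stack = []      # tags of the currently open elements
    root = None
    records = 0

    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem
            stack.append(elem.tag)
            path = "/".join(stack)
            counts[path] += 1
            for name in elem.attrib:
                counts[path + "/@" + name] += 1
        else:
            stack.pop()
            if len(stack) == 1:  # a direct child of the root just closed
                root.clear()     # forget it so memory stays bounded
                records += 1
                if limit is not None and records >= limit:
                    break
    return counts
```

Running it with a small limit first gives a quick picture of the schema, and comparing the count for a path with the count for BioSampleSet/BioSample shows which elements are optional or repeated.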

<?xml version="1.0" encoding="UTF-8"?>
<BioSampleSet>
<BioSample submission_date="2008-04-04T08:44:24.950" last_update="2019-06-20T16:11:22.271" publication_date="2008-04-04T00:00:00.000" access="public" id="2" accession="SAMN00000002">
  <Ids>
    <Id db="BioSample" is_primary="1">SAMN00000002</Id>
    <Id db="WUGSC" db_label="Sample name">19655</Id>
    <Id db="SRA">SRS000002</Id>
  </Ids>
  <Description>
    <Title>Alistipes putredinis DSM 17216</Title>
    <Organism taxonomy_id="445970" taxonomy_name="Alistipes putredinis DSM 17216"/>
    <Comment>
      <Paragraph>Alistipes putredinis (GenBank Accession Number for 16S rDNA gene: L16497) is a member of the Bacteroidetes division of the domain bacteria and has been isolated from human feces. It has been found in 16S rDNA sequence-based enumerations of the colonic microbiota of adult humans (Eckburg et. al. (2005), Ley et. al. (2006)). </Paragraph>
      <Paragraph>Keywords: GSC:MIxS;MIGS:5.0</Paragraph>
    </Comment>
  </Description>
  <Owner>
    <Name abbreviation="WUGSC">Washington University, Genome Sequencing Center</Name>
    <Contacts>
      <Contact email="lims@genome.wustl.edu"/>
    </Contacts>
  </Owner>
  <Models>
    <Model>MIGS.ba</Model>
  </Models>
  <Package display_name="MIGS: cultured bacteria/archaea; version 5.0">MIGS.ba.5.0</Package>
  <Attributes>
    <Attribute attribute_name="finishing strategy (depth of coverage)">Level 3: Improved-High-Quality Draft11.6x;20</Attribute>
    <Attribute attribute_name="collection date" harmonized_name="collection_date" display_name="collection date">not determined</Attribute>
    <Attribute attribute_name="estimated_size" harmonized_name="estimated_size" display_name="estimated size">2550000</Attribute>
    <Attribute attribute_name="sop">http://hmpdacc.org/doc/CommonGeneAnnotation_SOP.pdf</Attribute>
    <Attribute attribute_name="project_type">Reference Genome</Attribute>
    <Attribute attribute_name="host" harmonized_name="host" display_name="host">Homo sapiens</Attribute>
    <Attribute attribute_name="lat_lon" harmonized_name="lat_lon" display_name="latitude and longitude">not determined</Attribute>
    <Attribute attribute_name="biome" harmonized_name="env_broad_scale" display_name="broad-scale environmental context">terrestrial biome [ENVO:00000446]</Attribute>
    <Attribute attribute_name="misc_param: HMP body site">not determined</Attribute>
    <Attribute attribute_name="nucleic acid extraction">not determined</Attribute>
    <Attribute attribute_name="feature" harmonized_name="env_local_scale" display_name="local-scale environmental context">human-associated habitat [ENVO:00009003]</Attribute>
    <Attribute attribute_name="investigation_type" harmonized_name="investigation_type" display_name="investigation type">missing</Attribute>
    <Attribute attribute_name="host taxid" harmonized_name="host_taxid" display_name="host taxonomy ID">9606</Attribute>
    <Attribute attribute_name="project_name" harmonized_name="project_name" display_name="project name">Alistipes putredinis DSM 17216</Attribute>
    <Attribute attribute_name="assembly">PCAP</Attribute>
    <Attribute attribute_name="geo_loc_name" harmonized_name="geo_loc_name" display_name="geographic location">not determined</Attribute>
    <Attribute attribute_name="source_mat_id" harmonized_name="source_material_id" display_name="source material identifiers">DSM 17216, CCUG 45780, CIP 104286, ATCC 29800, Carlier 10203, VPI 3293</Attribute>
    <Attribute attribute_name="material" harmonized_name="env_medium" display_name="environmental medium">biological product [ENVO:02000043]</Attribute>
    <Attribute attribute_name="ref_biomaterial" harmonized_name="ref_biomaterial" display_name="reference for biomaterial">not determined</Attribute>
    <Attribute attribute_name="misc_param: HMP supersite">gastrointestinal_tract</Attribute>
    <Attribute attribute_name="num_replicons" harmonized_name="num_replicons" display_name="number of replicons">not determined</Attribute>
    <Attribute attribute_name="sequencing method">454-GS20, Sanger</Attribute>
    <Attribute attribute_name="isol_growth_condt" harmonized_name="isol_growth_condt" display_name="isolation and growth condition">not determined</Attribute>
    <Attribute attribute_name="env_package" harmonized_name="env_package" display_name="environmental package">missing</Attribute>
    <Attribute attribute_name="strain" harmonized_name="strain" display_name="strain">DSM 17216</Attribute>
    <Attribute attribute_name="isolation-source" harmonized_name="isolation_source" display_name="isolation source">missing</Attribute>
    <Attribute attribute_name="type-material">type strain of Alistipes putredinis</Attribute>
  </Attributes>
  <Links>
    <Link type="url" label="DNA Source">http://www.dsmz.de/catalogues/details/culture/DSM-17216</Link>
    <Link type="entrez" target="bioproject">19655</Link>
  </Links>
  <Status status="live" when="2013-08-05T10:18:49"/>
</BioSample>
</BioSampleSet>

Similar to my answer at https://www.biostars.org/p/280581/, using my tool xsltstream:

$ wget -q -O - "http://ftp.ncbi.nlm.nih.gov//biosample/biosample_set.xml.gz" | gunzip -c | java -jar dist/xsltstream.jar -n BioSample -t ~/jeter.xsl |  head
SAMN00000002        SRS000002   Alistipes putredinis DSM 17216  445970  MIGS.ba 
SAMN00000003        SRS000003   Anaerotruncus colihominis DSM 17241 445972  MIGS.ba 
SAMN00000004        SRS000004   MIGS Cultured Bacterial/Archaeal sample from Bacteroides stercoris ATCC 43183   449673  MIGS.ba 
SAMN00000005        SRS000005   Generic sample from Biomphalaria glabrata   6526    Generic 
SAMN00000006        SRS000006   Generic sample from Callithrix jacchus  9483    Generic 
SAMN00000007        SRS000007   Clostridium ramosum DSM 1402    445974  MIGS.ba 
SAMN00000008        SRS000008   MIGS Cultured Bacterial/Archaeal sample from Dorea formicigenerans ATCC 27755   411461  MIGS.ba 
SAMN00000009        SRS000009   Generic sample from Monodelphis domestica   13616   Generic 
SAMN00000010        SRS000010   Generic sample from Ruminococcus sp. GM2/1  451639  Generic 
SAMN00000011        SRS000011   Generic sample from Roseburia faecis M72/1  451638  Generic 

with "jeter.xsl":

<?xml version='1.0' encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
<xsl:output method="text" encoding="UTF-8"/>
<!-- one tab-separated line per BioSample record; &#9; is a tab, &#10; a newline -->
<xsl:template match="BioSample">
<xsl:value-of select="Ids/Id[@db='BioSample']/text()"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Ids/Id[@db='UGAML']/text()"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Ids/Id[@db='SRA']/text()"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Description/Title/text()"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Description/Organism/@taxonomy_id"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Models/Model/text()"/>
<xsl:text>&#9;</xsl:text>
<xsl:text>&#10;</xsl:text>
</xsl:template>

</xsl:stylesheet>
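For readers who would rather stay in Python, the same per-record extraction can be sketched with xml.etree.ElementTree.iterparse. This is a rough equivalent of the stylesheet, not part of xsltstream: the function names biosample_rows and write_tsv are made up, the columns mirror the stylesheet (BioSample Id, UGAML Id, SRA Id, title, taxonomy id, model), and unlike the stylesheet it writes no trailing tab.

```python
import csv
import xml.etree.ElementTree as ET


def biosample_rows(source):
    """Yield one list of column values per BioSample record, streaming."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # keep the root so finished records can be dropped

    for event, elem in context:
        if event == "end" and elem.tag == "BioSample":
            ids = {}
            for id_elem in elem.iterfind("Ids/Id"):
                # keep the first Id seen for each db value
                ids.setdefault(id_elem.get("db"), id_elem.text or "")
            organism = elem.find("Description/Organism")
            yield [
                ids.get("BioSample", ""),
                ids.get("UGAML", ""),
                ids.get("SRA", ""),
                elem.findtext("Description/Title", default=""),
                organism.get("taxonomy_id", "") if organism is not None else "",
                elem.findtext("Models/Model", default=""),
            ]
            root.clear()  # runs when the consumer asks for the next row


def write_tsv(rows, out):
    """Write the rows as tab-separated text to a text file object."""
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
```

Passing the object returned by gzip.open("biosample_set.xml.gz", "rb") as source would process the compressed download without unpacking it, and switching the delimiter to a comma turns the output into CSV.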