Parsing XML from an NCBI Entrez record in R
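I want to extract some information from the features section of an NCBI record, and I am using this code.
Download the data:
library(rentrez)
library(XML)
fetch2 <- entrez_fetch(db = "nucleotide", id = 1028916732,
                       rettype = "gbc", retmode = "xml", parsed = TRUE)
Parse the data: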
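xmltop <- xmlRoot(fetch2)  # root node of the parsed document
class(xmltop)              # "XMLInternalElementNode" "XMLInternalNode" "XMLAbstractNode"
xmlName(xmltop)            # name of the root element
xmlSize(xmltop)            # number of child nodes of the root
xmlName(xmltop[[1]])       # name of the first child
features <- xmltop[[1]][[20]][[1]][[4]]  # positional path to the node of interest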
I am only interested in the features:
<INSDFeature_quals>
<INSDQualifier>
<INSDQualifier_name>organism</INSDQualifier_name>
<INSDQualifier_value>Alanphillipsia aloeigena</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>mol_type</INSDQualifier_name>
<INSDQualifier_value>genomic DNA</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>strain</INSDQualifier_name>
<INSDQualifier_value>CPC 21286</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>isolation_source</INSDQualifier_name>
<INSDQualifier_value>leaves</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>host</INSDQualifier_name>
<INSDQualifier_value>Aloe melanacantha</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>culture_collection</INSDQualifier_name>
<INSDQualifier_value>CBS:136408</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>culture_collection</INSDQualifier_name>
<INSDQualifier_value>CPC:21286</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>type_material</INSDQualifier_name>
<INSDQualifier_value>culture from holotype of Alanphillipsia aloeigena</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>db_xref</INSDQualifier_name>
<INSDQualifier_value>taxon:1414674</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>country</INSDQualifier_name>
<INSDQualifier_value>South Africa: Namakwaland, Koegap Nature Reserve</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>collected_by</INSDQualifier_name>
<INSDQualifier_value>M.J. Wingfield</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>note</INSDQualifier_name>
<INSDQualifier_value>ex-holotype culture of Alanphillipsia aloeigena</INSDQualifier_value>
</INSDQualifier>
</INSDFeature_quals>
I want to create a table like this:
Organism | culture_collection | host
Alanphillipsia aloeigena | CBS:136408 | Aloe melanacantha
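But I don't understand how to retrieve the data using <INSDQualifier_name> and <INSDQualifier_value>.
I have looked at some tutorials for PubMed, which work well, but the output structure there is different.
Finally, I want to write a loop that extracts the data for a list of IDs. Since not all records have the same structure, I want to retrieve the information by tag names such as host and organism.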
Since your XML is fairly flat, consider the XML package's convenience handler, xmlToDataFrame:
library(XML)
fetch2 <- ...
doc <- xmlParse(fetch2)
df <- xmlToDataFrame(doc, nodes=getNodeSet(doc, "//INSDQualifier"))
df
# INSDQualifier_name INSDQualifier_value
# 1 organism Alanphillipsia aloeigena
# 2 mol_type genomic DNA
# 3 strain CPC 21286
# 4 isolation_source leaves
# 5 host Aloe melanacantha
# 6 culture_collection CBS:136408
# 7 culture_collection CPC:21286
# 8 type_material culture from holotype of Alanphillipsia aloeigena
# 9 db_xref taxon:1414674
# 10 country South Africa: Namakwaland, Koegap Nature Reserve
# 11 collected_by M.J. Wingfield
# 12 note ex-holotype culture of Alanphillipsia aloeigena
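Because the question asks for lookup by tag names such as host and organism, an XPath predicate on INSDQualifier_name is one way to pull a single qualifier without building the whole data frame. The sketch below is a minimal illustration, not the answer's original method: get_qualifier() is a hypothetical helper, and the inline XML is a trimmed copy of the record from the question used in place of a live download.

```r
library(XML)

# Trimmed sample with the same structure as the record in the question.
xml_text <- '<INSDFeature_quals>
  <INSDQualifier>
    <INSDQualifier_name>organism</INSDQualifier_name>
    <INSDQualifier_value>Alanphillipsia aloeigena</INSDQualifier_value>
  </INSDQualifier>
  <INSDQualifier>
    <INSDQualifier_name>host</INSDQualifier_name>
    <INSDQualifier_value>Aloe melanacantha</INSDQualifier_value>
  </INSDQualifier>
</INSDFeature_quals>'
doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: return every value whose sibling INSDQualifier_name
# equals `name`, or NA when the record has no such qualifier.
get_qualifier <- function(doc, name) {
  xpath <- sprintf(
    "//INSDQualifier[INSDQualifier_name='%s']/INSDQualifier_value", name)
  values <- as.character(xpathSApply(doc, xpath, xmlValue))
  if (length(values) == 0) NA_character_ else values
}

get_qualifier(doc, "host")    # qualifier present in the sample
get_qualifier(doc, "strain")  # qualifier absent from the sample, so NA
```

Note that the `//INSDQualifier` path searches the whole document, so on a full record with several features it would return matches from all of them.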
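Then, if each row above should instead be a column holding the corresponding value, transpose the data frame and clean up the column names and row names: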
final_df <- data.frame(t(df), stringsAsFactors = FALSE)
colnames(final_df) <- as.character(final_df[1,])
final_df <- final_df[-1,]
rownames(final_df) <- NULL
final_df
# organism mol_type strain isolation_source host culture_collection culture_collection type_material
# 1 Alanphillipsia aloeigena genomic DNA CPC 21286 leaves Aloe melanacantha CBS:136408 CPC:21286 culture from holotype of Alanphillipsia aloeigena
# db_xref country collected_by note
# 1 taxon:1414674 South Africa: Namakwaland, Koegap Nature Reserve M.J. Wingfield ex-holotype culture of Alanphillipsia aloeigena
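The transposed result has two columns named culture_collection, and other records may lack some qualifiers altogether. For the loop over a list of IDs that the question mentions, one possible approach is to query only the wanted tags per record, join repeated values, and fill in NA for missing ones. This is a sketch under assumptions: quals_to_row() and qual() are hypothetical helpers, the second record is invented purely to show the missing-value case, and in real use each document would come from entrez_fetch(..., parsed = TRUE) rather than from inline text.

```r
library(XML)

# Hypothetical helper: one-row data frame with one column per wanted tag.
# Repeated qualifiers are joined with "; ", missing ones become NA.
quals_to_row <- function(doc, wanted) {
  values <- vapply(wanted, function(name) {
    xpath <- sprintf(
      "//INSDQualifier[INSDQualifier_name='%s']/INSDQualifier_value", name)
    found <- as.character(xpathSApply(doc, xpath, xmlValue))
    if (length(found) == 0) NA_character_ else paste(found, collapse = "; ")
  }, character(1))
  as.data.frame(as.list(values), stringsAsFactors = FALSE)
}

# Hypothetical helper that builds one <INSDQualifier> element for the samples.
qual <- function(name, value) {
  sprintf(paste0("<INSDQualifier><INSDQualifier_name>%s</INSDQualifier_name>",
                 "<INSDQualifier_value>%s</INSDQualifier_value></INSDQualifier>"),
          name, value)
}

# Stand-ins for downloaded records; the second one is made up for illustration.
records <- list(
  paste0("<INSDFeature_quals>",
         qual("organism", "Alanphillipsia aloeigena"),
         qual("host", "Aloe melanacantha"),
         qual("culture_collection", "CBS:136408"),
         qual("culture_collection", "CPC:21286"),
         "</INSDFeature_quals>"),
  paste0("<INSDFeature_quals>",
         qual("organism", "Example species"),
         "</INSDFeature_quals>")
)

wanted <- c("organism", "culture_collection", "host")
rows <- lapply(records, function(txt) {
  quals_to_row(xmlParse(txt, asText = TRUE), wanted)
})
result <- do.call(rbind, rows)
result
```

Querying per tag keeps every row at the same set of columns, so rbind works even when records differ in which qualifiers they carry.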