XPath in R: selecting values
I have an XML file that looks like this:
<?xml version="1.0"?>
<!DOCTYPE pathway SYSTEM "http://www.kegg.jp/kegg/xml/KGML_v0.7.1_.dtd">
<!-- Creation date: Sep 1, 2014 12:00:13 +0900 (GMT+09:00) -->
<pathway name="path:hsa04010" org="hsa" number="04010"
title="MAPK signaling pathway"
image="http://www.kegg.jp/kegg/pathway/hsa/hsa04010.png"
link="http://www.kegg.jp/kegg-bin/show_pathway?hsa04010">
<entry id="1" name="cpd:C00338" type="compound"
link="http://www.kegg.jp/dbget-bin/www_bget?C00338">
<graphics name="C00338" fgcolor="#000000" bgcolor="#FFFFFF"
type="circle" x="138" y="743" width="8" height="8"/>
</entry>
<entry id="2" name="hsa:5923 hsa:5924" type="gene"
link="http://www.kegg.jp/dbget-bin/www_bget?hsa:5923+hsa:5924">
<graphics name="RASGRF1, CDC25, CDC25L, GNRP, GRF1, GRF55, H-GRF55, PP13187, ras-GRF1..." fgcolor="#000000" bgcolor="#BFFFBF"
type="rectangle" x="392" y="236" width="46" height="17"/>
<relation entry1="47" entry2="40" type="PPrel">
<subtype name="activation" value="-->"/>
</relation>
<relation entry1="46" entry2="40" type="PPrel">
<subtype name="activation" value="-->"/>
</relation>
<relation entry1="45" entry2="40" type="PPrel">
<subtype name="activation" value="-->"/>
</relation>
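What I want to do is:
- Extract the id and name attributes of every entry child that has type="gene", and store them in a list/dictionary/data frame for later use.
- Extract all attributes of the relation children and store them in a similar structure.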
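I have only just started with XML parsing. I have been reading other questions on Stack Overflow and various FAQs on the web, but I can't get this to work. For the first item above, I can do the following to select all the nodes: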
data = xmlTreeParse('~/Downloads/hsa04010.xml')
root = xmlRoot(data)
getNodeSet(root, '/pathway/entry[@type="gene"]')
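... which works fine, but I don't know how to get at the two individual values (or, in the second case, all of the values) and store them somewhere. I tried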
getNodeSet(root, '/pathway/entry[@type="gene"]/@id')
... but that only gives me an error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘saveXML’ for signature ‘"XMLAttributeValue"’
Even if it worked, I would only get the id attribute, not the name I also want. But seeing that I can't seem to get even a single attribute value, well...
You can try
lapply(data['/pathway/entry[@type="gene"]/@id | /pathway/entry[@type="gene"]/*//@name'], as, "character")
# [[1]]
# [1] "2"
#
# [[2]]
# [1] "RASGRF1, CDC25, CDC25L, GNRP, GRF1, GRF55, H-GRF55, PP13187, ras-GRF1..."
#
# [[3]]
# [1] "activation"
#
# [[4]]
# [1] "activation"
#
# [[5]]
# [1] "activation"
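Note that the query above returns the id of the entry together with the name attributes of its descendants (graphics, subtype), not the name of the entry itself. If the entry's own id and name are wanted side by side, one option is to query each attribute separately and combine the results. The following is a minimal sketch, assuming a trimmed inline document in place of the real file; the names gene_xpath and genes are made up for illustration.

```r
library(XML)

# Trimmed stand-in for the KGML file (assumption: the real file has the same layout).
doc <- xmlParse('<pathway name="path:hsa04010">
  <entry id="1" name="cpd:C00338" type="compound"/>
  <entry id="2" name="hsa:5923 hsa:5924" type="gene"/>
</pathway>', asText = TRUE)

# XPath selecting only the gene entries.
gene_xpath <- '/pathway/entry[@type="gene"]'

# xpathSApply() calls xmlGetAttr() on every matching node and
# simplifies the result to a character vector.
genes <- data.frame(
  id   = xpathSApply(doc, gene_xpath, xmlGetAttr, "id"),
  name = xpathSApply(doc, gene_xpath, xmlGetAttr, "name"),
  stringsAsFactors = FALSE
)
genes
```

This relies on at least one entry matching and on every gene entry carrying both attributes, so that the two columns have the same length.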
and
xpathApply(data, '/pathway/entry[@type="gene"]//relation', xmlAttrs)
# [[1]]
# entry1 entry2 type
# "47" "40" "PPrel"
#
# [[2]]
# entry1 entry2 type
# "46" "40" "PPrel"
#
# [[3]]
# entry1 entry2 type
# "45" "40" "PPrel"
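To store these attributes in a data frame rather than a list, the per-node attribute vectors can be bound together row by row. This is only a sketch under the assumption that every relation node carries the same set of attributes; the inline document and the names attr_list and relations are illustrative.

```r
library(XML)

# Trimmed stand-in for the KGML file, keeping two relation nodes.
doc <- xmlParse('<pathway>
  <entry id="2" name="hsa:5923 hsa:5924" type="gene">
    <relation entry1="47" entry2="40" type="PPrel"/>
    <relation entry1="46" entry2="40" type="PPrel"/>
  </entry>
</pathway>', asText = TRUE)

# One named character vector of attributes per relation node.
attr_list <- xpathApply(doc, "//relation", xmlAttrs)

# Stack the vectors into a character matrix, then convert to a data frame.
relations <- as.data.frame(do.call(rbind, attr_list), stringsAsFactors = FALSE)
relations
```

All columns come back as character, so numeric ids such as entry1 would need as.integer() before being used in arithmetic or numeric joins.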
Edit: data is
data <- xmlParse('<?xml version="1.0"?>
<!DOCTYPE pathway SYSTEM "http://www.kegg.jp/kegg/xml/KGML_v0.7.1_.dtd">
<!-- Creation date: Sep 1, 2014 12:00:13 +0900 (GMT+09:00) -->
<pathway name="path:hsa04010" org="hsa" number="04010"
title="MAPK signaling pathway"
image="http://www.kegg.jp/kegg/pathway/hsa/hsa04010.png"
link="http://www.kegg.jp/kegg-bin/show_pathway?hsa04010">
<entry id="1" name="cpd:C00338" type="compound"
link="http://www.kegg.jp/dbget-bin/www_bget?C00338">
<graphics name="C00338" fgcolor="#000000" bgcolor="#FFFFFF"
type="circle" x="138" y="743" width="8" height="8"/>
</entry>
<entry id="2" name="hsa:5923 hsa:5924" type="gene"
link="http://www.kegg.jp/dbget-bin/www_bget?hsa:5923+hsa:5924">
<graphics name="RASGRF1, CDC25, CDC25L, GNRP, GRF1, GRF55, H-GRF55, PP13187, ras-GRF1..." fgcolor="#000000" bgcolor="#BFFFBF"
type="rectangle" x="392" y="236" width="46" height="17"/>
<relation entry1="47" entry2="40" type="PPrel">
<subtype name="activation" value="-->"/>
</relation>
<relation entry1="46" entry2="40" type="PPrel">
<subtype name="activation" value="-->"/>
</relation>
<relation entry1="45" entry2="40" type="PPrel">
<subtype name="activation" value="-->"/>
</relation>
</entry>
</pathway>', asText = TRUE)
The KEGGgraph package has a KGML parser that may help. See the vignette for details.
library(KEGGgraph)
url <- "http://rest.kegg.jp/get/hsa04010/kgml"
x <- parseKGML(url)
You can also parse the URL directly and then use the various XPath queries suggested here, or something like xmlAttrsToDataFrame, which is explained in the new XML for data sciences in R book.
doc <- xmlParse(url)
genes <- XML:::xmlAttrsToDataFrame(doc["//entry[@type='gene']"])
relations <- XML:::xmlAttrsToDataFrame(doc["//relation"])
relations
entry1 entry2 type
1 47 40 PPrel
2 46 40 PPrel
3 45 40 PPrel
4 44 39 PPrel
5 43 38 PPrel
...