从 R xml2 中的单个 XML 节点集中隔离数据
Isolating data from single XML nodeset in R xml2
我正在尝试从 XML 文档中迭代地隔离和操作节点集,但是我在 R 中的 xml2 包中的 xml_find_all() 函数中发现了一个奇怪的行为。有人可以吗帮助我了解应用于节点集的功能范围?
这是一个例子:
library( xml2 )
library( dplyr )
doc <- read_xml( "<MEMBERS>
<CUSTOMER>
<ID>178</ID>
<FIRST.NAME>Alvaro</FIRST.NAME>
<LAST.NAME>Juarez</LAST.NAME>
<ADDRESS>123 Park Ave</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
<CUSTOMER>
<ID>934</ID>
<FIRST.NAME>Janette</FIRST.NAME>
<LAST.NAME>Johnson</LAST.NAME>
<ADDRESS>456 Candy Ln</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
</MEMBERS>" )
doc %>% xml_find_all( '//*') %>% xml_path()
# [1] "/MEMBERS" "/MEMBERS/CUSTOMER[1]"
# [3] "/MEMBERS/CUSTOMER[1]/ID" "/MEMBERS/CUSTOMER[1]/FIRST.NAME"
# [5] "/MEMBERS/CUSTOMER[1]/LAST.NAME" "/MEMBERS/CUSTOMER[1]/ADDRESS"
# [7] "/MEMBERS/CUSTOMER[1]/ZIP" "/MEMBERS/CUSTOMER[2]"
# [9] "/MEMBERS/CUSTOMER[2]/ID" "/MEMBERS/CUSTOMER[2]/FIRST.NAME"
#[11] "/MEMBERS/CUSTOMER[2]/LAST.NAME" "/MEMBERS/CUSTOMER[2]/ADDRESS"
#[13] "/MEMBERS/CUSTOMER[2]/ZIP"
对象 customer.01 是一个仅包含来自该客户的数据的节点集。
kids <- xml_children( doc )
customer.01 <- kids[[1]]
customer.01
# {xml_node}
# <CUSTOMER>
# [1] <ID>178</ID>
# [2] <FIRST.NAME>Alvaro</FIRST.NAME>
# [3] <LAST.NAME>Juarez</LAST.NAME>
# [4] <ADDRESS>123 Park Ave</ADDRESS>
# [5] <ZIP>57701</ZIP>
为什么函数应用于 customer.01 节点集,return customer.02 的 ID 也是如此?
xml_find_all( customer.01, "//MEMBERS/CUSTOMER/ID" )
# {xml_nodeset (2)}
# [1] <ID>178</ID>
# [2] <ID>934</ID>
如何 return 仅来自该节点集的值?
~~~
好的,所以下面的解决方案中有一个小问题,同样与 xml_find_all() 函数的范围有关。它说它可以应用于文档、节点或节点集。然而...
这种情况适用于节点集:
library( xml2 )
url <- "https://s3.amazonaws.com/irs-form-990/201501279349300635_public.xml"
doc <- read_xml( url )
xml_ns_strip( doc )
nd <- xml_find_all( doc, "//LiquidationOfAssetsDetail|//LiquidationDetail" )
nodei <- nd[[1]]
nodei
# {xml_node}
# <LiquidationOfAssetsDetail>
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
# [2] <DistributionDt>2014-11-04</DistributionDt>
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt>
# [4] <EIN>abcdefghi</EIN>
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName>
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ...
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt>
xml_text( xml_find_all( nodei, "AssetsDistriOrExpnssPaidDesc" ) )
# [1] "LAND"
但不是这个:
nodei <- xml_children( nd[[i]] )
nodei
# {xml_nodeset (7)}
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
# [2] <DistributionDt>2014-11-04</DistributionDt>
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt>
# [4] <EIN>abcdefghi</EIN>
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName>
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ...
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt>
xml_text( xml_find_all( nodei, "AssetsDistriOrExpnssPaidDesc" ) )
# character(0)
我猜这是将 xml_find_all() 应用于节点集的所有元素的问题,而不是范围问题?
目前,您正在使用 XPath 的双正斜杠 //
从根目录开始的绝对路径搜索,这意味着在文档中找到 所有 项匹配此路径包括两个客户的 ID。
对于特定节点下的特定子节点,只需使用所选节点下的相对路径:
xml_find_all(customer.01, "ID")
# {xml_nodeset (1)}
# [1] <ID>178</ID>
xml_find_all(customer.01, "FIRST.NAME|LAST.NAME")
# {xml_nodeset (2)}
# [1] <FIRST.NAME>Alvaro</FIRST.NAME>
# [2] <LAST.NAME>Juarez</LAST.NAME>
xml_find_all(customer.01, "*")
# {xml_nodeset (5)}
# [1] <ID>178</ID>
# [2] <FIRST.NAME>Alvaro</FIRST.NAME>
# [3] <LAST.NAME>Juarez</LAST.NAME>
# [4] <ADDRESS>123 Park Ave</ADDRESS>
# [5] <ZIP>57701</ZIP>
我正在尝试从 XML 文档中迭代地隔离和操作节点集,但是我在 R 中的 xml2 包中的 xml_find_all() 函数中发现了一个奇怪的行为。有人可以吗帮助我了解应用于节点集的功能范围?
这是一个例子:
library( xml2 )
library( dplyr )
doc <- read_xml( "<MEMBERS>
<CUSTOMER>
<ID>178</ID>
<FIRST.NAME>Alvaro</FIRST.NAME>
<LAST.NAME>Juarez</LAST.NAME>
<ADDRESS>123 Park Ave</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
<CUSTOMER>
<ID>934</ID>
<FIRST.NAME>Janette</FIRST.NAME>
<LAST.NAME>Johnson</LAST.NAME>
<ADDRESS>456 Candy Ln</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
</MEMBERS>" )
doc %>% xml_find_all( '//*') %>% xml_path()
# [1] "/MEMBERS" "/MEMBERS/CUSTOMER[1]"
# [3] "/MEMBERS/CUSTOMER[1]/ID" "/MEMBERS/CUSTOMER[1]/FIRST.NAME"
# [5] "/MEMBERS/CUSTOMER[1]/LAST.NAME" "/MEMBERS/CUSTOMER[1]/ADDRESS"
# [7] "/MEMBERS/CUSTOMER[1]/ZIP" "/MEMBERS/CUSTOMER[2]"
# [9] "/MEMBERS/CUSTOMER[2]/ID" "/MEMBERS/CUSTOMER[2]/FIRST.NAME"
#[11] "/MEMBERS/CUSTOMER[2]/LAST.NAME" "/MEMBERS/CUSTOMER[2]/ADDRESS"
#[13] "/MEMBERS/CUSTOMER[2]/ZIP"
对象 customer.01 是一个仅包含来自该客户的数据的节点集。
kids <- xml_children( doc )
customer.01 <- kids[[1]]
customer.01
# {xml_node}
# <CUSTOMER>
# [1] <ID>178</ID>
# [2] <FIRST.NAME>Alvaro</FIRST.NAME>
# [3] <LAST.NAME>Juarez</LAST.NAME>
# [4] <ADDRESS>123 Park Ave</ADDRESS>
# [5] <ZIP>57701</ZIP>
为什么函数应用于 customer.01 节点集,return customer.02 的 ID 也是如此?
xml_find_all( customer.01, "//MEMBERS/CUSTOMER/ID" )
# {xml_nodeset (2)}
# [1] <ID>178</ID>
# [2] <ID>934</ID>
如何 return 仅来自该节点集的值?
~~~
好的,所以下面的解决方案中有一个小问题,同样与 xml_find_all() 函数的范围有关。它说它可以应用于文档、节点或节点集。然而...
这种情况适用于节点集:
library( xml2 )
url <- "https://s3.amazonaws.com/irs-form-990/201501279349300635_public.xml"
doc <- read_xml( url )
xml_ns_strip( doc )
nd <- xml_find_all( doc, "//LiquidationOfAssetsDetail|//LiquidationDetail" )
nodei <- nd[[1]]
nodei
# {xml_node}
# <LiquidationOfAssetsDetail>
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
# [2] <DistributionDt>2014-11-04</DistributionDt>
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt>
# [4] <EIN>abcdefghi</EIN>
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName>
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ...
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt>
xml_text( xml_find_all( nodei, "AssetsDistriOrExpnssPaidDesc" ) )
# [1] "LAND"
但不是这个:
nodei <- xml_children( nd[[i]] )
nodei
# {xml_nodeset (7)}
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
# [2] <DistributionDt>2014-11-04</DistributionDt>
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt>
# [4] <EIN>abcdefghi</EIN>
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName>
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ...
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt>
xml_text( xml_find_all( nodei, "AssetsDistriOrExpnssPaidDesc" ) )
# character(0)
我猜这是将 xml_find_all() 应用于节点集的所有元素的问题,而不是范围问题?
目前,您正在使用 XPath 的双正斜杠 //
从根目录开始的绝对路径搜索,这意味着在文档中找到 所有 项匹配此路径包括两个客户的 ID。
对于特定节点下的特定子节点,只需使用所选节点下的相对路径:
xml_find_all(customer.01, "ID")
# {xml_nodeset (1)}
# [1] <ID>178</ID>
xml_find_all(customer.01, "FIRST.NAME|LAST.NAME")
# {xml_nodeset (2)}
# [1] <FIRST.NAME>Alvaro</FIRST.NAME>
# [2] <LAST.NAME>Juarez</LAST.NAME>
xml_find_all(customer.01, "*")
# {xml_nodeset (5)}
# [1] <ID>178</ID>
# [2] <FIRST.NAME>Alvaro</FIRST.NAME>
# [3] <LAST.NAME>Juarez</LAST.NAME>
# [4] <ADDRESS>123 Park Ave</ADDRESS>
# [5] <ZIP>57701</ZIP>