Parse XML with namespaces using R
Below is the XML response I get from SharePoint. I am trying to parse the data and get the details in the following format.
Required output:
title port space datecreat id
test 8080 100.000 2017-04-21 17:29:23 1
apple 8700 108.000 2017-04-21 17:29:23 2
Input received:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetListItemsResponse xmlns="http://schemas.microsoft.com/sharepoint/soap/">
<GetListItemsResult>
<listitems xmlns:s='uuid:SBDSHDSH-DSJHD' xmlns:dt='uuid:CSDSJHA-DGGD' xmlns:rs='urn:schemas-microsoft-com:rowset' xmlns:z='#RowsetSchema'
<rs:data ItemCount="2">
<z:row title="test" port="8080" space='100.000' datecreat='2017-04-21 17:29:23' id='1' />
<z:row title="apple" port="8700" space='108.000' datecreat='2017-04-21 17:29:23' id='2' />
</rs:data>
</listitems>
</GetListItemsResult>
</GetListItemsResponse>
</soap:Body>
</soap:Envelope>
I am new to R and have tried several approaches, none of which worked. The namespaces and the z:row elements are not detected.
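Assuming the text is in Lines, one approach is to grep out the z:row lines, replace the equals signs with spaces, and read the result with read.table. The first line of code reads the rows, including some junk columns; the second drops the junk columns and sets the column names. Note that this works even though the XML is invalid. No packages are used.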
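# Keep only the z:row lines, turn "=" into spaces, and let read.table split on whitespace and quotes
DF <- read.table(text = gsub("=", " ", grep("z:row", Lines, value = TRUE)))
# Odd columns from 3 onward hold the values; even columns of row 1 hold the attribute names
setNames(DF[seq(3, ncol(DF), 2)], unlist(DF[1, seq(2, ncol(DF)-2, 2)]))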
giving:
title port space datecreat id
1 test 8080 100 2017-04-21 17:29:23 1
2 apple 8700 108 2017-04-21 17:29:23 2
Note: The input is assumed to be:
Lines <- c(" <?xml version=\"1.0\" encoding=\"utf-8\"?>", " <soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">",
" <soap:Body>", " <GetListItemsResponse xmlns=\"http://schemas.microsoft.com/sharepoint/soap/\">",
" <GetListItemsResult>", " <listitems xmlns:s='uuid:SBDSHDSH-DSJHD' xmlns:dt='uuid:CSDSJHA-DGGD' xmlns:rs='urn:schemas-microsoft-com:rowset' xmlns:z='#RowsetSchema'",
" <rs:data ItemCount=\"2\">",
" <z:row title=\"test\" port=\"8080\" space='100.000' datecreat='2017-04-21 17:29:23' id='1' />",
" <z:row title=\"apple\" port=\"8700\" space='108.000' datecreat='2017-04-21 17:29:23' id='2' />",
" </rs:data>", " </listitems>",
" </GetListItemsResult>", " </GetListItemsResponse>",
" </soap:Body>", " </soap:Envelope>")
If your input is instead a single long newline-separated string named, say, Lines_n, then first run:
Lines <- readLines(textConnection(Lines_n))
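The read.table approach above depends on the attribute values containing no "=" and on every row having the same attributes. As a rough sketch of a slightly more tolerant base-R variant, the attribute pairs can be pulled out with a regular expression instead. The helper name parse_row and the two sample rows are illustrative assumptions, not part of the original answer; all values come back as character strings.

```r
# Two sample lines, copied from the z:row lines in the question.
rows <- c(
  "<z:row title=\"test\" port=\"8080\" space='100.000' datecreat='2017-04-21 17:29:23' id='1' />",
  "<z:row title=\"apple\" port=\"8700\" space='108.000' datecreat='2017-04-21 17:29:23' id='2' />"
)

# parse_row is a hypothetical helper: it extracts name="value" or name='value'
# pairs from one line and returns them as a named list of strings.
parse_row <- function(line) {
  pattern <- "([A-Za-z_][A-Za-z0-9_.:-]*)=(\"[^\"]*\"|'[^']*')"
  pairs <- regmatches(line, gregexpr(pattern, line))[[1]]
  keys <- sub("=.*$", "", pairs)             # text before the first "="
  vals <- sub("^[^=]*=", "", pairs)          # text after the first "="
  vals <- substr(vals, 2, nchar(vals) - 1)   # strip the surrounding quotes
  setNames(as.list(vals), keys)
}

# One single-row data frame per line, stacked into one result.
DF2 <- do.call(rbind, lapply(rows, function(x) {
  as.data.frame(parse_row(x), stringsAsFactors = FALSE)
}))
DF2
```

With the Lines vector from the note above, rows could instead be obtained with grep("z:row", Lines, value = TRUE).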
Consider registering the z namespace prefix and using the XML package's internal function xmlAttrsToDataFrame, accessed with the triple-colon operator:
library(XML)
txt='<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetListItemsResponse xmlns="http://schemas.microsoft.com/sharepoint/soap/">
<GetListItemsResult>
<listitems xmlns:s=\'uuid:SBDSHDSH-DSJHD\' xmlns:dt=\'uuid:CSDSJHA-DGGD\' xmlns:rs=\'urn:schemas-microsoft-com:rowset\' xmlns:z=\'#RowsetSchema\'>
<rs:data ItemCount="2">
<z:row title="test" port="8080" space=\'100.000\' datecreat=\'2017-04-21 17:29:23\' id=\'1\' />
<z:row title="apple" port="8700" space=\'108.000\' datecreat=\'2017-04-21 17:29:23\' id=\'2\' />
</rs:data>
</listitems>
</GetListItemsResult>
</GetListItemsResponse>
</soap:Body>
</soap:Envelope>'
doc <- xmlParse(txt)
namespaces <- c(z="#RowsetSchema")
df <- XML:::xmlAttrsToDataFrame(getNodeSet(doc, path='//z:row', namespaces))
df
# title port space datecreat id
# 1 test 8080 100.000 2017-04-21 17:29:23 1
# 2 apple 8700 108.000 2017-04-21 17:29:23 2
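Because xmlAttrsToDataFrame is an unexported internal function, it may change between package versions. A minimal sketch that uses only exported functions (xpathSApply and xmlAttrs) is shown below. It assumes every z:row carries the same attributes in the same order, so that the results simplify to a matrix; the trimmed-down document and the name df2 are illustrative only.

```r
library(XML)

# A trimmed-down document with the same rs:data / z:row structure as the response.
txt <- "<listitems xmlns:rs='urn:schemas-microsoft-com:rowset' xmlns:z='#RowsetSchema'>
  <rs:data ItemCount='2'>
    <z:row title='test' port='8080' space='100.000' datecreat='2017-04-21 17:29:23' id='1' />
    <z:row title='apple' port='8700' space='108.000' datecreat='2017-04-21 17:29:23' id='2' />
  </rs:data>
</listitems>"

doc <- xmlParse(txt, asText = TRUE)

# xmlAttrs() returns a named character vector per node; when all nodes share the
# same attributes, xpathSApply() simplifies the results to a character matrix
# with one column per z:row.
attrs <- xpathSApply(doc, "//z:row", xmlAttrs, namespaces = c(z = "#RowsetSchema"))

# Transpose so that each z:row becomes one row of the data frame.
df2 <- as.data.frame(t(attrs), stringsAsFactors = FALSE)
df2
```

If the rows had differing attribute sets, xpathSApply would return a list rather than a matrix, and the rows would need to be combined another way.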
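That is invalid XML: the opening <listitems ...> tag is missing its closing >. While I'm the first to complain about SharePoint, it doesn't produce broken output like that on its own. More likely a coworker hacking on your SharePoint server broke something, though it would be hard to break it this badly.
Anyway, here is a valid version of the XML: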
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetListItemsResponse xmlns="http://schemas.microsoft.com/sharepoint/soap/">
<GetListItemsResult>
<listitems xmlns:s='uuid:SBDSHDSH-DSJHD' xmlns:dt='uuid:CSDSJHA-DGGD' xmlns:rs='urn:schemas-microsoft-com:rowset' xmlns:z='#RowsetSchema'>
<rs:data ItemCount="2">
<z:row title="test" port="8080" space='100.000' datecreat='2017-04-21 17:29:23' id='1' />
<z:row title="apple" port="8700" space='108.000' datecreat='2017-04-21 17:29:23' id='2' />
</rs:data>
</listitems>
</GetListItemsResult>
</GetListItemsResponse>
</soap:Body>
</soap:Envelope>
And it parses and extracts just fine:
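library(xml2)
library(purrr)  # provides map(), map_df() and %>%; map_df() also needs dplyr installed
doc <- read_xml("test.xml")  # the valid XML above, saved to a file
# Rename the default namespace prefix (d1) to "a"; the z prefix stays as declared
ns <- xml_ns_rename(xml_ns(doc), d1 = "a")
xml_find_all(doc, ".//z:row", ns) %>%  # select every z:row element
  map(xml_attrs) %>%                    # one named character vector of attributes per row
  map_df(as.list)                       # bind the rows into a tibble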
## # A tibble: 2 × 5
## title port space datecreat id
## <chr> <chr> <chr> <chr> <chr>
## 1 test 8080 100.000 2017-04-21 17:29:23 1
## 2 apple 8700 108.000 2017-04-21 17:29:23 2
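All of the parser-based results above are character columns. If typed columns are wanted, a minimal base-R sketch of the conversion follows. The data frame is rebuilt by hand here so the example runs on its own, and the UTC time zone is an assumption, since the response does not state one.

```r
# Character data frame matching the parsed output shown above.
df <- data.frame(
  title     = c("test", "apple"),
  port      = c("8080", "8700"),
  space     = c("100.000", "108.000"),
  datecreat = c("2017-04-21 17:29:23", "2017-04-21 17:29:23"),
  id        = c("1", "2"),
  stringsAsFactors = FALSE
)

# Convert each column to the type its contents suggest.
df$port  <- as.integer(df$port)
df$space <- as.numeric(df$space)
df$id    <- as.integer(df$id)
# Time zone is assumed to be UTC for illustration.
df$datecreat <- as.POSIXct(df$datecreat, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

str(df)
```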