Extract XML sqlQuery Issues in R - Querying CLOB Column

I have an Oracle database table named CRS.CRS_FILES with a column named FILE_DATA. That CLOB column holds a large XML string.

FILE_DATA   FILE_CREATION_DATE
<?xml version="1.0" encoding="utf-8"?><REPORT   1/1/2020
<?xml version="1.0" encoding="utf-8"?><REPORT   1/5/2020
<?xml version="1.0" encoding="utf-8"?><REPORT   1/6/2019
<?xml version="1.0" encoding="utf-8"?><REPORT   1/1/2020
<?xml version="1.0" encoding="utf-8"?><REPORT   1/5/2020

Here are the first few lines of one of those XML documents:

<?xml version="1.0" encoding="utf-8" ?>
<REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201">
<CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>-
<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>-
<AGENCYNAME>Milwaukee Police Department</AGENCYNAME>

I have set up the XPath I want to query:

//REPORT/AGENCYIDENTIFIER

query_string2 <- "SELECT
XMLTYPE(t.FILE_DATA).EXTRACT('//REPORT/AGENCYNAME/text()').getClobVal()
FROM CRS.CRS_FILES t"
idtable <- sqlQuery(ch,query_string2, max=10)

query_string2 <- "SELECT
XMLTYPE(t.FILE_DATA).EXTRACT('//REPORT/AGENCYNAME/text()').getStringVal()
FROM CRS.CRS_FILES t"
idtable <- sqlQuery(ch,query_string2, max=10)

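I'm not sure what I'm doing wrong. I know sqlQuery has some minor formatting quirks when it is passed a SQL query, but no matter what I try, my results look like this: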

XMLTYPE(T.FILE_DATA).EXTRACT('//REPORT/AGENCYNAME/TEXT()').GETCLOBVAL()
1   NA
2   NA
3   NA
4   NA
5   NA
6   NA
7   NA
8   NA
9   NA
10  NA

What am I doing wrong? I just want to extract the value Milwaukee Police Department, as shown below. (Of course I will rename the column to something like AGENCYNAME.)

XMLTYPE(T.FILE_DATA).EXTRACT('//REPORT/AGENCYNAME/TEXT()').GETCLOBVAL()
1   Milwaukee Police Department
2   Milwaukee Police Department
3   Milwaukee Police Department
4   Milwaukee Police Department
5   Milwaukee Police Department
6   Milwaukee Police Department
7   Milwaukee Police Department
8   Milwaukee Police Department
9   Milwaukee Police Department
10  Milwaukee Police Department

The EXTRACT (XML) function is deprecated. Use XMLTABLE instead:

SELECT x.agencyname
FROM   CRS.CRS_FILES c
       CROSS JOIN XMLTABLE(
         XMLNAMESPACES(
           'http://www.w3.org/2001/XMLSchema-instance' AS "i",
           DEFAULT 'http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201'
         ),
         '/REPORT'
         PASSING XMLTYPE( c.file_data )
         COLUMNS
           crsreporttimestamp TIMESTAMP     PATH 'CRSREPORTTIMESTAMP',
           agencyidentifier   VARCHAR2(50)  PATH 'AGENCYIDENTIFIER',
           agencyname         VARCHAR2(100) PATH 'AGENCYNAME'
       ) x

Or, in R, it should be the same query with the double quotes escaped:

query_string2 <- "SELECT x.agencyname
FROM   CRS.CRS_FILES c
       CROSS JOIN XMLTABLE(
         XMLNAMESPACES(
           'http://www.w3.org/2001/XMLSchema-instance' AS \"i\",
           DEFAULT 'http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201'
         ),
         '/REPORT'
         PASSING XMLTYPE( c.file_data )
         COLUMNS
           crsreporttimestamp TIMESTAMP     PATH 'CRSREPORTTIMESTAMP',
           agencyidentifier   VARCHAR2(50)  PATH 'AGENCYIDENTIFIER',
           agencyname         VARCHAR2(100) PATH 'AGENCYNAME'
       ) x"

idtable <- sqlQuery(ch,query_string2, max=10)

Which, for your test data:

CREATE TABLE CRS.CRS_FILES ( FILE_DATA CLOB );

INSERT INTO CRS.crs_files VALUES (
'<?xml version="1.0" encoding="utf-8" ?>
<REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201">
  <CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>-
  <AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>-
  <AGENCYNAME>Milwaukee Police Department</AGENCYNAME>
</REPORT>'
)

Outputs:

| AGENCYNAME                  |
| :-------------------------- |
| Milwaukee Police Department |
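
A side note on the escaped double quotes in the R version of the query: the backslashes exist only in the R source, so the string sent to Oracle contains a plain `"i"`. The sketch below checks this on a fragment of the query and, assuming R 4.0.0 or later, shows a raw string literal as an alternative that avoids the escaping. The variable names are illustrative only.

```r
# Fragment of the XMLNAMESPACES clause, written with escaped double quotes.
escaped <- "'http://www.w3.org/2001/XMLSchema-instance' AS \"i\""

# Same fragment as a raw string literal (assumes R >= 4.0.0): no escaping needed.
raw_literal <- r"('http://www.w3.org/2001/XMLSchema-instance' AS "i")"

# Both spellings produce the same characters.
same_text <- identical(escaped, raw_literal)

# cat() prints the string as the database would receive it, without backslashes.
cat(escaped, "\n")
print(same_text)
```

Note that a raw string written as r"(...)" ends at the first `)"`, so SQL text containing that sequence would need a different delimiter such as r"[...]".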

If you really do want to use EXTRACT, then you need to specify the XML namespace:

SELECT XMLTYPE(t.FILE_DATA).EXTRACT(
         '//REPORT/AGENCYNAME/text()',
         'xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201"'
       ).getStringVal() AS agencyname
FROM   CRS.CRS_FILES t

Outputs:

| AGENCYNAME                  |
| :-------------------------- |
| Milwaukee Police Department |


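The problem is the Oracle query itself, not the RODBC::sqlQuery method. Simply put, your XPath does not account for the default namespace declared on the root node. The XMLType extract() function lets you define a temporary prefix for use in the XPath: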

extract(XMLType_instance IN XMLType,
        XPath_string IN VARCHAR2,
        namespace_string IN VARCHAR2 := NULL) RETURN XMLType;

So, once a prefix such as doc is defined, apply it to each step of the XPath:

query_string2 <- "SELECT XMLTYPE(t.FILE_DATA).EXTRACT('//doc:REPORT/doc:AGENCYNAME/text()',
                           'xmlns:doc=\"http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201\"').getStringVal()
                  FROM CRS.CRS_FILES t"

idtable <- sqlQuery(ch,query_string2, max=10)

Online Demo (works with getClobVal and getStringVal)
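
The namespace effect can also be reproduced locally, without a database round trip. The following is a minimal sketch, assuming the xml2 package is installed (it is not used anywhere in the original post). The sample XML is the snippet from the question with the stray "-" characters left out, and `doc` is an arbitrary prefix chosen to match the query above.

```r
library(xml2)  # assumption: xml2 is installed; it is not part of the original post

# Sample document based on the snippet in the question.
xml_input <- '<?xml version="1.0" encoding="utf-8" ?>
<REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201">
  <CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>
  <AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>
  <AGENCYNAME>Milwaukee Police Department</AGENCYNAME>
</REPORT>'

doc <- read_xml(xml_input)

# Un-prefixed XPath: it only matches elements in no namespace,
# but every element here sits in the document's default namespace.
no_ns <- xml_find_first(doc, "//REPORT/AGENCYNAME")
text_without_ns <- xml_text(no_ns)

# xml2 exposes the default namespace under the generated prefix "d1";
# rename it to "doc" so the XPath matches the one used in the Oracle query.
ns <- xml_ns_rename(xml_ns(doc), d1 = "doc")
with_ns <- xml_find_first(doc, "//doc:REPORT/doc:AGENCYNAME", ns = ns)
text_with_ns <- xml_text(with_ns)

print(text_without_ns)
print(text_with_ns)
```

The first lookup finds no node, so its text is NA, which is analogous to the NA column returned by the original query. The second lookup, with the prefix bound to the default namespace URI, returns the agency name.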