Optimizing a slow XQuery query in BaseX
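I have a BaseX XML database that contains a single small XML file. The file basically consists of two structures: PlatformCategory, which has 46 instances, and PlatformGeneralType, which has 213 instances. PlatformGeneralType references PlatformCategory through an href attribute.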
<PlatformGeneralType id="/plib/platformgeneraltypes/pgt1">
  <name>No statement</name>
  <enum>NO_STATEMENT</enum>
  <isOfPlatformCategory href="/plib/platformcategories/pc1"/>
  <readOnly>true</readOnly>
</PlatformGeneralType>
<PlatformCategory id="/plib/platformcategories/pc1">
  <name>No statement</name>
  <enum>NO_STATEMENT</enum>
  <environment>AIR</environment>
  <readOnly>true</readOnly>
</PlatformCategory>
When I run the following query, it takes about six seconds to return a result:
//PlatformGeneralType[isOfPlatformCategory/@href=//PlatformCategory[environment="AIR"]/@id]
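How can I optimize this query?
Note that I have already run "optimize all".
Update: The problem with the previous query seems to be solved. However, when I use the following extended query, it takes 44.28 seconds: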
/PLib/PlatformSpecificTypes/PlatformSpecificType
[isOfPlatformGeneralType/@href=/PLib/PlatformGeneralTypes/PlatformGeneralType
[isOfPlatformCategory/@href=/PLib/PlatformCategories/PlatformCategory
[environment='AIR']/@id]/@id]
PlatformSpecificType has 8939 instances, with the following structure:
<PlatformSpecificTypes>
  <PlatformSpecificType id="/plib/platformspecifictypes/DataShip.3">
    <name>Meko 360H2</name>
    <lethalityLevel>LOW</lethalityLevel>
    <isOfPlatformGeneralType href="/plib/platformgeneraltypes/pgt62"/>
    <ownedByCountry href="/plib/countries/10"/>
  </PlatformSpecificType>
</PlatformSpecificTypes>
Its query info:
Query:
/PLib/PlatformSpecificTypes/PlatformSpecificType[isOfPlatformGeneralType/@href=/PLib/PlatformGeneralTypes/PlatformGeneralType[isOfPlatformCategory/@href=/PLib/PlatformCategories/PlatformCategory[environment='AIR']/@id]/@id]
Result:
- Hit(s): 3642 Items
- Updated: 0 Items
- Printed: 2048 KB
- Read Locking: local [command_plib]
- Write Locking: none
Timing:
- Parsing: 1.25 ms
- Compiling: 0.71 ms
- Evaluating: 44248.94 ms
- Printing: 37.11 ms
- Total Time: 44288.02 ms
Query plan:
Database properties:
Database Properties
Name: command_plib
Size: 20247 KB
Nodes: 781606
Documents: 1
Binaries: 0
Timestamp: 2015-06-12-10-12-14
Resource Properties
Input Path: /home/sceran/Documents/PLIB/command_plib.xml
Input Size: 21354 KB
Timestamp: 2015-06-11-15-34-07
Encoding: UTF-8
CHOP: true
Indexes
Up-to-date: true
TEXTINDEX: true
ATTRINDEX: true
FTINDEX: false
LANGUAGE: English
STEMMING: true
CASESENS: true
DIACRITICS: false
STOPWORDS:
UPDINDEX: false
AUTOOPTIMIZE: false
MAXCATS: 100
MAXLEN: 96
Query info:
Compiling:
- rewriting descendant-or-self step(s)
- rewriting descendant-or-self step(s)
- converting descendant::*:PlatformGeneralType[(*:isOfPlatformCategory/@*:href = root()/descendant::*:PlatformCategory[(*:environment = "AIR")]/@*:id)] to child steps
Query:
//PlatformGeneralType[isOfPlatformCategory/@href=//PlatformCategory[environment="AIR"]/@id]
Optimized Query:
db:open-pre("command_plib",0)/*:PLib/*:PlatformGeneralTypes/*:PlatformGeneralType[(*:isOfPlatformCategory/@*:href = root()/descendant::*:PlatformCategory[(*:environment = "AIR")]/@*:id)]
Result:
- Hit(s): 55 Items
- Updated: 0 Items
- Printed: 12776 Bytes
- Read Locking: local [command_plib]
- Write Locking: none
Timing:
- Parsing: 0.55 ms
- Compiling: 0.3 ms
- Evaluating: 5786.29 ms
- Printing: 1.0 ms
- Total Time: 5788.15 ms
Query plan:
<QueryPlan compiled="true">
<IterPath>
<DBNode name="command_plib" pre="0"/>
<IterStep axis="child" test="*:PLib"/>
<IterStep axis="child" test="*:PlatformGeneralTypes"/>
<IterStep axis="child" test="*:PlatformGeneralType">
<CmpG op="=">
<CachedPath>
<IterStep axis="child" test="*:isOfPlatformCategory"/>
<IterStep axis="attribute" test="*:href"/>
</CachedPath>
<IterPath>
<Root/>
<IterStep axis="descendant" test="*:PlatformCategory">
<CmpG op="=">
<CachedPath>
<IterStep axis="child" test="*:environment"/>
</CachedPath>
<Str value="AIR" type="xs:string"/>
</CmpG>
</IterStep>
<IterStep axis="attribute" test="*:id"/>
</IterPath>
</CmpG>
</IterStep>
</IterPath>
</QueryPlan>
Update 2:
I suspect that the structure of PlatformSpecificTypes prevents indexing. Would query performance improve if I changed it as follows?
<PlatformSpecificTypes>
  <PlatformSpecificType id="/plib/platformspecifictypes/DataShip.3">
    <name>Meko 360H2</name>
    <lethalityLevel>LOW</lethalityLevel>
    <isOfPlatformGeneralType>/plib/platformgeneraltypes/pgt62 </isOfPlatformGeneralType>
    <ownedByCountry href="/plib/countries/10"/>
  </PlatformSpecificType>
</PlatformSpecificTypes>
Update 3:
I have uploaded the XML file in a gist so that you can inspect it.
Now, when I run the following query, it takes about 28 seconds to get the result.
/root/PlSpTys/PlSpTy[isOfPlGeTy/@href=/root/PlGeTys/PlGeTy[isOfPlCt/@href=/root/PlCts/PlCt[environment='AIR']/@id]/@id]
The query info is as follows:
Query:
/root/PlSpTys/PlSpTy[isOfPlGeTy/@href=/root/PlGeTys/PlGeTy[isOfPlCt/@href=/root/PlCts/PlCt[environment='AIR']/@id]/@id]
Result:
- Hit(s): 3642 Items
- Updated: 0 Items
- Printed: 257 KB
- Read Locking: local [Output6]
- Write Locking: none
Timing:
- Parsing: 0.66 ms
- Compiling: 0.34 ms
- Evaluating: 28398.32 ms
- Printing: 4.63 ms
- Total Time: 28403.97 ms
Query plan:
<QueryPlan compiled="true">
<IterPath>
<DBNode name="Output6" pre="0"/>
<IterStep axis="child" test="*:root"/>
<IterStep axis="child" test="*:PlSpTys"/>
<IterStep axis="child" test="*:PlSpTy">
<CmpG op="=">
<CachedPath>
<IterStep axis="child" test="*:isOfPlGeTy"/>
<IterStep axis="attribute" test="*:href"/>
</CachedPath>
<IterPath>
<Root/>
<IterStep axis="child" test="*:root"/>
<IterStep axis="child" test="*:PlGeTys"/>
<IterStep axis="child" test="*:PlGeTy">
<CmpG op="=">
<CachedPath>
<IterStep axis="child" test="*:isOfPlCt"/>
<IterStep axis="attribute" test="*:href"/>
</CachedPath>
<IterPath>
<Root/>
<IterStep axis="child" test="*:root"/>
<IterStep axis="child" test="*:PlCts"/>
<IterStep axis="child" test="*:PlCt">
<CmpG op="=">
<CachedPath>
<IterStep axis="child" test="*:environment"/>
</CachedPath>
<Str value="AIR" type="xs:string"/>
</CmpG>
</IterStep>
<IterStep axis="attribute" test="*:id"/>
</IterPath>
</CmpG>
</IterStep>
<IterStep axis="attribute" test="*:id"/>
</IterPath>
</CmpG>
</IterStep>
</IterPath>
</QueryPlan>
Can you help me reduce the query time?
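BaseX does not seem to realize that it should pre-evaluate the "inner" part, whose result is static, so the evaluation cost is roughly O(n^2) instead of O(n).
Reformatting your query (which takes about 30 seconds on my machine) to understand it better shows that the entire right-hand side of the comparison inside the first predicate is static and does not depend on the PlSpTy element currently being analyzed: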
/root/PlSpTys/PlSpTy[
  isOfPlGeTy/@href=/root/PlGeTys/PlGeTy[
    isOfPlCt/@href=/root/PlCts/PlCt[
      environment='AIR'
    ]/@id
  ]/@id
]
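Evaluating that static right-hand side on its own takes about 9 ms on my machine, which is not much, but it can get expensive if it is run repeatedly. Counting the PlSpTy elements (count(/root/PlSpTys/PlSpTy)) shows close to 8939 of them, so evaluating the inner part once per element would cost about 8939*9ms ~= 80s. Something has clearly been optimized away already, but not everything.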
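As a rough back-of-the-envelope check of that estimate (the symbols N and t_inner are introduced here only for illustration; they are not BaseX terminology), with N the number of PlSpTy elements and t_inner the time for one evaluation of the inner part:

```latex
\[
T_{\text{naive}} \approx N \cdot t_{\text{inner}}
  = 8939 \times 9\,\text{ms}
  = 80\,451\,\text{ms}
  \approx 80\,\text{s}
\]
```

The measured 28 to 30 seconds is a bit over a third of this figure, which fits the observation that part of the repeated work is already being avoided.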
What happens if we simply extract this part of the query and precompute it?
let $compare :=
  /root/PlGeTys/PlGeTy[
    isOfPlCt/@href=/root/PlCts/PlCt[
      environment='AIR'
    ]/@id
  ]/@id
return
  /root/PlSpTys/PlSpTy[
    isOfPlGeTy/@href=$compare
  ]
Execution time drops to 16 ms, a quarter of which is spent actually printing the results. I opened a bug report requesting better optimization. (Update: some optimizations have been applied.)
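The same idea can be applied to each nesting level. The following is a minimal, self-contained sketch rather than a tested fix for the original database: the inline $doc is a hypothetical miniature of the data (element names follow the gist, all ids and names are invented), and the variable names $air-categories and $air-general-types are made up for illustration. Whether an engine actually evaluates each binding only once depends on its optimizer.

```xquery
(: Hypothetical miniature of the data: element names follow the gist,
   ids and names are invented for this sketch. :)
let $doc :=
  <root>
    <PlCts>
      <PlCt id="pc1"><environment>AIR</environment></PlCt>
      <PlCt id="pc2"><environment>SEA</environment></PlCt>
    </PlCts>
    <PlGeTys>
      <PlGeTy id="pgt1"><isOfPlCt href="pc1"/></PlGeTy>
      <PlGeTy id="pgt2"><isOfPlCt href="pc2"/></PlGeTy>
    </PlGeTys>
    <PlSpTys>
      <PlSpTy id="s1"><name>Alpha</name><isOfPlGeTy href="pgt1"/></PlSpTy>
      <PlSpTy id="s2"><name>Bravo</name><isOfPlGeTy href="pgt2"/></PlSpTy>
      <PlSpTy id="s3"><name>Charlie</name><isOfPlGeTy href="pgt1"/></PlSpTy>
    </PlSpTys>
  </root>
(: Innermost level: ids of all categories in the AIR environment. :)
let $air-categories := $doc/PlCts/PlCt[environment = 'AIR']/@id
(: Middle level: ids of the general types that reference those categories. :)
let $air-general-types := $doc/PlGeTys/PlGeTy[isOfPlCt/@href = $air-categories]/@id
(: Outer level: each specific type is compared against the bound id sequence. :)
return $doc/PlSpTys/PlSpTy[isOfPlGeTy/@href = $air-general-types]/name/string()
```

On this sample data the query returns the names Alpha and Charlie. Against the real database, $doc would be replaced by the database root.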
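The latest snapshot of BaseX provides an optimization for your query: if the result of a path expression is requested multiple times, it is now cached. The optimization will be officially available in version 8.2.2.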