Looping issue - UIMA Ruta
Objective:
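Assign heading levels.
The first heading is assigned level 1. I extract its font family and size and find the matching headings. After assigning the level, I unmark those headings, keeping the heading and its features in another annotation (HeadingHierarchy). Once a level is complete, I call the same block again and again for as long as any heading is still annotated as Headinglevel.
Problem:
The script finds all level 1 headings correctly. But when the block is executed through the CALL statement, it finds only the first match for each level (from level 2 onwards). As a result, the total number of levels for the input below becomes 10, whereas it should be 4.
Input (.txt):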
Apache UIMA Ruta Overview =>Arial,18
What is Apache UIMA Ruta? =>Arial,16
Getting started =>Arial,16
UIMA Analysis Engines =>Arial,16
Ruta Engine =>Times New Roman,14
Configuration Parameters =>Arial,10
Annotation Writer =>Times New Roman,14
Configuration Parameters =>Arial,10
Apache UIMA Ruta Language =>Arial,18
Syntax =>Arial,16
Rule elements and their matching order =>Arial,16
Script:
PACKAGE uima.ruta.example;
DECLARE Headinglevel(STRING family, INT size, INT level);
DECLARE HeadingHierarchy(STRING family, INT size, INT level);
DECLARE FontFamily, FontSize;
STRING family;
INT size;
RETAINTYPE(BREAK);
BREAK? #{-PARTOF(Headinglevel)} @SPECIAL+ W+ COMMA NUM{->MARK(Headinglevel,2,6), MARK(HeadingHierarchy,2,6), MARK(FontFamily,4), MARK(FontSize,6)};
RETAINTYPE;
h:Headinglevel{->h.family = family, HeadingHierarchy.family = family}
<-{FontFamily{PARSE(family)};};
h:Headinglevel{->h.size = size, HeadingHierarchy.size = size}
<-{FontSize{PARSE(size)};};
INT i=1;
BLOCK(ForEachHeadLevel)Document{}
{
# h:Headinglevel{-> family = h.family, size = h.size};
h:Headinglevel{AND(h.family == family, h.size == size)-> h.level=i, HeadingHierarchy.level = i, UNMARK(h)};
}
Headinglevel{->i=i+1, CALL(Test2.ForEachHeadLevel)};
Document{->LOG(" LEVELS : " + (i))};
Expected output:
HeadingHierarchy Feature
Apache UIMA... =>Arial,18 level: 1
What is Apa... =>Arial,16 level: 2
Getting sta... =>Arial,16 level: 2
UIMA Analys... =>Arial,16 level: 2
Ruta Engine... =>Times New Roman,14 level: 3
Configurati... =>Arial,10 level: 4
Annotation ... =>Times New Roman,14 level: 3
Configurati... =>Arial,10 level: 4
Apache UIMA... =>Arial,18 level: 1
Syntax =>Ar... =>Arial,16 level: 2
Rule elemen... =>Arial,16 level: 2
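The problem is that CALL restricts the window to the span matched by the rule element. This means the BLOCK is executed only within the current Headinglevel annotation. However, the second rule in the block needs the complete document in order to find the other matching headings.
This is most likely not the best solution, but it is the first one that came to mind.
You can reset the window inside the BLOCK to the complete document, regardless of the restriction imposed by the CALL action, with DOCUMENTBLOCK: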
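To make the effect of the window restriction concrete, here is a rough Python model of the control flow (not Ruta itself). It assumes the block takes the font of the first remaining heading in its window and then labels every remaining heading in that window with the same font. The names `run_block`, `assign_levels`, and `restrict_window` are made up for this illustration.

```python
from typing import List, Optional, Tuple

Heading = Tuple[str, int]  # (font family, font size)

# Fonts of the eleven headings from the question's input, in document order.
HEADINGS: List[Heading] = [
    ("Arial", 18), ("Arial", 16), ("Arial", 16), ("Arial", 16),
    ("Times New Roman", 14), ("Arial", 10), ("Times New Roman", 14),
    ("Arial", 10), ("Arial", 18), ("Arial", 16), ("Arial", 16),
]


def run_block(window: List[int], headings: List[Heading],
              levels: List[Optional[int]], level: int) -> None:
    """Model of the block body: only headings inside `window` are visible."""
    pending = [k for k in window if levels[k] is None]
    if not pending:
        return
    font = headings[pending[0]]  # font of the first remaining heading
    for k in pending:
        if headings[k] == font:
            levels[k] = level    # assign the level and "unmark" the heading


def assign_levels(headings: List[Heading],
                  restrict_window: bool) -> Tuple[List[Optional[int]], int]:
    """Return (level per heading, last level number used)."""
    levels: List[Optional[int]] = [None] * len(headings)
    everything = list(range(len(headings)))
    level = 1
    run_block(everything, headings, levels, level)  # first pass sees the whole document
    for k in everything:
        if levels[k] is None:  # the calling rule matches a remaining heading
            level += 1
            # A restricted call sees only the matched heading; otherwise everything.
            window = [k] if restrict_window else everything
            run_block(window, headings, levels, level)
    return levels, level


if __name__ == "__main__":
    print(assign_levels(HEADINGS, restrict_window=True)[1])   # restricted window
    print(assign_levels(HEADINGS, restrict_window=False)[1])  # full document
```

Under these assumptions the restricted variant ends at level 10 and the full-document variant at level 4, which matches the numbers reported in the question.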
BLOCK (ForEachHeadLevel)Document{}
{
DOCUMENTBLOCK Document{}
{
# h:Headinglevel{-> family = h.family, size = h.size};
h:Headinglevel{AND(h.family == family, h.size == size)-> h.level=i, HeadingHierarchy.level = i, UNMARK(h)};
}
}
DOCUMENTBLOCK is a block extension. You need to include org.apache.uima.ruta.block.DocumentBlockExtension in the additionalExtensions configuration parameter.
Here is another solution using a FOREACH block:
INT i=0;
FOREACH(hl) Headinglevel{}{
hl{IS(Headinglevel)-> i=i+1, family = hl.family, size = hl.size};
h:Headinglevel{h.family == family, h.size == size -> h.level=i, HeadingHierarchy.level = i, UNMARK(h)};
}
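For readers who want to check the intended result without running Ruta, the logic of the FOREACH variant can be sketched in Python. This is an illustration only: it assumes the loop skips headings that were already unmarked and that the second rule sees every heading in the document. `parse_heading` and `foreach_levels` are hypothetical helper names.

```python
from typing import List, Optional, Tuple


def parse_heading(line: str) -> Tuple[str, str, int]:
    """Split 'Title =>Family,Size' into (title, family, size)."""
    title, _, font = line.rpartition("=>")
    family, _, size = font.rpartition(",")
    return title.strip(), family.strip(), int(size)


def foreach_levels(lines: List[str]) -> List[Optional[int]]:
    """Assign one level per distinct font, in order of first appearance."""
    headings = [parse_heading(line) for line in lines]
    levels: List[Optional[int]] = [None] * len(headings)
    i = 0
    for idx, (_, family, size) in enumerate(headings):
        if levels[idx] is not None:
            continue  # mirrors IS(Headinglevel) failing for an unmarked heading
        i += 1
        for k, (_, fam, sz) in enumerate(headings):
            if levels[k] is None and (fam, sz) == (family, size):
                levels[k] = i  # mirrors h.level = i followed by UNMARK(h)
    return levels


SAMPLE = [
    "Apache UIMA Ruta Overview =>Arial,18",
    "What is Apache UIMA Ruta? =>Arial,16",
    "Getting started =>Arial,16",
    "UIMA Analysis Engines =>Arial,16",
    "Ruta Engine =>Times New Roman,14",
    "Configuration Parameters =>Arial,10",
    "Annotation Writer =>Times New Roman,14",
    "Configuration Parameters =>Arial,10",
    "Apache UIMA Ruta Language =>Arial,18",
    "Syntax =>Arial,16",
    "Rule elements and their matching order =>Arial,16",
]

if __name__ == "__main__":
    for line, level in zip(SAMPLE, foreach_levels(SAMPLE)):
        print(f"{line} level: {level}")
```

For the sample input this yields the levels listed in the expected output of the question.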
Disclaimer: I am a developer of UIMA Ruta