Uima Ruta Out of Memory issue in Spark context

Question

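I am running a UIMA application on Apache Spark. UIMA Ruta processes millions of pages in batches. Sometimes I run into an out-of-memory exception. The failure is not consistent: sometimes the job successfully processes 2000 pages, and sometimes it fails after 500 pages.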

Application log

Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.uima.internal.util.IntArrayUtils.expand_size(IntArrayUtils.java:57)
        at org.apache.uima.internal.util.IntArrayUtils.ensure_size(IntArrayUtils.java:39)
        at org.apache.uima.cas.impl.Heap.grow(Heap.java:187)
        at org.apache.uima.cas.impl.Heap.add(Heap.java:241)
        at org.apache.uima.cas.impl.CASImpl.ll_createFS(CASImpl.java:2844)
        at org.apache.uima.cas.impl.CASImpl.createFS(CASImpl.java:489)
        at org.apache.uima.cas.impl.CASImpl.createAnnotation(CASImpl.java:3837)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotations(RuleMatch.java:172)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotationsOf(RuleMatch.java:68)
        at org.apache.uima.ruta.rule.RuleMatch.getLastMatchedAnnotation(RuleMatch.java:73)
        at org.apache.uima.ruta.rule.ComposedRuleElement.mergeDisjunctiveRuleMatches(ComposedRuleElement.java:330)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:213)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)

UIMA Ruta script

WORDLIST EnglishStopWordList = 'stopWords.txt';
WORDLIST FiltersList = 'AnchorFilters.txt';
DECLARE Filters, EnglishStopWords;
DECLARE Anchors, SpanStart,SpanClose;

DocumentAnnotation{-> ADDRETAINTYPE(MARKUP)};

DocumentAnnotation{-> MARKFAST(Filters, FiltersList)};

STRING MixCharacterRegex = "[0-9]+[a-zA-Z]+";

DocumentAnnotation{-> MARKFAST(EnglishStopWords, EnglishStopWordList,true)};
(SW | CW | CAP ) { -> MARK(Anchors, 1, 2)};
Anchors{CONTAINS(EnglishStopWords) -> UNMARK(Anchors)};

(SPECIAL{REGEXP("['\"-=()\[\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\[\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\[\]]")}| PM)? (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\[\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\[\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\[\]]")}| PM)? EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\[\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 3)};

Anchors{CONTAINS(MARKUP) -> UNMARK(Anchors)};
MixCharacterRegex -> Anchors;

"<Value>"  -> SpanStart;
"</Value>" -> SpanClose;

Anchors{-> CREATE(ExtractedData, "type" = "ANCHOR", "value" = Anchors)};

SpanStart Filters? SPACE? ExtractedData SPACE? Filters? SpanClose{-> GATHER(Data, 2, 6, "ExtractedData" = 4)};

Answer 1

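In general, the cause of high memory usage in UIMA Ruta can be found either in RutaBasic (many annotations, coverage information) or in RuleMatch (inefficient rules, many rule element matches).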

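In your example, the problem seems to originate somewhere else. The stack trace indicates that the memory was used up by disjunctive rule elements, which need to create new annotations to store the match information.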

It looks like your version of UIMA Ruta is rather old, since the line numbers do not match the source code I am looking at.

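There are seven (!!!) calls of continueOwnMatch in the stack trace. I looked for rules that could cause something like that, but found none. This could be an old flaw that has been fixed in newer versions, or some preprocessing may have added additional CW/SW/CAP annotations.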

As a first step, I would recommend two things:

Update to UIMA Ruta 2.6.0
Get rid of all disjunctive rule elements

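The disjunctive rule elements are not really required in your script. In general, they should not be used at all unless they are really needed. I do not use them at all in my production rules.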

Instead of (SW | CW | CAP ) you can simply write W.
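
Applied to the first Anchors rule of the script in the question, that replacement could look like the sketch below. It relies on W being the common supertype of SW, CW and CAP in the Ruta basic type system, and it leaves the action exactly as it was written in the question:

W{-> MARK(Anchors, 1, 2)};

The rule then consists of one plain type rule element, so no disjunctive (composed) rule element is involved in the match.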

Instead of (SPECIAL{REGEXP("['\"-=()\[\]]")}| PM) you can write ANY{OR(REGEXP("['\"-=()\[\]]"),IS(PM))}.

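However, using ANY as the matching condition reduces runtime performance. In this example, writing two rules instead of rewriting the rule element is probably the better option, e.g.,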

SPECIAL{REGEXP("['\"-=()\[\]]")} W ANY?{OR(REGEXP("['\"-=()\[\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};
PM W ANY?{OR(REGEXP("['\"-=()\[\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};

(An optional rule element at the beginning of a rule, without any anchor, is not optional.)
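
The same split can be sketched for the shortest rule in the question, the one that starts with the word and marks elements 1 to 3. This is only an illustration of the two suggestions combined, not a tested replacement; it assumes that one rule per leading alternative is acceptable and keeps the regular expression and the action unchanged:

W SPECIAL{REGEXP("['\"-=()\[\]]")} EnglishStopWords? { -> MARK(Anchors, 1, 3)};
W PM EnglishStopWords? { -> MARK(Anchors, 1, 3)};

Neither rule contains a disjunctive rule element or an ANY condition, at the cost of one additional rule.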

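By the way, there is a lot of room for optimization in your rules. If I had to guess, I would say you can get rid of at least half of the rules and 90% of all created annotations, which would also reduce memory usage considerably.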

DISCLAIMER: I am a developer of UIMA Ruta
