How to reverse tokenization after running tokens through name finder?

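After using NameFinderME to find the names in a series of tokens, I would like to reverse the tokenization and rebuild the original text with the modified names. Is there a way to reverse the tokenization in exactly the way it was performed, so that the output has exactly the same structure as the input?

Example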

Hello my name is John. This is another sentence.

Find the sentences.

Hello my name is John.
This is another sentence.

Tokenize the sentences.

> Hello 
> my 
> name 
> is 
> John.
> 
> This 
> is 
> another 
> sentence.

So far, my code that analyzes the tokens above looks like this.

    // Load the name finder model once and reuse it for every sentence.
    TokenNameFinderModel model3 = new TokenNameFinderModel(modelIn3);
    NameFinderME nameFinder = new NameFinderME(model3);

    List<Span[]> spans = new List<Span[]>();
    foreach (string sentence in sentences)
    {
        string[] tokens = tokenizer.tokenize(sentence);

        // Each Span holds token indices (start inclusive, end exclusive) into tokens.
        Span[] nameSpans = nameFinder.find(tokens);
        string[] namedEntities = Span.spansToStrings(nameSpans, tokens);

        // I want to modify each of the named entities found
        // foreach (string s in namedEntities) { modifystring(s); }

        spans.Add(nameSpans);
    }

Desired output, for example with the found names masked.

Hello my name is XXXX. This is another sentence.
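
One way to get output with exactly the input's structure is to skip detokenization altogether and edit the original string by character offsets. OpenNLP's `Tokenizer` has a `tokenizePos(String)` method that returns character-offset `Span`s alongside the plain `tokenize(String)`, and the name finder's spans are token indices, so the two can be combined. The sketch below shows the idea in plain Java without the OpenNLP dependency: `tokenizeWithOffsets` is a hypothetical stand-in for `tokenizePos`, and the name range `{4, 5}` is hard-coded where the name finder's result would normally go.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetMasking {

    // Hypothetical stand-in for a tokenizer that reports character offsets.
    // Each entry is {start, end} with end exclusive. Runs of letters/digits
    // form one token; any other non-whitespace character is its own token.
    static List<int[]> tokenizeWithOffsets(String text) {
        List<int[]> offsets = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            if (Character.isLetterOrDigit(text.charAt(i))) {
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
            } else {
                i++;
            }
            offsets.add(new int[] {start, i});
        }
        return offsets;
    }

    // nameTokenSpans holds {firstToken, endTokenExclusive} pairs, the same
    // convention the name finder uses. They are assumed to be sorted and
    // non-overlapping. Everything outside the spans is left untouched.
    static String mask(String text, List<int[]> tokenOffsets,
                       int[][] nameTokenSpans, String replacement) {
        StringBuilder sb = new StringBuilder(text);
        // Work from right to left so earlier character offsets stay valid.
        for (int k = nameTokenSpans.length - 1; k >= 0; k--) {
            int startChar = tokenOffsets.get(nameTokenSpans[k][0])[0];
            int endChar = tokenOffsets.get(nameTokenSpans[k][1] - 1)[1];
            sb.replace(startChar, endChar, replacement);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Hello my name is John. This is another sentence.";
        List<int[]> offsets = tokenizeWithOffsets(text);
        // Token 4 is "John"; in real use this span would come from the name finder.
        int[][] names = {{4, 5}};
        System.out.println(mask(text, offsets, names, "XXXX"));
        // prints: Hello my name is XXXX. This is another sentence.
    }
}
```

If the text is split into sentences first, the token offsets are relative to each sentence, so the sentence's own start offset in the document would have to be added before replacing in the full text.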

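The documentation links to this post, which describes how to use the detokenizer. I don't understand how the operations array relates to the original tokenization (if at all).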

https://issues.apache.org/jira/browse/OPENNLP-216

Create an instance of SimpleTokenizer.

    String sentence = "He said \"This is a test\".";
    SimpleTokenizer instance = SimpleTokenizer.INSTANCE;

Tokenize the sentence using the tokenize(String str) method of SimpleTokenizer.

    String[] tokens = instance.tokenize(sentence);

The operations array must contain the same number of operations as the tokens array has tokens, i.e. the two array lengths must be equal. Store the operation name N times (tokens.length times) in the operations array.

    Operation[] operations = new Operation[tokens.length];
    String oper = "MOVE_RIGHT"; // see the list of operations in the linked issue
    for (int i = 0; i < tokens.length; i++) {
        operations[i] = Operation.parse(oper);
    }
    System.out.println(operations.length);

Here the operations array length will be equal to the tokens array length.

Now create an instance of DetokenizationDictionary by passing the tokens and operations arrays to the constructor.

    DetokenizationDictionary detokenizeDict = new DetokenizationDictionary(tokens, operations);

Pass the DetokenizationDictionary instance to the DictionaryDetokenizer class to detokenize the tokens.

    DictionaryDetokenizer dictDetokenize = new DictionaryDetokenizer(detokenizeDict);

DictionaryDetokenizer.detokenize requires two parameters: (a) the tokens array and (b) the split marker.

    String st = dictDetokenize.detokenize(tokens, " ");

Output:

Use the Detokenizer.

String text = detokenize(myTokens, null);
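
On the question of how the operations array relates to the original tokenization: as far as I can tell it does not record anything about the original text. The two arrays passed to `DetokenizationDictionary` appear to act as a lookup table from a token string to a rule describing how that token attaches to its neighbours. The plain-Java sketch below imitates that idea under this assumption. The `Op` enum, the `detokenize` helper, and the rule table are hypothetical illustrations, not OpenNLP's implementation.

```java
import java.util.HashMap;
import java.util.Map;

final class RuleDetokenizerSketch {

    // Hypothetical rule names modelled on the operation names in the linked issue.
    enum Op { MOVE_LEFT, MOVE_RIGHT, RIGHT_LEFT_MATCHING }

    // Joins tokens with single spaces, except where a rule attaches a token
    // to its left or right neighbour. Tokens without a rule get normal spacing.
    static String detokenize(String[] tokens, Map<String, Op> rules) {
        StringBuilder out = new StringBuilder();
        Map<String, Boolean> isOpen = new HashMap<>(); // state for paired tokens such as quotes
        boolean glueToPrevious = true;                 // no space before the first token
        for (String token : tokens) {
            Op op = rules.get(token);
            boolean attachLeft = op == Op.MOVE_LEFT;
            boolean attachRight = op == Op.MOVE_RIGHT;
            if (op == Op.RIGHT_LEFT_MATCHING) {
                // First occurrence attaches to the right, the matching one to the left.
                boolean open = isOpen.getOrDefault(token, false);
                attachLeft = open;
                attachRight = !open;
                isOpen.put(token, !open);
            }
            if (!glueToPrevious && !attachLeft) {
                out.append(' ');
            }
            out.append(token);
            glueToPrevious = attachRight;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Op> rules = new HashMap<>();
        rules.put(".", Op.MOVE_LEFT);
        rules.put("\"", Op.RIGHT_LEFT_MATCHING);
        String[] tokens = {"He", "said", "\"", "This", "is", "a", "test", "\"", "."};
        System.out.println(detokenize(tokens, rules));
        // prints: He said "This is a test".
    }
}
```

Because the rules only look at token strings, this kind of detokenization is a heuristic: it reproduces conventional spacing, but it cannot restore double spaces, line breaks, or other whitespace that the tokenizer discarded.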