Integrating a call to a jar file, cut, awk, and a Java program into one unified process

I am currently running a fairly complicated data preprocessing operation, namely:

cat large_file.txt \
  | ./reverb -q \
  | cut --fields=16,17,18 \
  | awk -F'\t' -v q="'" 'function quote(token) { gsub(q, "\\" q, token); return q token q } { print quote($2) "(" quote($3) ", " quote($1) ")." }' >> output.txt

As you can see, this is quite convoluted: first cat, then ./reverb, then cut, and finally awk.

Next I want to pass the output to a Java program, namely:

public static void main(String[] args) throws IOException 
{
    Ontology ontology = new Ontology();
    BufferedReader br = new BufferedReader(new FileReader("/home/matthias/Workbench/SUTD/2_January/Prolog/horn_data_test.pl"));
    // Matches facts of the form 'verb'('object','subject').
    Pattern p = Pattern.compile("'(.*?)'\\('(.*?)','(.*?)'\\)\\.");
    String line;
    while ((line = br.readLine()) != null) 
    {
        Matcher m = p.matcher(line);
        if( m.matches() ) 
        {
            String verb    = m.group(1);
            String object  = m.group(2);
            String subject = m.group(3);
            ontology.addSentence( new Sentence( verb, object, subject ) );
        }
    }

    for( String joint: ontology.getJoints() )
    {
        for( Integer subind: ontology.getSubjectIndices( joint ) )
        {
            Sentence xaS = ontology.getSentence( subind );
            for( Integer obind: ontology.getObjectIndices( joint ) )
            {
                Sentence yOb = ontology.getSentence( obind );
                Sentence s = new Sentence( xaS.getVerb(),
                                           xaS.getObject(),
                                           yOb.getSubject() );
                System.out.println( s );
            }
        }
    }
}   

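What is the best way to combine this process into one coherent operation? Ideally, I would just specify the input file and the output file and run it once. As it stands, the whole process is very messy.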

Maybe I could put all of these calls into a bash script? Is that feasible?
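One possible way to make the Java step fit into a single pipe is to have it read facts from standard input instead of a hardcoded file path. The following is only a minimal sketch under that assumption: `FactReader`, `parseFact`, and `readFacts` are hypothetical names, not part of the original program, and the pattern tolerates optional whitespace after the comma because the awk command prints ", " between the arguments.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads 'verb'('object','subject'). facts from a stream.
class FactReader {
    // Optional whitespace after the comma is allowed (assumption, see lead-in).
    private static final Pattern FACT =
            Pattern.compile("'(.*?)'\\('(.*?)',\\s*'(.*?)'\\)\\.");

    /** Returns {verb, object, subject}, or null if the line is not a fact. */
    static String[] parseFact(String line) {
        Matcher m = FACT.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    /** Parses every line of the reader and keeps only well-formed facts. */
    static List<String[]> readFacts(BufferedReader reader) {
        List<String[]> facts = new ArrayList<>();
        reader.lines().forEach(line -> {
            String[] fact = parseFact(line.trim());
            if (fact != null) {
                facts.add(fact);
            }
        });
        return facts;
    }

    public static void main(String[] args) {
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        for (String[] fact : readFacts(stdin)) {
            // In the real program this is where ontology.addSentence(...) would go.
            System.out.println(fact[0] + " | " + fact[1] + " | " + fact[2]);
        }
    }
}
```

With a change along these lines, the Java program could be appended as the last stage of the existing pipe, and the whole command could live in one script that takes the input and output paths as arguments.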

The input initially consists of English sentences, one per line, namely:

Oranges are delicious and contain vitamin c.
Brilliant scientists learned that we can prevent scurvy by imbibing vitamin c.
Colorless green ideas sleep furiously.
...

Preprocessing makes it look like this:

'contain'('vitamin c','oranges').
'prevent'('scurvy','vitamin c').
'sleep'('furiously','ideas').
...

The Java program is meant to learn "rules" by inference, so if the processed data yields 'contain'('vitamin c','oranges'). and 'prevent'('scurvy','vitamin c')., the Java code will emit 'prevent'('scurvy','oranges').
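The inference described here can be read as a join: whenever the subject of one fact is also the object of another, a new fact is derived from the first fact's verb and object and the second fact's subject. The sketch below illustrates that reading only. The `Ontology` and `Sentence` classes are not shown in the question, so `JointInference`, `Fact`, and `infer` are hypothetical stand-ins rather than the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Ontology/Sentence classes from the question.
class JointInference {
    static final class Fact {
        final String verb;
        final String object;
        final String subject;

        Fact(String verb, String object, String subject) {
            this.verb = verb;
            this.object = object;
            this.subject = subject;
        }

        @Override
        public String toString() {
            return "'" + verb + "'('" + object + "','" + subject + "').";
        }
    }

    /** Derives verb(object, a.subject) for every fact whose subject is a's object. */
    static List<Fact> infer(List<Fact> facts) {
        // Index the facts by their object term so the join is a map lookup.
        Map<String, List<Fact>> byObject = new HashMap<>();
        for (Fact f : facts) {
            byObject.computeIfAbsent(f.object, k -> new ArrayList<>()).add(f);
        }
        List<Fact> derived = new ArrayList<>();
        for (Fact b : facts) {
            // b's subject is the shared ("joint") term.
            for (Fact a : byObject.getOrDefault(b.subject, Collections.emptyList())) {
                derived.add(new Fact(b.verb, b.object, a.subject));
            }
        }
        return derived;
    }

    public static void main(String[] args) {
        List<Fact> facts = Arrays.asList(
                new Fact("contain", "vitamin c", "oranges"),
                new Fact("prevent", "scurvy", "vitamin c"),
                new Fact("sleep", "furiously", "ideas"));
        for (Fact f : infer(facts)) {
            System.out.println(f); // prints 'prevent'('scurvy','oranges').
        }
    }
}
```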

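I looked at the ReVerb source code, and I think it would be quite easy to adapt it to produce the output you want. If you look at the ReVerb class CommandLineReverb.java, it has the following two methods: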

private void extractFromSentReader(ChunkedSentenceReader reader)
        throws ExtractorException {
    long start;

    ChunkedSentenceIterator sentenceIt = reader.iterator();

    while (sentenceIt.hasNext()) {
        // get the next chunked sentence
        ChunkedSentence sent = sentenceIt.next();
        chunkTime += sentenceIt.getLastComputeTime();

        numSents++;

        // make the extractions
        start = System.nanoTime();
        Iterable<ChunkedBinaryExtraction> extractions = extractor
                .extract(sent);
        extractTime += System.nanoTime() - start;

        for (ChunkedBinaryExtraction extr : extractions) {
            numExtrs++;

            // run the confidence function
            start = System.nanoTime();
            double conf = getConf(extr);
            confTime += System.nanoTime() - start;

            NormalizedBinaryExtraction extrNorm = normalizer
                    .normalize(extr);
            printExtr(extrNorm, conf);
        }
        if (numSents % messageEvery == 0)
            summary();
    }
}

private void printExtr(NormalizedBinaryExtraction extr, double conf) {
    String arg1 = extr.getArgument1().toString();
    String rel = extr.getRelation().toString();
    String arg2 = extr.getArgument2().toString();

    ChunkedSentence sent = extr.getSentence();
    String toks = sent.getTokensAsString();
    String pos = sent.getPosTagsAsString();
    String chunks = sent.getChunkTagsAsString();
    String arg1Norm = extr.getArgument1Norm().toString();
    String relNorm = extr.getRelationNorm().toString();
    String arg2Norm = extr.getArgument2Norm().toString();

    Range arg1Range = extr.getArgument1().getRange();
    Range relRange = extr.getRelation().getRange();
    Range arg2Range = extr.getArgument2().getRange();
    String a1s = String.valueOf(arg1Range.getStart());
    String a1e = String.valueOf(arg1Range.getEnd());
    String rs = String.valueOf(relRange.getStart());
    String re = String.valueOf(relRange.getEnd());
    String a2s = String.valueOf(arg2Range.getStart());
    String a2e = String.valueOf(arg2Range.getEnd());

    String row = Joiner.on("\t").join(
            new String[] { currentFile, String.valueOf(numSents), arg1,
                    rel, arg2, a1s, a1e, rs, re, a2s, a2e,
                    String.valueOf(conf), toks, pos, chunks, arg1Norm,
                    relNorm, arg2Norm });

    System.out.println(row);
}

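The first method iterates over the sentences and performs the extraction for each one. It then calls the second method, which prints tab-separated values to the output stream. I'd guess all you need to do is implement your own version of the second method, printExtr().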
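As a rough sketch of what such a replacement might compute, the helper below builds the Prolog-style fact that the cut and awk stages currently produce. It takes plain strings so that it runs without ReVerb on the classpath; inside printExtr() the arguments would presumably come from extr.getRelationNorm(), extr.getArgument2Norm(), and extr.getArgument1Norm(). The class name `PrologFactFormatter` and its methods are hypothetical, and it omits the space after the comma so that the result matches the sample output and the pattern in the Java program.

```java
// Hypothetical helper: formats one extraction as 'rel'('arg2','arg1').
class PrologFactFormatter {
    /** Wraps a token in single quotes, escaping embedded quotes as \' (as the awk script does). */
    static String quote(String token) {
        return "'" + token.replace("'", "\\'") + "'";
    }

    /** Builds the fact in the order relation, second argument, first argument. */
    static String format(String rel, String arg2, String arg1) {
        return quote(rel) + "(" + quote(arg2) + "," + quote(arg1) + ").";
    }

    public static void main(String[] args) {
        // prints 'contain'('vitamin c','oranges').
        System.out.println(format("contain", "vitamin c", "oranges"));
    }
}
```

A custom printExtr() could then reduce to a single System.out.println call on the formatted string, which would make the cut and awk stages unnecessary.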