Integrate a call to a jar file | cut | awk and a Java program into one unified process
I am currently running a fairly convoluted data preprocessing operation, namely:
cat large_file.txt | ./reverb -q | cut --fields=16,17,18 | awk -F'\t' -v q="'" 'function quote(token) { gsub(q, "\\" q, token); return q token q } { print quote($2) "(" quote($3) "," quote($1) ")." }' >> output.txt
As you can see, this is rather involved: first cat, then ./reverb, then cut, and finally awk.
Next I want to pass the output to a Java program, namely:
public static void main(String[] args) throws IOException
{
    Ontology ontology = new Ontology();
    BufferedReader br = new BufferedReader(new FileReader("/home/matthias/Workbench/SUTD/2_January/Prolog/horn_data_test.pl"));
    // Matches facts of the form 'verb'('object','subject').
    Pattern p = Pattern.compile("'(.*?)'\\('(.*?)','(.*?)'\\)\\.");
    String line;
    while ((line = br.readLine()) != null)
    {
        Matcher m = p.matcher(line);
        if( m.matches() )
        {
            String verb = m.group(1);
            String object = m.group(2);
            String subject = m.group(3);
            ontology.addSentence( new Sentence( verb, object, subject ) );
        }
    }
    for( String joint: ontology.getJoints() )
    {
        for( Integer subind: ontology.getSubjectIndices( joint ) )
        {
            Sentence xaS = ontology.getSentence( subind );
            for( Integer obind: ontology.getObjectIndices( joint ) )
            {
                Sentence yOb = ontology.getSentence( obind );
                Sentence s = new Sentence( xaS.getVerb(),
                                           xaS.getObject(),
                                           yOb.getSubject() );
                System.out.println( s );
            }
        }
    }
}
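What is the best way to combine this process into one coherent operation? Ideally, I would like to specify just an input file and an output file and run everything once. As it stands, the whole process is quite messy.
Maybe I could put all of these calls into a bash script? Is that feasible?
The input initially contains English sentences, one per line, like this: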
Oranges are delicious and contain vitamin c.
Brilliant scientists learned that we can prevent scurvy by imbibing vitamin c.
Colorless green ideas sleep furiously.
...
The preprocessing turns it into this:
'contain'('vitamin c','oranges').
'prevent'('scurvy','vitamin c').
'sleep'('furiously','ideas').
...
The Java program is meant to learn "rules" by inference: if the processed data contains both 'contain'('vitamin c','oranges'). and 'prevent'('scurvy','vitamin c'). then the Java code emits 'prevent'('scurvy','oranges').
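The inference rule described above can be sketched in a few lines. This is only a minimal illustration, not the asker's actual Ontology implementation, which the post does not show: the Fact class and the infer method are hypothetical names, and the sketch assumes a "joint" is any term that occurs as the subject of one fact and the object of another.

```java
import java.util.ArrayList;
import java.util.List;

public class JoinSketch {
    // Hypothetical stand-in for the Sentence class used in the question.
    public static final class Fact {
        final String verb;
        final String object;
        final String subject;

        public Fact(String verb, String object, String subject) {
            this.verb = verb;
            this.object = object;
            this.subject = subject;
        }

        @Override
        public String toString() {
            return "'" + verb + "'('" + object + "','" + subject + "').";
        }
    }

    // For every pair (x, y) where x's subject equals y's object,
    // derive a new fact x.verb(x.object, y.subject).
    public static List<Fact> infer(List<Fact> facts) {
        List<Fact> derived = new ArrayList<>();
        for (Fact x : facts) {
            for (Fact y : facts) {
                if (x.subject.equals(y.object)) {
                    derived.add(new Fact(x.verb, x.object, y.subject));
                }
            }
        }
        return derived;
    }

    public static void main(String[] args) {
        List<Fact> facts = List.of(
            new Fact("contain", "vitamin c", "oranges"),
            new Fact("prevent", "scurvy", "vitamin c"),
            new Fact("sleep", "furiously", "ideas"));
        // Prints: 'prevent'('scurvy','oranges').
        infer(facts).forEach(System.out::println);
    }
}
```

The nested loop is quadratic in the number of facts. For a large input, an index from term to fact positions, which is presumably what getSubjectIndices and getObjectIndices provide, would avoid scanning every pair.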
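I looked at the ReVerb source code, and I think it would be quite easy to adapt it to produce the output you want. If you look at the ReVerb class CommandLineReverb.java, it has the following two methods: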
private void extractFromSentReader(ChunkedSentenceReader reader)
        throws ExtractorException {
    long start;
    ChunkedSentenceIterator sentenceIt = reader.iterator();
    while (sentenceIt.hasNext()) {
        // get the next chunked sentence
        ChunkedSentence sent = sentenceIt.next();
        chunkTime += sentenceIt.getLastComputeTime();
        numSents++;
        // make the extractions
        start = System.nanoTime();
        Iterable<ChunkedBinaryExtraction> extractions = extractor
                .extract(sent);
        extractTime += System.nanoTime() - start;
        for (ChunkedBinaryExtraction extr : extractions) {
            numExtrs++;
            // run the confidence function
            start = System.nanoTime();
            double conf = getConf(extr);
            confTime += System.nanoTime() - start;
            NormalizedBinaryExtraction extrNorm = normalizer
                    .normalize(extr);
            printExtr(extrNorm, conf);
        }
        if (numSents % messageEvery == 0)
            summary();
    }
}

private void printExtr(NormalizedBinaryExtraction extr, double conf) {
    String arg1 = extr.getArgument1().toString();
    String rel = extr.getRelation().toString();
    String arg2 = extr.getArgument2().toString();
    ChunkedSentence sent = extr.getSentence();
    String toks = sent.getTokensAsString();
    String pos = sent.getPosTagsAsString();
    String chunks = sent.getChunkTagsAsString();
    String arg1Norm = extr.getArgument1Norm().toString();
    String relNorm = extr.getRelationNorm().toString();
    String arg2Norm = extr.getArgument2Norm().toString();
    Range arg1Range = extr.getArgument1().getRange();
    Range relRange = extr.getRelation().getRange();
    Range arg2Range = extr.getArgument2().getRange();
    String a1s = String.valueOf(arg1Range.getStart());
    String a1e = String.valueOf(arg1Range.getEnd());
    String rs = String.valueOf(relRange.getStart());
    String re = String.valueOf(relRange.getEnd());
    String a2s = String.valueOf(arg2Range.getStart());
    String a2e = String.valueOf(arg2Range.getEnd());
    String row = Joiner.on("\t").join(
            new String[] { currentFile, String.valueOf(numSents), arg1,
                    rel, arg2, a1s, a1e, rs, re, a2s, a2e,
                    String.valueOf(conf), toks, pos, chunks, arg1Norm,
                    relNorm, arg2Norm });
    System.out.println(row);
}
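The first method processes the input sentence by sentence and performs the extraction. It then calls the second method, which prints the tab-separated values to the output stream. I guess all you have to do is implement your own version of the second method, printExtr().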
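As a rough idea of what such a replacement could print, here is a minimal sketch of the formatting step on its own. It is an illustration under assumptions, not ReVerb code: PrologFactFormatter, quote and toFact are hypothetical names, the argument order (relation, argument 2, argument 1) is inferred from the sample facts in the question, and the escaping simply mirrors what the awk quote() function does.

```java
public class PrologFactFormatter {
    // Wrap a token in single quotes, backslash-escaping any embedded
    // single quote (the same escaping the awk quote() function applies).
    public static String quote(String token) {
        return "'" + token.replace("'", "\\'") + "'";
    }

    // Build one fact line of the form 'relation'('argument2','argument1').
    public static String toFact(String relNorm, String arg2Norm, String arg1Norm) {
        return quote(relNorm) + "(" + quote(arg2Norm) + "," + quote(arg1Norm) + ").";
    }

    public static void main(String[] args) {
        // Prints: 'contain'('vitamin c','oranges').
        System.out.println(toFact("contain", "vitamin c", "oranges"));
    }
}
```

A modified printExtr() could then presumably pass relNorm, arg2Norm and arg1Norm to toFact and print the result instead of the tab-separated row, which would make the cut and awk stages unnecessary.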