How to extract noun phrases from the parsed text

I have parsed a text with a constituency parser and copied the result into a text file like this:

(ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP we)) (VP (VBD went) (PP (TO to)....
(ROOT (FRAG (SBAR (SBAR (IN While) (S (NP (PRP I)) (VP (VBD was) (NP (NP (EX...
(ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP I)) (VP (VBD went) (PP (TO to.....
(ROOT (FRAG (SBAR (SBAR (IN While) (S (NP (NNP Jim)) (VP (VBD was) (NP (NP (....
(ROOT (S (S (NP (PRP I)) (VP (VBD started) (S (VP (VBG talking) (PP.....

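I need to extract all noun phrases (NP) from this text file. I wrote the following code, which extracts only the first NP from each line, but I need all of them. My code is: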

import java.io.*;
import java.util.ArrayList;

public class nounPhrase {

    public static int findClosingParen(char[] text, int openPos) {
        int closePos = openPos;
        int counter = 1;
        while (counter > 0) {
            char c = text[++closePos];
            if (c == '(') {

                counter++;
            }
            else if (c == ')') {
                counter--;
            }
        }
        return closePos;
    }

     public static void main(String[] args) throws IOException {

        ArrayList npList = new ArrayList ();
        String line;
        String line1;
        int np;

        String Input = "/local/Input/Temp/Temp.txt";

        String Output = "/local/Output/Temp/Temp-out.txt";  

        FileInputStream  fis = new FileInputStream (Input);
        BufferedReader br = new BufferedReader(new InputStreamReader(fis,"UTF-8"
        ));
        while ((line = br.readLine())!= null){
        char[] lineArray = line.toCharArray();
        np = findClosingParen (lineArray, line.indexOf("(NP"));
        line1 = line.substring(line.indexOf("(NP"),np+1);
        System.out.print(line1+"\n");
        }
    }
}

The output is:

(NP (NN Yesterday))...I need other NPs in this line also
(NP (PRP I)).....I need other NPs in this line also
(NP (NNP Jim)).....I need other NPs in this line also
(NP (PRP I)).....I need other NPs in this line also

My code takes only the first NP on each line, together with its closing parenthesis, but I need to extract all NPs from the text.

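You are building a parser (for the output generated by your natural-language parser), which is a topic with extensive academic literature. The simplest parser you can build is an LL parser. Take a look at this Wikipedia article; it has some good examples to get you started: http://en.wikipedia.org/wiki/LL_parser

The Wikipedia entry on parsing gives an overview of the field in general: http://en.wikipedia.org/wiki/Parsing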
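
For the bracketed trees in the question, an LL-style (recursive-descent) parser can be quite small. The following is only a sketch under stated assumptions, not code from the question: the class `BracketParser` and its methods are made-up names, the grammar is assumed to be `node := '(' LABEL (node | WORD)* ')'`, and each input string is assumed to be one complete, balanced tree (the lines shown in the question are truncated, so the real file would need the full trees).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a tiny recursive-descent parser for bracketed trees.
// Assumed grammar:  node := '(' LABEL (node | WORD)* ')'
class BracketParser {
    private final String text;
    private int pos = 0;
    private final List<String> nounPhrases = new ArrayList<>();

    private BracketParser(String text) {
        this.text = text;
    }

    // Returns every subtree labelled exactly "NP", nested ones included,
    // ordered by the position of their opening parenthesis.
    static List<String> extractNounPhrases(String tree) {
        BracketParser parser = new BracketParser(tree);
        parser.skipSpaces();
        parser.parseNode();
        return parser.nounPhrases;
    }

    private void parseNode() {
        int start = pos;
        expect('(');
        String label = readToken();
        int slot = -1;
        if (label.equals("NP")) {
            slot = nounPhrases.size();
            nounPhrases.add(null); // placeholder, filled in once the NP is closed
        }
        skipSpaces();
        while (pos < text.length() && text.charAt(pos) != ')') {
            if (text.charAt(pos) == '(') {
                parseNode();   // child subtree
            } else {
                readToken();   // leaf word
            }
            skipSpaces();
        }
        expect(')');
        if (slot >= 0) {
            nounPhrases.set(slot, text.substring(start, pos));
        }
    }

    // Reads characters up to the next whitespace or parenthesis.
    private String readToken() {
        int start = pos;
        while (pos < text.length()
                && !Character.isWhitespace(text.charAt(pos))
                && text.charAt(pos) != '(' && text.charAt(pos) != ')') {
            pos++;
        }
        return text.substring(start, pos);
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }

    private void expect(char c) {
        if (pos >= text.length() || text.charAt(pos) != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        String tree = "(ROOT (S (NP (NP (DT the) (NN cat)) (PP (IN on) (NP (DT the) (NN mat)))) (VP (VBD slept))))";
        for (String np : extractNounPhrases(tree)) {
            System.out.println(np);
        }
    }
}
```

Because the parser tracks the tree structure itself, a truncated or unbalanced line fails with an exception instead of silently producing a cut-off phrase.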

Here you go. I changed it around a bit and it got messy, but I can clean it up if you really need pretty code.

import java.io.*;
import java.util.*;

public class nounPhrase {
    public static void main(String[] args)throws IOException{

        ArrayList<String> npList = new ArrayList<String>();
        String line = "";
        String line1 = "";

        String Input = "/local/Input/Temp/Temp.txt";
        String Output = "/local/Output/Temp/Temp-out.txt";

        FileInputStream  fis = new FileInputStream (Input);
        BufferedReader br = new BufferedReader(new InputStreamReader(fis,"UTF-8"));

        while ((line = br.readLine()) != null){
            char[] lineArray = line.toCharArray();
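            for (int i = 0; i + 2 < lineArray.length; i++) {
                if (lineArray[i] == '(' && lineArray[i + 1] == 'N' && lineArray[i + 2] == 'P') {
                    // Find the parenthesis that closes this NP by tracking nesting depth,
                    // so NPs that contain other phrases are captured whole.
                    int depth = 0;
                    int end = i;
                    for (; end < lineArray.length; end++) {
                        if (lineArray[end] == '(') {
                            depth++;
                        } else if (lineArray[end] == ')' && --depth == 0) {
                            break;
                        }
                    }
                    if (end == lineArray.length) {
                        break; // unbalanced line: no closing parenthesis for this NP
                    }
                    line1 = line.substring(i, end + 1);
                    npList.add(line1);
                    i = end; // continue scanning after this NP
                }
            }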
            npList.add("*");
        }

        for (int i=0; i<npList.size(); i++){
            if(!(npList.get(i).equals("*"))){
                System.out.print(npList.get(i));
                if(i<npList.size()-1 && npList.get(i+1).equals("*")){
                    System.out.println();
                }
            }
        }
    }
} 

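And FYI, the main reason your code picks up only the first occurrence of NP is that you use the indexOf method to find its position. Called with just the search string, indexOf always and only returns the first occurrence of the String you are searching for.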
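
To illustrate the point about indexOf: the standard `String.indexOf(String, int)` overload takes a start index, so repeated calls can walk through every occurrence. This is a minimal sketch; the class `IndexOfDemo` and the helper `findAllOccurrences` are hypothetical names, and the sample line is a shortened, completed version of the first line in the question.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfDemo {
    // Hypothetical helper: start positions of every occurrence of needle in text.
    static List<Integer> findAllOccurrences(String text, String needle) {
        List<Integer> positions = new ArrayList<>();
        int from = 0;
        int hit;
        // indexOf(needle, from) searches from 'from' onward and returns -1 when nothing is left.
        while ((hit = text.indexOf(needle, from)) != -1) {
            positions.add(hit);
            from = hit + 1; // step past this hit so the next call finds a later one
        }
        return positions;
    }

    public static void main(String[] args) {
        String line = "(ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP we)) (VP (VBD went))))";
        System.out.println(findAllOccurrences(line, "(NP "));
    }
}
```

Each position returned here could then be passed to `findClosingParen` from the question to cut out the matching phrase.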

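After getting the first NP phrase, you have to keep iterating over the parse tree and update the index of the noun phrase. A simple way is to take a substring of your line variable, starting at index np+1, and search again. Here are the changes you can make to your code: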

while ((line = br.readLine())!= null){
        char[] lineArray = line.toCharArray();
        int indexOfNP = line.indexOf("(NP");
        while(indexOfNP!=-1) {
            np = findClosingParen(lineArray, indexOfNP);
            line1 = line.substring(indexOfNP, np + 1);
            System.out.print(line1 + "\n");
            npList.add(line1);
            line = line.substring(np+1);
            indexOfNP = line.indexOf("(NP");
            lineArray = line.toCharArray();
        }
}

For a recursive solution:

public static void main(String[] args) throws IOException {

    ArrayList<String> npList = new ArrayList<String>();
    String line;
    String Input = "/local/Input/Temp/Temp.txt";
    String Output = "/local/Output/Temp/Temp-out.txt";

    FileInputStream fis = new FileInputStream (Input);
    BufferedReader br = new BufferedReader(new InputStreamReader(fis,"UTF-8"));
    while ((line = br.readLine())!= null){
        int indexOfNP = line.indexOf("(NP");
        if(indexOfNP>=0)
            extractNPs(npList,line,indexOfNP);
    }

    for(String npString:npList){
        System.out.println(npString);
    }

    br.close();
    fis.close();

}

public static ArrayList<String> extractNPs(ArrayList<String> arr,String  
                                                   parse, int indexOfNP){
    if(indexOfNP==-1){
        return arr;
    }
    else{
        int npIndex = findClosingParen(parse.toCharArray(), indexOfNP);
        String mainNP = new String(parse.substring(indexOfNP, npIndex + 1));
        arr.add(mainNP);
        // Uncomment the lines below if you also want the NPs nested inside
        // mainNP to be extracted, in addition to mainNP itself.
        /*
        String inner = mainNP.substring(3);
        if (inner.indexOf("(NP") > 0) {
            // No return here, so the rest of the line is still processed below.
            extractNPs(arr, inner, inner.indexOf("(NP"));
        }
        */
        parse = new String(parse.substring(npIndex+1));
        indexOfNP = parse.indexOf("(NP");
        return extractNPs(arr,parse,indexOfNP);
    }
}
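
If nested NPs are wanted as well, an alternative to recursion is a single pass that keeps a stack of open-parenthesis positions. This is a sketch rather than part of the answer above: the class `NestedNpExtractor` and the method `extractAll` are hypothetical names, and it assumes that any literal parentheses in the sentence were already escaped by the parser, so every `(` and `)` in the line is tree structure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class NestedNpExtractor {
    // Hypothetical helper: collects every "(NP ...)" subtree, nested ones included.
    // Phrases are added when their closing parenthesis is reached, so an inner NP
    // appears in the result before the NP that contains it.
    static List<String> extractAll(String parse) {
        List<String> result = new ArrayList<>();
        Deque<Integer> openPositions = new ArrayDeque<>();
        for (int i = 0; i < parse.length(); i++) {
            char c = parse.charAt(i);
            if (c == '(') {
                openPositions.push(i);               // remember where this subtree starts
            } else if (c == ')' && !openPositions.isEmpty()) {
                int start = openPositions.pop();     // matching opening parenthesis
                if (parse.startsWith("(NP ", start)) {
                    result.add(parse.substring(start, i + 1));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String parse = "(S (NP (NP (DT the) (NN cat)) (PP (IN on) (NP (DT the) (NN mat)))) (VP (VBD slept)))";
        for (String np : extractAll(parse)) {
            System.out.println(np);
        }
    }
}
```

The stack makes the pass linear in the length of the line, whereas repeatedly calling `findClosingParen` on substrings rescans parts of the text.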

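While writing your own tree parser is a good exercise (!), if you just want the results, the simplest way is to use more of the functionality in the Stanford NLP tools, namely Tregex, which is designed for exactly this kind of thing. You can change your final while loop to something like this: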

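// Needs: import edu.stanford.nlp.trees.Tree;
//        import edu.stanford.nlp.trees.tregex.TregexMatcher;
//        import edu.stanford.nlp.trees.tregex.TregexPattern;
TregexPattern tPattern = TregexPattern.compile("NP");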
while ((line = br.readLine()) != null) {
    Tree t = Tree.valueOf(line);
    TregexMatcher tMatcher = tPattern.matcher(t);
    while (tMatcher.find()) {
      System.out.println(tMatcher.getMatch());
    }
}