Do some HPC clusters cache only one result when running Stanford CoreNLP?

Question

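I am using the Stanford CoreNLP library for a Java project. I created a class called StanfordNLP, instantiated two different objects, and initialized each one through the constructor with a different string. I am using the POS tagger to get adjective-noun sequences. However, the program's output only shows the results from the first object. Every StanfordNLP object is initialized with a different string, yet each object returns the same results as the first one. I am new to Java, so I can't tell whether the problem is in my code or in the HPC cluster it runs on.

I tried using a getter instead of returning the list of strings from the StanfordNLP class method. I also tried setting the first StanfordNLP object to null, so that it no longer references anything, before creating the other objects. Nothing worked.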

/* in main */
List<String> pos_tokens0 = new ArrayList<String>();
List<String> pos_tokens1 = new ArrayList<String>();

String text0 = "Mary little lamb white fleece like snow";
StanfordNLP snlp0 = new StanfordNLP(text0);
pos_tokens0 = snlp0.process();

String text1 = "Everywhere little Mary went fluffy lamb ate green grass";
StanfordNLP snlp1 = new StanfordNLP(text1);
pos_tokens1 = snlp1.process();


/* in StanfordNLP.java */
public class StanfordNLP {

    private static List<String> pos_adjnouns = new ArrayList<String>();
    private String documentText = "";

    public StanfordNLP() {}
    public StanfordNLP(String text) { this.documentText = text; }

    public List<String> process() {     
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, depparse");
        props.setProperty("coref.algorithm", "neural");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);    
        Annotation document = new Annotation(documentText);
        pipeline.annotate(document);

        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        List<String[]> corpus_temp = new ArrayList<String[]>();
        int count = 0;
    
        for(CoreMap sentence: sentences) {
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                String[] data = new String[2];
                String word = token.get(TextAnnotation.class);
                String pos = token.get(PartOfSpeechAnnotation.class);
                count ++;

                data[0] = word;
                data[1] = pos;         
                corpus_temp.add(data);
            }           
        }
    
        String[][] corpus = corpus_temp.toArray(new String[count][2]);
    
        // corpus contains string arrays with a word and its part-of-speech.
        for (int i=0; i<(corpus.length-3); i++) { 
            String word = corpus[i][0];
            String pos = corpus[i][1];
            String word2 = corpus[i+1][0];
            String pos2 = corpus[i+1][1];

            // find adjectives and nouns (eg, "fast car")
            if (pos.equals("JJ")) {         
                if (pos2.equals("NN") || pos2.equals("NNP") || pos2.equals("NNPS")) {
                    word = word + " " + word2;
                    pos_adjnouns.add(word);
                }
            }
        }
        return pos_adjnouns;
    }
}

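The expected output for pos_tokens0 is "little lamb, white fleece". The expected output for pos_tokens1 is "little Mary, fluffy lamb, green grass". But the actual output for both variables is "little lamb, white fleece".

Any idea why this is happening? I ran a simple Java jar file with a main.java and a myclass.java on the HPC server and could not reproduce the problem. So the HPC server does not seem to have a problem with multiple objects of the same class.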

Answer 1

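The problem looks to be simply that your pos_adjnouns variable is static, and so it is shared between all instances of StanfordNLP. Try removing the static keyword and see whether things then work as you expect.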
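To see the effect in isolation, here is a minimal sketch with no CoreNLP dependency. The class names SharedCollector and InstanceCollector are made up for illustration and simply store their input text; only the static keyword differs between them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the question's class: the list is static,
// so there is exactly one list for the whole program.
class SharedCollector {
    private static final List<String> results = new ArrayList<>();
    private final String text;

    SharedCollector(String text) { this.text = text; }

    List<String> process() {
        results.add(text);   // appends to the single shared list
        return results;      // every caller receives the same list object
    }
}

// Same class without static: each object owns its own list.
class InstanceCollector {
    private final List<String> results = new ArrayList<>();
    private final String text;

    InstanceCollector(String text) { this.text = text; }

    List<String> process() {
        results.add(text);
        return results;
    }
}

class StaticFieldDemo {
    public static void main(String[] args) {
        List<String> a = new SharedCollector("first").process();
        List<String> b = new SharedCollector("second").process();
        System.out.println(a == b);  // true: both variables point at the one static list
        System.out.println(b);       // [first, second]

        List<String> c = new InstanceCollector("first").process();
        List<String> d = new InstanceCollector("second").process();
        System.out.println(c);       // [first]
        System.out.println(d);       // [second]
    }
}
```

Nothing here depends on the machine the code runs on, which is consistent with the cluster not being the cause.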

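But that still isn't quite right, because you would then have an instance variable, and with multiple calls to process() on the same object, results would keep being added to the pos_adjnouns list. Two other things you should do are: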

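Make pos_adjnouns a local variable inside the process() method.
Conversely, initializing a StanfordCoreNLP pipeline is expensive, so you should move that out of the process() method and do it in the class constructor. It would probably be better to have things exactly reversed: the constructor initializes the pipeline, and the process() method takes the String to analyze.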
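The two suggestions above could be combined roughly as follows. This is only a sketch of the shape, not tested CoreNLP code: to stay self-contained, the tagging step is a hypothetical Function passed to the constructor, where the real class would build its StanfordCoreNLP pipeline once. AdjNounExtractor, AdjNounDemo, and toyTagger are invented names, and the POS tags in the demo are written by hand rather than produced by the tagger. Unlike the loop in the question, which stops at corpus.length-3, this one checks every adjacent pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class AdjNounExtractor {
    // Stand-in for the expensive resource. In the real class this would be a
    // StanfordCoreNLP pipeline created once in the constructor (assumption).
    private final Function<String, List<String[]>> tagger;

    AdjNounExtractor(Function<String, List<String[]>> tagger) {
        this.tagger = tagger;
    }

    // Takes the text to analyze; the result list is local, so calls never share state.
    List<String> process(String text) {
        List<String[]> tagged = tagger.apply(text);   // each entry: {word, POS tag}
        List<String> adjNouns = new ArrayList<>();
        for (int i = 0; i + 1 < tagged.size(); i++) {
            String pos = tagged.get(i)[1];
            String nextPos = tagged.get(i + 1)[1];
            boolean nounFollows = nextPos.equals("NN") || nextPos.equals("NNP") || nextPos.equals("NNPS");
            if (pos.equals("JJ") && nounFollows) {
                adjNouns.add(tagged.get(i)[0] + " " + tagged.get(i + 1)[0]);
            }
        }
        return adjNouns;
    }
}

class AdjNounDemo {
    // Toy tagger for the demo only: expects pre-tagged tokens such as "little/JJ".
    // Input with untagged tokens is not handled.
    static List<String[]> toyTagger(String text) {
        List<String[]> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            out.add(token.split("/", 2));
        }
        return out;
    }

    public static void main(String[] args) {
        AdjNounExtractor extractor = new AdjNounExtractor(AdjNounDemo::toyTagger);
        System.out.println(extractor.process("little/JJ lamb/NN white/JJ fleece/NN"));
        System.out.println(extractor.process("little/JJ Mary/NNP went/VBD green/JJ grass/NN"));
    }
}
```

With this shape, one extractor object can be reused for any number of strings, and the costly setup happens only once.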

nlp

hpc

machine-learning

stanford-nlp