Manual tagging of words using Stanford CoreNLP
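I have a resource in which I know the type of each word exactly. I have to lemmatize the words, but to get correct results I have to tag them manually. I couldn't find any code for tagging words manually. I'm using the following code, but it returns the wrong result: it gives "painting" for "painting", where I expect "paint".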
//...........lemmatization starts........................
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
String text = "painting";
Annotation document = pipeline.process(text);
List<edu.stanford.nlp.util.CoreMap> sentences = document.get(SentencesAnnotation.class);
for (edu.stanford.nlp.util.CoreMap sentence : sentences)
{
    for (CoreLabel token : sentence.get(TokensAnnotation.class))
    {
        String word = token.get(TextAnnotation.class);
        String lemma = token.get(LemmaAnnotation.class);
        System.out.println("lemmatized version :" + lemma);
    }
}
//...........lemmatization ends.........................
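I have to run lemmatization on individual words, not on sentences where POS tagging is done automatically. So I would first tag the words manually and then find their lemmas. Some sample code or a pointer to a relevant site would be very helpful.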
If you know the POS tags in advance, you can get the lemmas as follows:
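import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.process.Morphology;
import edu.stanford.nlp.util.CoreMap;

// Only tokenize and split sentences; the "pos" and "lemma" annotators are
// left out because the tags are supplied by hand.
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
String text = "painting";
Morphology morphology = new Morphology();
Annotation document = pipeline.process(text);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        String word = token.get(TextAnnotation.class);
        String tag = ... //get the tag for the current word from somewhere, e.g. an array
        // Lemmatize using the manually supplied tag
        String lemma = morphology.lemma(word, tag);
        System.out.println("lemmatized version :" + lemma);
    }
}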
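If you only want the lemma of a single word, you don't even need to run CoreNLP for tokenization and sentence splitting; you can call the lemma function directly, like this: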
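// Requires: import edu.stanford.nlp.process.Morphology;
String tag = "VBG"; // Penn Treebank tag: verb, gerund or present participle
String word = "painting";
Morphology morphology = new Morphology();
String lemma = morphology.lemma(word, tag);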
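The loop in the answer leaves the tag lookup elided ("from somewhere, e.g. an array"). One way to wire that up is with parallel arrays of words and hand-assigned tags, sketched below. This is only an illustration: the class `ManualTagLemmas`, the method `lemmatizeAll`, and the stand-in `toyLemma` are hypothetical names, not part of CoreNLP, and `toyLemma` is a deliberately naive placeholder so the sketch runs without the CoreNLP jar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

public class ManualTagLemmas {

    // Pairs each word with its manually assigned tag (parallel arrays) and
    // passes both to the supplied lemmatizer function.
    public static List<String> lemmatizeAll(String[] words, String[] tags,
                                            BinaryOperator<String> lemmatizer) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> lemmas = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            lemmas.add(lemmatizer.apply(words[i], tags[i]));
        }
        return lemmas;
    }

    // Hypothetical toy stand-in for a real lemmatizer (NOT CoreNLP's algorithm):
    // strips a trailing "ing" only when the tag is VBG, otherwise returns the word.
    public static String toyLemma(String word, String tag) {
        if (tag.equals("VBG") && word.endsWith("ing") && word.length() > 3) {
            return word.substring(0, word.length() - 3);
        }
        return word;
    }

    public static void main(String[] args) {
        String[] words = {"painting", "painting"};
        String[] tags = {"VBG", "NN"}; // same word, tagged once as a verb and once as a noun
        List<String> lemmas = lemmatizeAll(words, tags, ManualTagLemmas::toyLemma);
        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i] + "/" + tags[i] + " -> " + lemmas.get(i));
        }
    }
}
```

With CoreNLP on the classpath, the toy function would presumably be swapped for the `morphology.lemma(word, tag)` call shown in the answer; the point of the sketch is only that the tag for each word comes from your own data rather than from the `pos` annotator.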