How to read data from a text file in Java to extract data using StanfordNLP rather than reading text from a simple String
I tried using
Annotation document = new Annotation("this is a simple string");
and I also tried
CoreDocument coreDocument = new CoreDocument(text);
stanfordCoreNLP.annotate(coreDocument);
but I could not work out how to read the text from a text file instead.
Use it as follows (see the example given here):
// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// read some text from the file (Files.asCharSource here is Guava's com.google.common.io.Files, an external dependency)
File inputFile = new File("src/test/resources/sample-content.txt");
String text = Files.asCharSource(inputFile, Charset.forName("UTF-8")).read();
// create an empty Annotation just with the given text
Annotation document = new Annotation(text);
// run all Annotators on this text
pipeline.annotate(document);
// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    // traversing the words in the current sentence
    // a CoreLabel is a CoreMap with additional token-specific methods
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token
        String ne = token.get(NamedEntityTagAnnotation.class);
        System.out.println("word: " + word + " pos: " + pos + " ne: " + ne);
    }
}
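If you prefer the newer CoreDocument wrapper that you tried in the question, the same file-reading step works unchanged; only the annotation and iteration calls differ. A minimal sketch, assuming the text and pipeline variables from the snippet above:
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
...
// wrap the file content in a CoreDocument and run the pipeline on it
CoreDocument coreDocument = new CoreDocument(text);
pipeline.annotate(coreDocument);
// CoreSentence and CoreLabel expose convenience accessors for the word, POS tag and NER label
for (CoreSentence sentence : coreDocument.sentences()) {
    for (CoreLabel token : sentence.tokens()) {
        System.out.println("word: " + token.word() + " pos: " + token.tag() + " ne: " + token.ner());
    }
}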
UPDATE
Alternatively, to read the file content you can use the approach below, which relies only on Java's built-in packages, so no external package is needed. Depending on the characters in your text file, you can choose an appropriate Charset. As described here, "ISO-8859-1 is an all-inclusive charset, in the sense that it's guaranteed not to throw MalformedInputException". That Charset is used below.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
...
Path path = Paths.get("sample-content.txt");
String text = "";
try {
    text = Files.readString(path, StandardCharsets.ISO_8859_1); // or StandardCharsets.UTF_8, depending on the file's encoding
} catch (IOException e) {
    e.printStackTrace();
}
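The string read this way can then be fed to the pipeline exactly as before; a minimal sketch, assuming the pipeline object configured in the first snippet:
Annotation document = new Annotation(text);
pipeline.annotate(document);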