Why does categorize take a String[] in Apache OpenNLP 1.8 instead of a String?

Question

Why must the input to myCategorizer.categorize() be a String[] in Apache OpenNLP 1.8, rather than a String as in Apache OpenNLP 1.5?

I want to classify a single string, not an array.

 public void trainModel() 
    {
        InputStream dataIn = null;
        try 
        {
            dataIn = new FileInputStream("D:/training.txt");
            ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
            ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
            // Specifies the minimum number of times a feature must be seen
            int cutoff = 2;
            int trainingIterations = 30;
            model = DocumentCategorizerME.train("NL", sampleStream, cutoff, trainingIterations);
        } 

        catch (IOException e) 
        {
            e.printStackTrace();
        } 

        finally 
        {
            if (dataIn != null) 
            {
                try 
                {
                    dataIn.close();
                } 
                catch (IOException e) 
                {
                    e.printStackTrace();
                }
            }
        }
    }


public void classifyNewTweet(String tweet) 
{
    DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model);
    double[] outcomes = myCategorizer.categorize(tweet);
    String category = myCategorizer.getBestCategory(outcomes);

    if (category.equalsIgnoreCase("1")) 
    {
        System.out.println("The tweet is positive :) ");
    } 
    else 
    {
        System.out.println("The tweet is negative :( ");
    }
}

Answer 1

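Back in the OpenNLP 1.5 days, the first thing DocumentCategorizer did was tokenize your string into words. That may seem convenient at first, but you might prefer a maximum-entropy tokenizer over the default WhitespaceTokenizer, and the tokenizer can have a large effect on classification. Changing the API so that users pick their own tokenizer addresses that problem. Just add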

Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
...
// Tokenize first, then pass the tokens (not the raw string) to the categorizer
String[] tokens = tokenizer.tokenize(tweet);
double[] outcomes = myCategorizer.categorize(tokens);
...

That should solve your problem. You can also use a statistical tokenizer (see TokenizerME) or SimpleTokenizer.
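If you would rather keep a single-string call in your own code, one option is a small adapter that tokenizes and then delegates. The following is only a sketch: SingleStringCategorizer and its methods are hypothetical names, not part of OpenNLP, and the plain whitespace split is a stand-in for whichever tokenizer you actually choose. The categorizer is passed in as a function so the snippet compiles without OpenNLP on the classpath.

```java
import java.util.function.Function;

// Hypothetical helper, not an OpenNLP class: adapts a String input
// to any categorizer that expects an array of tokens.
final class SingleStringCategorizer {

    private SingleStringCategorizer() {}

    // Splits on runs of whitespace. This is an assumption standing in for
    // a real tokenizer; swap in your own tokenization as needed.
    static String[] whitespaceTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Tokenizes the text, then hands the tokens to the supplied categorizer.
    static double[] categorize(String text, Function<String[], double[]> categorizer) {
        return categorizer.apply(whitespaceTokens(text));
    }
}
```

With OpenNLP available, the call in classifyNewTweet would presumably become something like SingleStringCategorizer.categorize(tweet, tokens -> myCategorizer.categorize(tokens)), assuming the String[] overload described above.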

java

document

opennlp