Can not identify text in Spanish with Lingpipe

A few days ago I was working on a Java server that stores a bunch of data and identifies its language, so I decided to use LingPipe for the task. But I ran into a problem: after training the code and evaluating it on two languages (English and Spanish), I cannot identify Spanish text, although I get successful results with English and French.

The tutorial I followed to complete this task is: http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html

These are the steps I followed to train the language classifier:

1. First, I put the English and Spanish metadata into a folder named leipzig and unpacked it, as shown below (note: the metadata and sentences are provided by http://wortschatz.uni-leipzig.de/en/download):

leipzig       //Main folder
   1M sentences             //Folder with data of the last trial 
     eng_news_2015_1M
     eng_news_2015_1M.tar.gz
     spa-hn_web_2015_1M
     spa-hn_web_2015_1M.tar.gz
   ClassifyLang.java                //Custom program to try the trained code
   dist                                        //Folder
     eng_news_2015_300K.tar.gz              //unpackaged english sentences
     spa-hn_web_2015_300K.tar.gz            //unpackaged spanish sentences
   EvalLanguageId.java
   langid-leipzig.classifier            //trained code
   lingpipe-4.1.2.jar
   munged                                      //Folder
     eng                    //folder containing the sentences.txt for english
        sentences.txt
     spa                    //folder containing the sentences.txt for spanish
        sentences.txt
   Munge.java
   TrainLanguageId.java
   unpacked                                    //Folder
     eng_news_2015_300K         //Folder with the english metadata 
        eng_news_2015_300K-co_n.txt
        eng_news_2015_300K-co_s.txt
        eng_news_2015_300K-import.sql
        eng_news_2015_300K-inv_so.txt
        eng_news_2015_300K-inv_w.txt
        eng_news_2015_300K-sources.txt
        eng_news_2015_300K-words.txt
        sentences.txt
     spa-hn_web_2015_300K                   //Folder with the spanish metadata 
        sentences.txt
        spa-hn_web_2015_300K-co_n.txt
        spa-hn_web_2015_300K-co_s.txt
        spa-hn_web_2015_300K-import.sql
        spa-hn_web_2015_300K-inv_so.txt
        spa-hn_web_2015_300K-inv_w.txt
        spa-hn_web_2015_300K-sources.txt
        spa-hn_web_2015_300K-words.txt

2. Second, I unpacked the compressed language metadata into the unpacked folder:

unpacked                                    //Folder
    eng_news_2015_300K          //Folder with the english metadata 
        eng_news_2015_300K-co_n.txt
        eng_news_2015_300K-co_s.txt
        eng_news_2015_300K-import.sql
        eng_news_2015_300K-inv_so.txt
        eng_news_2015_300K-inv_w.txt
        eng_news_2015_300K-sources.txt
        eng_news_2015_300K-words.txt
        sentences.txt
    spa-hn_web_2015_300K                    //Folder with the spanish metadata 
        sentences.txt
        spa-hn_web_2015_300K-co_n.txt
        spa-hn_web_2015_300K-co_s.txt
        spa-hn_web_2015_300K-import.sql
        spa-hn_web_2015_300K-inv_so.txt
        spa-hn_web_2015_300K-inv_w.txt
        spa-hn_web_2015_300K-sources.txt
        spa-hn_web_2015_300K-words.txt

3. Then I munged the sentences of each one to remove the line numbers and tabs and to replace line breaks with a single space character. The output is written uniformly in the UTF-8 Unicode encoding (note: Munge.java from the LingPipe website).

/-----------------Command line----------------------------------------------/

javac -cp lingpipe-4.1.2.jar: Munge.java
java -cp lingpipe-4.1.2.jar: Munge /home/samuel/leipzig/unpacked /home/samuel/leipzig/munged
----------------------------------------Results-----------------------------
spa
reading from=/home/samuel/leipzig/unpacked/spa-hn_web_2015_300K/sentences.txt charset=iso-8859-1
writing to=/home/samuel/leipzig/munged/spa/spa.txt charset=utf-8
total length=43267166

eng
reading from=/home/samuel/leipzig/unpacked/eng_news_2015_300K/sentences.txt charset=iso-8859-1
writing to=/home/samuel/leipzig/munged/eng/eng.txt charset=utf-8
total length=35847257

/---------------------------------------------------------------/

<---------------------------------Folder------------------------------------->
   munged                                      //Folder
    eng                     //folder containing the sentences.txt for english
        sentences.txt
    spa                 //folder containing the sentences.txt for spanish
        sentences.txt
<-------------------------------------------------------------------------->
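The transformation in this step is simple enough to sketch without LingPipe. The following is a rough, self-contained illustration (class and method names are mine, not the tutorial's) of what Munge.java does to each sentences.txt:

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy sketch of the munging step: each input line of sentences.txt looks
// like "42\tSome sentence.", and the output is the sentences joined by a
// single space, with the line numbers and tabs stripped. The real
// Munge.java additionally re-encodes the output as UTF-8.
class MungeSketch {

    // Strips the leading line number and tab from one sentences.txt line.
    static String stripLineNumber(String line) {
        int tab = line.indexOf('\t');
        return tab >= 0 ? line.substring(tab + 1) : line;
    }

    // Joins the cleaned lines with single spaces, replacing line breaks.
    static String munge(List<String> lines) {
        return lines.stream()
                .map(MungeSketch::stripLineNumber)
                .collect(Collectors.joining(" "));
    }
}
```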

4. Next, we start training the languages (note: TrainLanguageId.java from the LingPipe LanguageId tutorial):

/---------------Command line--------------------------------------------/

javac -cp lingpipe-4.1.2.jar: TrainLanguageId.java
java -cp lingpipe-4.1.2.jar: TrainLanguageId /home/samuel/leipzig/munged /home/samuel/leipzig/langid-leipzig.classifier 100000 5
-----------------------------------Results-----------------------------------
nGram=100000 numChars=5
Training category=eng
Training category=spa

Compiling model to file=/home/samuel/leipzig/langid-leipzig.classifier

/----------------------------------------------------------------------------/
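For reference, the core of the tutorial's TrainLanguageId.java boils down to something like the following sketch (paths and constants are illustrative; it assumes lingpipe-4.1.2.jar on the classpath and the munged/ layout above):

```java
import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Files;

import java.io.File;
import java.io.IOException;

// Condensed training sketch: one character n-gram language model per
// category (one per subfolder of munged/), compiled to a model file.
class TrainSketch {
    public static void main(String[] args) throws IOException {
        File dataDir = new File("/home/samuel/leipzig/munged");
        File modelFile = new File("/home/samuel/leipzig/langid-leipzig.classifier");
        int nGram = 5;          // character n-gram order
        int numChars = 100000;  // training characters per category

        String[] categories = dataDir.list();
        DynamicLMClassifier<NGramProcessLM> classifier
            = DynamicLMClassifier.createNGramProcess(categories, nGram);

        for (String category : categories) {
            StringBuilder sb = new StringBuilder();
            for (File f : new File(dataDir, category).listFiles())
                sb.append(Files.readFromFile(f, "UTF-8"));
            // Train on at most numChars characters of this category.
            String text = sb.substring(0, Math.min(numChars, sb.length()));
            classifier.handle(new Classified<CharSequence>(text, new Classification(category)));
        }
        AbstractExternalizable.compileTo(classifier, modelFile);
    }
}
```

One thing worth double-checking: the output above prints nGram=100000 numChars=5, while the tutorial trains with a small n-gram order (e.g. 5) and a large character budget (e.g. 100000). If those two command-line arguments were swapped, only five characters per category would be used for training, which by itself could produce degenerate results.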

5. We evaluated our trained code with the following results, which show a problem in the confusion matrix (note: EvalLanguageId.java from the LingPipe LanguageId tutorial):

/------------------------Command line---------------------------------/

javac -cp lingpipe-4.1.2.jar: EvalLanguageId.java
java -cp lingpipe-4.1.2.jar: EvalLanguageId /home/samuel/leipzig/munged /home/samuel/leipzig/langid-leipzig.classifier 100000 50 1000
-------------------------------Results-------------------------------------

Reading classifier from file=/home/samuel/leipzig/langid-leipzig.classifier
Evaluating category=eng
Evaluating category=spa
TEST RESULTS
BASE CLASSIFIER EVALUATION
Categories=[eng, spa]
Total Count=2000
Total Correct=1000
Total Accuracy=0.5
95% Confidence Interval=0.5 +/- 0.02191346617949794
Confusion Matrix
reference \ response
  ,eng,spa
  eng,1000,0                                <---------- not diagonal sampling
  spa,1000,0
Macro-averaged Precision=NaN
Macro-averaged Recall=0.5
Macro-averaged F=NaN
Micro-averaged Results
         the following symmetries are expected:
           TP=TN, FN=FP
           PosRef=PosResp=NegRef=NegResp
           Acc=Prec=Rec=F
  Total=4000
  True Positive=1000
  False Negative=1000
  False Positive=1000
  True Negative=1000
  Positive Reference=2000
  Positive Response=2000
  Negative Reference=2000
  Negative Response=2000
  Accuracy=0.5
  Recall=0.5
  Precision=0.5
  Rejection Recall=0.5
  Rejection Precision=0.5
  F(1)=0.5
  Fowlkes-Mallows=2000.0
  Jaccard Coefficient=0.3333333333333333
  Yule's Q=0.0
  Yule's Y=0.0
  Reference Likelihood=0.5
  Response Likelihood=0.5
  Random Accuracy=0.5
  Random Accuracy Unbiased=0.5
  kappa=0.0
  kappa Unbiased=0.0
  kappa No Prevalence=0.0
  chi Squared=0.0
  phi Squared=0.0
  Accuracy Deviation=0.007905694150420948
Random Accuracy=0.5
Random Accuracy Unbiased=0.625
kappa=0.0
kappa Unbiased=-0.3333333333333333
kappa No Prevalence =0.0
Reference Entropy=1.0
Response Entropy=NaN
Cross Entropy=Infinity
Joint Entropy=1.0
Conditional Entropy=0.0
Mutual Information=0.0
Kullback-Liebler Divergence=Infinity
chi Squared=NaN
chi-Squared Degrees of Freedom=1
phi Squared=NaN
Cramer's V=NaN
lambda A=0.0
lambda B=NaN

ONE VERSUS ALL EVALUATIONS BY CATEGORY


CATEGORY[0]=eng VERSUS ALL

First-Best Precision/Recall Evaluation
  Total=2000
  True Positive=1000
  False Negative=0
  False Positive=1000
  True Negative=0
  Positive Reference=1000
  Positive Response=2000
  Negative Reference=1000
  Negative Response=0
  Accuracy=0.5
  Recall=1.0
  Precision=0.5
  Rejection Recall=0.0
  Rejection Precision=NaN
  F(1)=0.6666666666666666
  Fowlkes-Mallows=1414.2135623730949
  Jaccard Coefficient=0.5
  Yule's Q=NaN
  Yule's Y=NaN
  Reference Likelihood=0.5
  Response Likelihood=1.0
  Random Accuracy=0.5
  Random Accuracy Unbiased=0.625
  kappa=0.0
  kappa Unbiased=-0.3333333333333333
  kappa No Prevalence=0.0
  chi Squared=NaN
  phi Squared=NaN
  Accuracy Deviation=0.011180339887498949


CATEGORY[1]=spa VERSUS ALL

First-Best Precision/Recall Evaluation
  Total=2000
  True Positive=0
  False Negative=1000
  False Positive=0
  True Negative=1000
  Positive Reference=1000
  Positive Response=0
  Negative Reference=1000
  Negative Response=2000
  Accuracy=0.5
  Recall=0.0
  Precision=NaN
  Rejection Recall=1.0
  Rejection Precision=0.5
  F(1)=NaN
  Fowlkes-Mallows=NaN
  Jaccard Coefficient=0.0
  Yule's Q=NaN
  Yule's Y=NaN
  Reference Likelihood=0.5
  Response Likelihood=0.0
  Random Accuracy=0.5
  Random Accuracy Unbiased=0.625
  kappa=0.0
  kappa Unbiased=-0.3333333333333333
  kappa No Prevalence=0.0
  chi Squared=NaN
  phi Squared=NaN
  Accuracy Deviation=0.011180339887498949

/-----------------------------------------------------------------------/
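For intuition about what the evaluation above is measuring: the classifier keeps one character n-gram language model per category, and a pair of correctly trained models separates eng and spa easily. The following toy classifier is my own illustration of that idea (character bigrams with crude add-one smoothing), not LingPipe's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Toy character-bigram language identifier: counts bigrams per category
// during training, then classifies by summed smoothed log-probabilities.
class ToyLangId {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();

    // Adds every character bigram of the training text to the category's counts.
    void train(String category, String text) {
        Map<String, Integer> c = counts.computeIfAbsent(category, k -> new HashMap<>());
        for (int i = 0; i + 2 <= text.length(); i++) {
            c.merge(text.substring(i, i + 2), 1, Integer::sum);
            totals.merge(category, 1, Integer::sum);
        }
    }

    // Returns the category whose bigram distribution best matches the text.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cat : counts.keySet()) {
            double score = 0.0;
            for (int i = 0; i + 2 <= text.length(); i++) {
                int n = counts.get(cat).getOrDefault(text.substring(i, i + 2), 0);
                score += Math.log((n + 1.0) / (totals.get(cat) + 1.0));
            }
            if (score > bestScore) { bestScore = score; best = cat; }
        }
        return best;
    }
}
```

Even this toy version gets eng vs. spa right on short sentences, which is why a confusion matrix with an all-eng column points to a training or setup problem rather than to the method itself.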

6. Then we tried a real evaluation with a Spanish text:

/-------------------Command line----------------------------------/

javac -cp lingpipe-4.1.2.jar: ClassifyLang.java
java -cp lingpipe-4.1.2.jar: ClassifyLang

/-------------------------------------------------------------------------/

<---------------------------------Result------------------------------------>
Text:   Yo soy una persona increíble y muy inteligente, me admiro a mi mismo lo que me hace sentir ansiedad de lo que viene, por que es algo grandioso lleno de cosas buenas y de ahora en adelante estaré enfocado y optimista aunque tengo que aclarar que no lo haré por querer algo, sino por que es mi pasión. 
Best    Language:   eng     <------------- Wrong Result

<----------------------------------------------------------------------->

The code of ClassifyLang.java:

import com.aliasi.classify.Classification;
import com.aliasi.classify.LMClassifier;
import com.aliasi.util.AbstractExternalizable;

import java.io.File;
import java.io.IOException;

public class ClassifyLang {

    public static String text =
            "Yo soy una persona increíble y muy inteligente, me admiro a mi mismo"
            + " estoy ansioso de lo que viene, por que es algo grandioso lleno de cosas buenas"
            + " y de ahora en adelante estaré enfocado y optimista"
            + " aunque tengo que aclarar que no lo haré por querer algo, sino por que no es difícil serlo.";

    private static final File MODEL_DIR =
            new File("/home/samuel/leipzig/langid-leipzig.classifier");

    public static void main(String[] args) throws IOException {

        System.out.println("Text: " + text);

        LMClassifier<?, ?> classifier;
        try {
            classifier = (LMClassifier<?, ?>) AbstractExternalizable.readObject(MODEL_DIR);
        } catch (IOException | ClassNotFoundException ex) {
            System.out.println("Problem with the Model");
            return; // don't fall through and call classify() on a null reference
        }

        Classification classification = classifier.classify(text);
        String bestCategory = classification.bestCategory();
        System.out.println("Best Language: " + bestCategory);
    }
}

7. I tried with the 1M metadata files but got the same result, and I also changed the n-gram number, again with the same result. I would really appreciate your help.

Well, after working a few days in the field of natural language processing, I found a way to determine the language of a text using OpenNLP. Here is the sample code: https://github.com/samuelchapas/languagePredictionOpenNLP/tree/master/TrainingLanguageDecOpenNLP

Here is the training corpus for the model created to make language predictions.

I decided to use OpenNLP to solve the problem described in this question; this library really has a complete stack of features. Here is the sample for the model training >

https://mega.nz/#F!HHYHGJ4Q!PY2qfbZr-e0w8tg3cUgAXg
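Once a model is trained (or the pretrained langdetect model is downloaded), using it from Java is short. A minimal sketch, assuming opennlp-tools on the classpath and a model file at an illustrative path (OpenNLP's pretrained detector returns ISO 639-3 codes such as spa):

```java
import java.io.File;
import java.io.IOException;

import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

// Minimal OpenNLP language-detection sketch: load a model, predict the
// most likely language of a string, print its code and confidence.
class DetectLangSketch {
    public static void main(String[] args) throws IOException {
        LanguageDetectorModel model =
                new LanguageDetectorModel(new File("langdetect.bin")); // illustrative path
        LanguageDetector detector = new LanguageDetectorME(model);

        Language best = detector.predictLanguage(
                "Yo soy una persona increíble y muy inteligente.");
        System.out.println(best.getLang() + " " + best.getConfidence());
    }
}
```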