Apache Tika fails to detect language on short sentence. Why?

Question

I am trying to detect the language of a short phrase and was surprised that the result is wrong.

    LanguageDetector detector = new OptimaizeLangDetector();
    try {
        detector.loadModels();
    } catch (IOException e) {
        LOG.error(e.getMessage(), e);
        throw new ExceptionInInitializerError(e);
    }
    LanguageResult languageResult = detector.detect("Hello, my friend!");

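languageResult reports Norwegian with "medium" confidence. Why? I expected English. Longer phrases seem to be detected correctly. Does this mean Apache Tika should not be used for short texts?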

Answer 1

It is not meant for short texts. As the documentation says:

Implementation of the LanguageDetector API that uses https://github.com/optimaize/language-detector

From https://tika.apache.org/1.13/api/org/apache/tika/langdetect/OptimaizeLangDetector.html

If you look at that GitHub project and check its Challenges section, it notes problems with short text:

This software does not work as well when the input text to analyze is short, or unclean. For example tweets.

From the Challenges section of https://github.com/optimaize/language-detector
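For intuition about why short input is hard: the optimaize library works from character n-gram statistics, so the amount of evidence grows with the length of the text. The sketch below is a toy, not the library's code. The class name TrigramCount and the choice of trigrams only are assumptions made for illustration. It simply counts how many character trigrams a phrase yields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TrigramCount {
    // Lower-cases the text, collapses every run of non-letters into one space,
    // and returns the overlapping 3-character windows of the result.
    static List<String> trigrams(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", " ")
                .trim();
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= cleaned.length(); i++) {
            grams.add(cleaned.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> grams = trigrams("Hello, my friend!");
        System.out.println(grams.size() + " trigrams: " + grams);
    }
}
```

The phrase from the question yields only 13 trigrams under this scheme, and short fragments of common words are likely to occur in several related languages, which is consistent with the detector hesitating between English and Norwegian.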

Answer 2

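I can reproduce the issue. This may not answer the question directly, but it can be considered a workaround...

It appears that if you know which languages to expect, you can pass them to the detector through the loadModels(models) method. This helps detect English correctly: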

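    // Package names as in Tika 1.x (see the 1.13 Javadoc link above).
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.tika.langdetect.OptimaizeLangDetector;
    import org.apache.tika.language.detect.LanguageDetector;
    import org.apache.tika.language.detect.LanguageResult;

    try {
        // Load only the models for the languages expected in the input.
        Set<String> models = new HashSet<>();
        models.add("en");
        models.add("ru");
        models.add("de");
        LanguageDetector detector = new OptimaizeLangDetector()
                // .setShortText(true)
                .loadModels(models);
                // .loadModels();
        LanguageResult enResult = detector.detect("Hello, my friend!");
        // LanguageResult ruResult = detector.detect("Привет, мой друг!");
        // LanguageResult deResult = detector.detect("Hallo, mein Freund!");
        System.out.println(enResult.getLanguage());
    } catch (IOException e) {
        throw new ExceptionInInitializerError(e);
    }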
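One plausible reading of why this workaround helps is that loading fewer models removes close competitors (such as Norwegian) from the comparison. The sketch below illustrates that idea only. It does not use Tika, the class name CandidateFilter and the method best are hypothetical, and the scores are made-up numbers rather than real detector output.

```java
import java.util.Map;
import java.util.Set;

public class CandidateFilter {
    // Returns the highest-scoring language among the allowed ones,
    // or "unknown" if no allowed language has a score.
    static String best(Map<String, Double> scores, Set<String> allowed) {
        String winner = "unknown";
        double top = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (allowed.contains(e.getKey()) && e.getValue() > top) {
                top = e.getValue();
                winner = e.getKey();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // Made-up scores for illustration only; not real detector output.
        Map<String, Double> scores =
                Map.of("no", 0.41, "en", 0.38, "de", 0.12, "ru", 0.01);
        System.out.println(best(scores, scores.keySet()));          // no
        System.out.println(best(scores, Set.of("en", "ru", "de"))); // en
    }
}
```

With every language in play, the slightly higher Norwegian score wins. Once the candidates are limited to en, ru and de, English is the best remaining match. The trade-off is that text in a language outside the loaded set will be forced into one of the loaded ones.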
