How to fix langdetect's unstable results

Question

I want to use langdetect to detect the language of a text. According to the documentation, I have to set a seed to get stable results.

Language detection algorithm is non-deterministic, which means that if you try to run it on a text which is either too short or too ambiguous, you might get different results everytime you run it. To enforce consistent results, call following code before the first language detection:

As shown below, the results are still not stable. What am I missing?

from langdetect import detect, detector_factory, detect_langs

my_string = "Hi, my friend lives next to me. Can you call her? Thibault François. Envoyé depuis mon mobile"

detector_factory.seed = 42

for i in range(5):
    print(detect_langs(my_string), detect(my_string))

Example output:

[fr:0.7142820855500301, en:0.28571744799229243] en
[fr:0.7142837342663328, en:0.2857140098811736] en
[en:0.571427940246422, fr:0.4285710874902514] fr
[en:0.5714284102904427, fr:0.42857076299207464] fr
[en:0.5714277269187811, fr:0.4285715961184375] fr

Answer 1

It works if you use DetectorFactory (as the documentation suggests) instead of detector_factory.

from langdetect import detect, DetectorFactory, detect_langs

my_string = "Hi, my friend lives next to me. Can you call her? Thibault François. Envoyé depuis mon mobile"

DetectorFactory.seed = 42

for i in range(5):
    print(detect_langs(my_string), detect(my_string))

Output:

[en:0.5714271973455635, fr:0.42857096898887964] en
[en:0.5714271973455635, fr:0.42857096898887964] en
[en:0.5714271973455635, fr:0.42857096898887964] en
[en:0.5714271973455635, fr:0.42857096898887964] en
[en:0.5714271973455635, fr:0.42857096898887964] en
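A likely reason the code in the question fails silently: detector_factory appears to be the module that contains the DetectorFactory class, so assigning detector_factory.seed only creates a new attribute on the module that nothing reads. The sketch below reproduces that pattern with stand-ins and does not import langdetect; the simplified DetectorFactory class and its draw() method are hypothetical and only imitate a class whose instances read a class-level seed.

```python
import random
import types


class DetectorFactory:
    """Simplified stand-in: instances read the seed from this class attribute."""
    seed = None

    def draw(self):
        # Seed a private random generator only if a seed was configured.
        rng = random.Random()
        if self.seed is not None:
            rng.seed(self.seed)
        return rng.random()


# Stand-in for the module that happens to contain the class.
detector_factory = types.ModuleType("detector_factory")
detector_factory.DetectorFactory = DetectorFactory

# Mistake from the question: this only adds an unused attribute to the module.
detector_factory.seed = 42
seed_after_module_assignment = DetectorFactory.seed  # the class is untouched

# Fix from this answer: set the attribute on the class itself.
DetectorFactory.seed = 42
first, second = DetectorFactory().draw(), DetectorFactory().draw()

print(seed_after_module_assignment, first == second)
```

Python raises no error for the module-level assignment, which is why the typo is easy to miss.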

Answer 2

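Spacy is confused by that sentence, and rightly so. The problem is finding out that it is confused. Setting the seed gives you something that is stable but potentially still inconsistent. Consider the following very slight modification to that code: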

for i in range(5):
    DetectorFactory.seed = 42+i
    print(detect_langs(my_string), detect(my_string))

Every time I run this, I get

[en:0.5714271973455635, fr:0.4285709689888797] en
[fr:0.7142849688010372, en:0.2857145735373333] fr
[fr:0.7142834322119054, en:0.2857163285762464] fr
[fr:0.5714278163020392, en:0.4285693437919268] fr
[fr:0.9999946932803276] fr

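So if you start with a seed of 46 instead of 42, langdetect tells you "I'm really sure that's French." This inconsistent behavior seems to happen often with text that is split roughly evenly between two languages. The best strategy I can come up with to resolve this is as follows: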

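N times (N = 5 or 7 or ...), set DetectorFactory.seed in some stable way, run detect_langs(), and remember the results.
If the top language is not the same in all N runs, conclude that Spacy is confused, probably because the text mixes several languages (as is the case here) or because it is too short.
If the top language is the same in all runs, look at the median score (or the minimum or ...). If it is too low, also conclude that Spacy is confused.
Otherwise, accept Spacy's result.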
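The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the names vote_language and langdetect_guess, the default seed list, and the 0.6 threshold are made up for this example. Only DetectorFactory.seed, detect_langs(), and the .lang and .prob attributes of its results come from langdetect.

```python
from statistics import median
from typing import Callable, List, Optional, Sequence, Tuple

# (language code, probability) pairs, best guess first.
Ranking = List[Tuple[str, float]]


def vote_language(
    text: str,
    ranked_guess: Callable[[str, int], Ranking],
    seeds: Sequence[int] = (0, 1, 2, 3, 4),
    min_median: float = 0.6,  # hypothetical threshold, tune it for your data
) -> Optional[str]:
    """Return the agreed language, or None if the detector looks confused."""
    tops = []
    for seed in seeds:
        ranking = ranked_guess(text, seed)
        if not ranking:
            return None  # no guess at all for this seed
        tops.append(ranking[0])

    if len({lang for lang, _ in tops}) != 1:
        return None  # runs disagree: mixed-language or too-short text
    if median(prob for _, prob in tops) < min_median:
        return None  # runs agree, but only weakly
    return tops[0][0]


def langdetect_guess(text: str, seed: int) -> Ranking:
    """Adapter for langdetect, imported lazily so the voting logic runs without it."""
    from langdetect import DetectorFactory, detect_langs

    DetectorFactory.seed = seed  # fixed seed per run keeps the whole vote reproducible
    return [(guess.lang, guess.prob) for guess in detect_langs(text)]
```

With langdetect installed, the call would look like vote_language(my_string, langdetect_guess). A return value of None means "confused" and the caller decides what to do with such text.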

To underline Spacy's confusion: if I use 89 as the seed, detect_langs returns

[en:0.7142830387547032, fr:0.2857155716263734]

Finally, this discussion also applies when you use the pipeline. Without setting the seed, something like this:

    doc = nlp(my_string)
    print(doc._.language["score"])
    doc = nlp(my_string)  # run the pipeline again to get a fresh detection
    print(doc._.language["score"])

may print two different score values.
