How to detect text (string) language in iOS?

For example, given the following strings:

let textEN = "The quick brown fox jumps over the lazy dog"
let textES = "El zorro marrón rápido salta sobre el perro perezoso"
let textAR = "الثعلب البني السريع يقفز فوق الكلب الكسول"
let textDE = "Der schnelle braune Fuchs springt über den faulen Hund"

I want to detect the language used in each of them.

Let's assume that the signature of the implemented function is:

func detectedLanguage<T: StringProtocol>(_ forString: T) -> String?

It returns an optional string, in case no language is detected.

Thus the appropriate results would be:

let englishDetectedLanguage = detectedLanguage(textEN) // => English
let spanishDetectedLanguage = detectedLanguage(textES) // => Spanish
let arabicDetectedLanguage = detectedLanguage(textAR) // => Arabic
let germanDetectedLanguage = detectedLanguage(textDE) // => German

Is there a simple way to achieve this?

Latest versions (iOS 12+)

Quick Answer:

You can achieve this with NLLanguageRecognizer, as follows:

import NaturalLanguage

func detectedLanguage(for string: String) -> String? {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(string)
    guard let languageCode = recognizer.dominantLanguage?.rawValue else { return nil }
    let detectedLanguage = Locale.current.localizedString(forIdentifier: languageCode)
    return detectedLanguage
}

Older versions (iOS 11+)

Quick Answer:

You can achieve this with NSLinguisticTagger, as follows:

import Foundation

func detectedLanguage<T: StringProtocol>(_ forString: T) -> String? {
    // NSLinguisticTagger.dominantLanguage(for:) returns a BCP-47 code such as "en"
    guard let languageCode = NSLinguisticTagger.dominantLanguage(for: String(forString)) else { return nil }
    let detectedLanguage = Locale.current.localizedString(forIdentifier: languageCode)
    return detectedLanguage
}

Details:

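First of all, you should know that what you are asking about mainly belongs to the world of Natural Language Processing (NLP).

Since NLP covers much more than text language detection, the rest of this answer will not go into NLP specifics.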

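Obviously, implementing such functionality is not easy, especially once you start caring about the details of the process, such as splitting the text into sentences and even into words, then recognizing names and punctuation, and so on. I bet you are thinking "what a painful process! It is not even logical to do it by myself". Fortunately, iOS supports NLP (in fact, the NLP APIs are available on all Apple platforms, not only iOS), which makes what you are aiming for easy to implement. The core component you will work with is NSLinguisticTagger: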

Analyze natural language text to tag part of speech and lexical class, identify names, perform lemmatization, and determine the language and script.

NSLinguisticTagger provides a uniform interface to a variety of natural language processing functionality with support for many different languages and scripts. You can use this class to segment natural language text into paragraphs, sentences, or words, and tag information about those segments, such as part of speech, lexical class, lemma, script, and language.

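As mentioned in the class documentation, the method you are looking for, under the "Determining the Dominant Language and Orthography" section, is dominantLanguage(for:):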

Returns the dominant language for the specified string.

. . .

Return Value

The BCP-47 tag identifying the dominant language of the string, or the tag "und" if a specific language cannot be determined.

You might notice that NSLinguisticTagger has existed since iOS 5. However, the dominantLanguage(for:) method is only available on iOS 11 and above, because it was developed on top of the Core ML framework:

. . .

Core ML is the foundation for domain-specific frameworks and functionality. Core ML supports Vision for image analysis, Foundation for natural language processing (for example, the NSLinguisticTagger class), and GameplayKit for evaluating learned decision trees. Core ML itself builds on top of low-level primitives like Accelerate and BNNS, as well as Metal Performance Shaders.

The value returned by calling dominantLanguage(for:) with "The quick brown fox jumps over the lazy dog":

NSLinguisticTagger.dominantLanguage(for: "The quick brown fox jumps over the lazy dog")

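would be the optional string "en". However, this is not yet the desired output; the expectation is to get "English" instead. Well, that is exactly what you get by passing the obtained language code to one of the Locale struct's localization methods, such as localizedString(forLanguageCode:) or, as used here, localizedString(forIdentifier:):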

Locale.current.localizedString(forIdentifier: "en") // English
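
As a small aside on this localization step: localizedString(forIdentifier:) names the language in the language of the Locale it is called on, so with Locale.current the result depends on the user's settings. Here is a minimal sketch, assuming a hypothetical helper languageName(for:in:) that is not part of the answer itself, which makes the target locale explicit:

```swift
import Foundation

// Hypothetical helper (not from the original answer): returns the display name
// of a language code, written in the language of the given locale.
func languageName(for code: String, in locale: Locale = .current) -> String? {
    return locale.localizedString(forIdentifier: code)
}

let english = Locale(identifier: "en_US")
let spanish = Locale(identifier: "es_ES")

// Name of German as an English-speaking user would see it
print(languageName(for: "de", in: english) ?? "nil") // German
// Name of German as a Spanish-speaking user would see it
print(languageName(for: "de", in: spanish) ?? "nil")
```

If you always want English names regardless of the device settings, passing a fixed locale like this is one option; otherwise Locale.current is probably what you want.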

Putting it all together:

As mentioned in the "Quick Answer" code snippet, the function would be:

func detectedLanguage<T: StringProtocol>(_ forString: T) -> String? {
    guard let languageCode = NSLinguisticTagger.dominantLanguage(for: String(forString)) else {
        return nil
    }

    let detectedLanguage = Locale.current.localizedString(forIdentifier: languageCode)

    return detectedLanguage
}

Output:

As expected:

let englishDetectedLanguage = detectedLanguage(textEN) // => English
let spanishDetectedLanguage = detectedLanguage(textES) // => Spanish
let arabicDetectedLanguage = detectedLanguage(textAR) // => Arabic
let germanDetectedLanguage = detectedLanguage(textDE) // => German

Note:

There are still cases where a language name cannot be obtained for the given string, for example:

let textUND = "SdsOE"
let undefinedDetectedLanguage = detectedLanguage(textUND) // => Unknown language

Or the result could even be nil:

let rubbish = "000747322"
let rubbishDetectedLanguage = detectedLanguage(rubbish) // => nil

Still, I find that a not-bad result when it comes to providing useful output...
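
If you would rather get nil than "Unknown language" in the first case, one option is to filter out the "und" tag that dominantLanguage(for:) is documented to return when no specific language can be determined. This is only a sketch, and the name detectedLanguageOrNil is made up for illustration:

```swift
import Foundation

// Hypothetical variant of detectedLanguage (name invented for this sketch):
// treats the undetermined tag "und" the same as "no language detected".
func detectedLanguageOrNil<T: StringProtocol>(_ forString: T) -> String? {
    guard let languageCode = NSLinguisticTagger.dominantLanguage(for: String(forString)),
          languageCode != "und" else {
        return nil
    }
    return Locale.current.localizedString(forIdentifier: languageCode)
}
```

With this variant the caller only has to handle one "not detected" case instead of two.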


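Furthermore:

About NSLinguisticTagger:

Although I will not dive deep into NSLinguisticTagger usage, I would like to point out that it has a couple of really cool features beyond simply detecting the language of a given text. As a pretty simple example: using the lemma scheme when enumerating tags is very helpful when working with information retrieval, since it lets you recognize the word "drive" when given the word "driving".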
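
To make the lemma idea a bit more concrete, here is a rough sketch using enumerateTags(in:unit:scheme:options:using:) (iOS 11+). The helper name lemmas(in:) and the sample sentence are my own, and the exact lemmas you get back depend on the system's language models, so take it as an illustration rather than a guaranteed result:

```swift
import Foundation

// Hypothetical helper (name invented for this sketch): returns one entry per word,
// using the word's lemma when the tagger provides one and the word itself otherwise.
func lemmas(in text: String) -> [String] {
    let tagger = NSLinguisticTagger(tagSchemes: [.lemma], options: 0)
    tagger.string = text

    let range = NSRange(text.startIndex..<text.endIndex, in: text)
    let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation]
    var result: [String] = []

    tagger.enumerateTags(in: range, unit: .word, scheme: .lemma, options: options) { tag, tokenRange, _ in
        let word = (text as NSString).substring(with: tokenRange)
        // tag is nil when no lemma is available for this token
        result.append(tag?.rawValue ?? word)
    }
    return result
}

print(lemmas(in: "She was driving the cars"))
```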

You can use the tag(at:scheme:tokenRange:sentenceRange:) method of NSLinguisticTagger. It supports iOS 5 and later.

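import Foundation

func detectLanguage<T: StringProtocol>(for text: T) -> String? {
    // Request only the language scheme; no tagger options are needed
    let tagger = NSLinguisticTagger(tagSchemes: [.language], options: 0)
    tagger.string = String(text)

    // tag(at:scheme:tokenRange:sentenceRange:) returns an NSLinguisticTag?, so take its rawValue to get the language code
    guard let languageCode = tagger.tag(at: 0, scheme: .language, tokenRange: nil, sentenceRange: nil)?.rawValue else { return nil }
    return Locale.current.localizedString(forIdentifier: languageCode)
}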

detectLanguage(for: "The quick brown fox jumps over the lazy dog")              // English
detectLanguage(for: "El zorro marrón rápido salta sobre el perro perezoso")     // Spanish
detectLanguage(for: "الثعلب البني السريع يقفز فوق الكلب الكسول")                // Arabic
detectLanguage(for: "Der schnelle braune Fuchs springt über den faulen Hund")   // German

I tried NSLinguisticTagger with a short input text like "hello", and it always recognized it as Italian. Luckily, Apple recently added NLLanguageRecognizer in iOS 12, and it seems to be more accurate :D

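import NaturalLanguage

let text = "hello" // sample input

if #available(iOS 12.0, *) {
    let languageRecognizer = NLLanguageRecognizer()
    languageRecognizer.processString(text)
    // dominantLanguage is nil when no language can be determined, so avoid force-unwrapping it
    if let code = languageRecognizer.dominantLanguage?.rawValue {
        let language = Locale.current.localizedString(forIdentifier: code)
        print(language ?? code)
    }
}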
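
For short inputs like this, it may also help to look at more than one candidate. NLLanguageRecognizer exposes languageHypotheses(withMaximum:) for the most likely languages with their probabilities, and languageConstraints to restrict the candidates. Below is a sketch under those assumptions; the function name rankedLanguages(for:maximum:constraints:) is invented for illustration:

```swift
import NaturalLanguage

// Hypothetical helper (name invented for this sketch): returns the most likely
// languages for the text, most probable first, optionally limited to `constraints`.
@available(iOS 12.0, macOS 10.14, *)
func rankedLanguages(for text: String,
                     maximum: Int = 3,
                     constraints: [NLLanguage] = []) -> [(language: NLLanguage, probability: Double)] {
    let recognizer = NLLanguageRecognizer()
    if !constraints.isEmpty {
        // Only consider the languages the app actually expects
        recognizer.languageConstraints = constraints
    }
    recognizer.processString(text)

    return recognizer.languageHypotheses(withMaximum: maximum)
        .sorted { $0.value > $1.value }
        .map { (language: $0.key, probability: $0.value) }
}

if #available(iOS 12.0, macOS 10.14, *) {
    for candidate in rankedLanguages(for: "hello", constraints: [.english, .italian]) {
        print(candidate.language.rawValue, candidate.probability)
    }
}
```

If your app only ever deals with a handful of languages, constraining the recognizer like this seems like a reasonable way to reduce misdetections on very short strings.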