
How is this mixed-character string split on Unicode word boundaries?

Consider the string "abc를". According to Unicode's demo implementation of word segmentation, this string should be split into two words, "abc" and "를". However, three different Rust implementations of word-boundary detection (regex, unic-segment, and unicode-segmentation) all disagree, and group the string as a single word. Which behavior is correct?

As a follow-up, if the grouping behavior is correct, what would be a good way to scan this string for the search term "abc" in a way that still mostly respects word boundaries (the goal is to check the validity of string translations)? I'd like to match strings like "abc를" but not strings like "abcdef".

I'm not so sure the word segmentation demo should be taken as ground truth, even though it is on the official site. For example, it considers "abc를" ("abc\uB97C") to be two separate words, but considers "abc를" ("abc\u1105\u1173\u11af") to be one word, even though the former decomposes into the latter.
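The discrepancy is easier to see by comparing the two encodings directly: the precomposed syllable is a single code point, while the decomposed form is three conjoining jamo. A std-only sketch (checking canonical equivalence itself would need a normalization crate such as unicode-normalization):

```rust
fn main() {
    let precomposed = "를";                       // U+B97C HANGUL SYLLABLE REUL
    let decomposed = "\u{1105}\u{1173}\u{11af}";  // conjoining jamo sequence

    // The two strings render identically but are different code point
    // sequences, so a naive byte/char comparison sees them as unequal.
    assert_ne!(precomposed, decomposed);
    println!("precomposed: {} code point(s), decomposed: {} code point(s)",
             precomposed.chars().count(), decomposed.chars().count());
    // prints "precomposed: 1 code point(s), decomposed: 3 code point(s)"
}
```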

The idea of a word boundary is not set in stone. Unicode has a Word Boundary specification that outlines where word breaks should and should not occur. However, it also has an extensive notes section detailing further cases (emphasis mine):

It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. The goal for the specification presented in this annex is to provide a workable default; tailored implementations can be more sophisticated.

For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.

...

My understanding is that the crates you listed follow the specification without any further context analysis. Why the demo disagrees, I can't say; perhaps it attempts to implement one of those edge cases.


To address your specific question, I'd recommend using Regex with \b to match a word boundary. Unfortunately, this follows the same Unicode rules and will not consider "를" to start a new word. However, this regex implementation offers an escape hatch to fall back to ASCII behavior. Simply use (?-u:\b) to match a non-Unicode boundary:

use regex::Regex;

fn main() {
    // A raw string is required here: in a normal string literal Rust would
    // reject `\b` as an unknown escape before the regex engine ever saw it.
    let pattern = Regex::new(r"(?-u:\b)abc(?-u:\b)").unwrap();
    println!("{:?}", pattern.find("some abcdef abc를 sentence"));
}

You can run this yourself on the playground to test your case and see if it works for you.
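If you'd rather avoid regular expressions, the same ASCII-boundary idea can be hand-rolled: accept a match only when it isn't flanked by ASCII alphanumerics. A minimal sketch (the function name `find_word` and its boundary rule are my own assumptions, not part of any crate):

```rust
/// Returns the byte offset of the first occurrence of `needle` in `haystack`
/// that is not flanked by ASCII alphanumerics (a loose, hand-rolled notion
/// of a "word" match, not a real Unicode word-boundary implementation).
fn find_word(haystack: &str, needle: &str) -> Option<usize> {
    let mut start = 0;
    while let Some(pos) = haystack[start..].find(needle) {
        let i = start + pos;
        // Characters immediately before and after the candidate match.
        let before = haystack[..i].chars().next_back();
        let after = haystack[i + needle.len()..].chars().next();
        let boundary = |c: Option<char>| c.map_or(true, |c| !c.is_ascii_alphanumeric());
        if boundary(before) && boundary(after) {
            return Some(i);
        }
        start = i + needle.len();
    }
    None
}

fn main() {
    // "abc를" matches ('를' is not ASCII alphanumeric), "abcdef" does not.
    assert_eq!(find_word("some abcdef abc를 sentence", "abc"), Some(12));
    assert_eq!(find_word("abcdef", "abc"), None);
}
```

Like the (?-u:\b) pattern, this treats any non-ASCII-alphanumeric character, including "를", as a boundary, so "abc를" matches while "abcdef" does not.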