Detect non-Latin characters with a regex Pattern in Java

Question

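I think "Latin characters" is what I mean in this question, but I'm not entirely sure that is the correct classification. I'm trying to use a regex pattern to test whether a string contains non-Latin characters. I expect the following results: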

"abcDE 123";  // Yes, this should match
"!@#$%^&*";   // Yes, this should match
"aaàààäää";   // Yes, this should match
"ベビードラ";   // No, this shouldn't match
"";  // No, this shouldn't match

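My understanding is that the built-in \p{IsLatin} preset only detects whether any character is Latin. I want to detect whether any character is not Latin.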

Pattern LatinPattern = Pattern.compile("\\p{IsLatin}");
Matcher matcher = LatinPattern.matcher(str);
if (!matcher.find()) {
    System.out.println("is NON latin");
    return;
}
System.out.println("is latin");

Answer 1

TL;DR: Use the regex ^[\p{Print}\p{IsLatin}]*$

You want a regex that matches if the string contains only:

Spaces
Digits
Punctuation
Latin characters (Unicode script "Latin")
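To see why \p{IsLatin} on its own does not cover this list, it can help to look at which Unicode script each kind of character belongs to. The following is a minimal sketch; the class name ScriptDemo and the helper scriptOf are made up for illustration, and non-ASCII characters are written as Unicode escapes so the source does not depend on file encoding.

```java
import java.lang.Character.UnicodeScript;

class ScriptDemo {
    // Hypothetical helper: returns the Unicode script name of the first code point of s.
    static String scriptOf(String s) {
        return UnicodeScript.of(s.codePointAt(0)).name();
    }

    public static void main(String[] args) {
        // "\u00E0" is a with grave accent, "\u30D9" is the katakana letter BE.
        String[] samples = { "a", "\u00E0", "1", " ", "!", "\u30D9" };
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + scriptOf(s));
        }
    }
}
```

Letters such as a and à report LATIN, while digits, spaces and punctuation report COMMON, which is why the pattern needs something in addition to \p{IsLatin}.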

The simplest way is to combine \p{IsLatin} with \p{Print}, where Pattern defines \p{Print} as:

\p{Print} - A printable character: [\p{Graph}\x20]
- \p{Graph} - A visible character: [\p{Alnum}\p{Punct}]
  - \p{Alnum} - An alphanumeric character: [\p{Alpha}\p{Digit}]
    - \p{Alpha} - An alphabetic character: [\p{Lower}\p{Upper}]
      - \p{Lower} - A lower-case alphabetic character: [a-z]
      - \p{Upper} - An upper-case alphabetic character: [A-Z]
    - \p{Digit} - A decimal digit: [0-9]
  - \p{Punct} - Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
- \x20 - A space

That makes \p{Print} the same as [\p{ASCII}&&\P{Cntrl}], i.e. the ASCII characters that are not control characters.
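The equivalence can be checked empirically. The sketch below compares the two classes for every char value in the Basic Multilingual Plane, assuming the default Pattern flags (that is, without UNICODE_CHARACTER_CLASS, which changes the meaning of the POSIX classes). The class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class PrintEquivalenceDemo {
    // Returns true if \p{Print} and [\p{ASCII}&&\P{Cntrl}] agree on every
    // single-char string from U+0000 to U+FFFF.
    static boolean sameOverBmp() {
        Pattern print = Pattern.compile("\\p{Print}");
        Pattern asciiNonControl = Pattern.compile("[\\p{ASCII}&&\\P{Cntrl}]");
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            String s = String.valueOf((char) cp);
            boolean a = print.matcher(s).matches();
            boolean b = asciiNonControl.matcher(s).matches();
            if (a != b) {
                return false; // the two classes disagree on this character
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("classes agree: " + sameOverBmp());
    }
}
```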

The \p{Alpha} part overlaps with \p{IsLatin}, but that's fine, since a character class eliminates duplicates.

So, the regex is: ^[\p{Print}\p{IsLatin}]*$
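An equivalent way to phrase the check, closer to the question's wording ("is any character not Latin?"), is to search for a single offending character with the negated class. This is a sketch rather than part of the original answer; NonLatinFinder, containsNonLatin and firstNonLatinIndex are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonLatinFinder {
    // Matches one character that is neither printable ASCII nor Latin script.
    private static final Pattern NON_LATIN =
            Pattern.compile("[^\\p{Print}\\p{IsLatin}]");

    // True if at least one character falls outside the allowed set.
    static boolean containsNonLatin(String s) {
        return NON_LATIN.matcher(s).find();
    }

    // Index of the first offending character, or -1 if there is none.
    static int firstNonLatinIndex(String s) {
        Matcher m = NON_LATIN.matcher(s);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // "\u30D9\u30D3" are two katakana letters.
        String[] inputs = { "abcDE 123", "!@#$%^&*", "\u30D9\u30D3", "" };
        for (String input : inputs) {
            System.out.println("\"" + input + "\": " + containsNonLatin(input));
        }
    }
}
```

With this formulation the empty string contains no offending character, so the caller has to decide separately how to treat empty input. Control characters such as tab or newline are neither \p{Print} nor Latin script, so they are reported as non-Latin here.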

Test

Pattern latinPattern = Pattern.compile("^[\\p{Print}\\p{IsLatin}]*$");

String[] inputs = { "abcDE 123", "!@#$%^&*", "aaàààäää", "ベビードラ", "" };
for (String input : inputs) {
    System.out.print("\"" + input + "\": ");
    Matcher matcher = latinPattern.matcher(input);
    if (! matcher.find()) {
        System.out.println("is NON latin");
    } else {
        System.out.println("is latin");
    }
}

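Output

"abcDE 123": is latin
"!@#$%^&*": is latin
"aaàààäää": is latin
"ベビードラ": is NON latin
"": is latin

Because the pattern uses the * quantifier, the empty string matches as well. To reject the empty string, as the question asks, use + instead: ^[\p{Print}\p{IsLatin}]+$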

Answer 2

The main Latin Unicode blocks are (further Latin blocks, such as Latin Extended Additional at U+1E00–U+1EFF, are not covered here):

\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F

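So, the answer is either of the following:

Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
// or, equivalently, as a code point range (U+0000-U+024F):
Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$");

Note that in Java, the underscores in the Unicode property class names are removed.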

See the Java demo:

List<String> strs = Arrays.asList(
        "abcDE 123",  // Yes, this should match
        "!@#$%^&*",   // Yes, this should match
        "aaàààäää",   // Yes, this should match
        "ベビードラ", // No, this shouldn't match
        "");     // No, this shouldn't match  
Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
//Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$"); //U+0000-U+024F
for (String str : strs) {
    Matcher matcher = LatinPattern.matcher(str);
    if (!matcher.find()) {
        System.out.println(str + " => is NON Latin");
        //return;
    } else {
        System.out.println(str + " => is Latin");
    }
}

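Note: if you replace .find() with .matches(), you can drop the ^ and $ from the pattern, because .matches() requires the whole input to match.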
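A minimal sketch of the .matches() variant, using the same four block names without anchors. LatinBlockMatcher and isLatin are names chosen for this example only, and non-ASCII test strings are written as Unicode escapes.

```java
import java.util.regex.Pattern;

class LatinBlockMatcher {
    // No ^ or $ needed: matches() only succeeds if the entire input matches.
    private static final Pattern LATIN_BLOCKS = Pattern.compile(
            "[\\p{InBasicLatin}\\p{InLatin-1Supplement}"
            + "\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+");

    // True if s is non-empty and every character lies in U+0000-U+024F.
    static boolean isLatin(String s) {
        return LATIN_BLOCKS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "aa\u00E0\u00E4" is "aa" followed by a-grave and a-umlaut;
        // "\u30D9\u30D3" are two katakana letters.
        String[] inputs = { "abcDE 123", "!@#$%^&*", "aa\u00E0\u00E4", "\u30D9\u30D3", "" };
        for (String input : inputs) {
            System.out.println(input + " => " + (isLatin(input) ? "is Latin" : "is NON Latin"));
        }
    }
}
```

Compiling the pattern once into a static final field avoids recompiling it on every call, which matters if the check runs in a loop.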

Output:

abcDE 123 => is Latin
!@#$%^&* => is Latin
aaàààäää => is Latin
ベビードラ => is NON Latin
 => is NON Latin
