Regex - Greek Characters in URL

Question

I have a custom router that uses regular expressions.

The problem is that I cannot match Greek characters.

Here are some lines from index.php:

$router->get('/theatre/plays', 'TheatreController', 'showPlays');
$router->get('/theatre/interviews', 'TheatreController', 'showInterviews');
$router->get('/theatre/[-\w\d\!\.]+', 'TheatreController', 'single_post');

Here are some lines from Router.php:

$found = 0;
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH); //get the url

////// Bla Bla Bla /////////

if ( $found = preg_match("#^$value$#", $path) )
{
    //Do stuff
}

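Now, when I try a URL like http://kourtis.app/theatre/α (note that the last character is the Greek letter alpha), it is somehow interpreted as http://kourtis.app/theatre/%CE%B1

I can see this when I var_dump($path) or copy and paste the URL.

I guess it has something to do with encoding, but everything (that I can think of) is in UTF-8.

Any ideas?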

--------------------------------

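Update: following a suggestion in the comments, the pattern /theatre/[α-ωΑ-Ω-\w\d\!\.]+ combined with urldecode on the $path variable (to decode the percent-encoding) works, but only for some Greek characters.

Some characters that produce an error are: κ π ρ χ.

The question now is... why? (By the way, /theatre/.+ works for many characters.)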

Answer 1

You can use

$router->get('/theatre/[^/]+', 'TheatreController', 'single_post');

as [^/]+ will match one or more characters other than /, since [^...] is a negated character class that matches any character other than those defined in the class.
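As a quick check of the negated class, here is a minimal sketch. The helper name matchesRoute is hypothetical (it is not part of the asker's router); it only wraps a route fragment in the same "#^...$#" pattern that Router.php builds.

```php
<?php
// Hypothetical helper: wraps a route fragment the same way Router.php does.
function matchesRoute(string $routePattern, string $path): bool
{
    // '#' is the delimiter, so '/' inside the pattern needs no escaping.
    return preg_match("#^$routePattern$#", $path) === 1;
}

$route = '/theatre/[^/]+';

var_dump(matchesRoute($route, '/theatre/%CE%B1')); // bool(true): percent-encoded alpha
var_dump(matchesRoute($route, '/theatre/α'));      // bool(true): decoded alpha
var_dump(matchesRoute($route, '/theatre/a/b'));    // bool(false): extra path segment
var_dump(matchesRoute($route, '/theatre/'));       // bool(false): empty segment
```

Because the class only excludes /, it accepts the path whether or not the percent-encoding has been decoded, which is why this route sidesteps the original problem.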

Note that you do not need \d if you use \w (\w already matches digits).

Also, you are not matching diacritics with your regex. If you need to match diacritics, add \p{M} to the regex: '/theatre/[-\w\p{M}!.]+'.
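To illustrate the diacritics point, the sketch below matches an accented alpha written in decomposed form (base letter plus a combining mark). The sample string is an assumption chosen for the demo. Whether \w alone covers combining marks may depend on the PCRE version bundled with PHP, so the example only shows the pattern that lists \p{M} explicitly.

```php
<?php
// "ά" written as alpha (U+03B1) followed by a combining acute accent (U+0301).
// The "\u{...}" escape needs PHP 7 or later.
$decomposed = "/theatre/\u{03B1}\u{0301}";

// The u modifier is required for \p{M} and for Unicode-aware \w.
$pattern = '#^/theatre/[-\w\p{M}!.]+$#u';

var_dump(preg_match($pattern, $decomposed)); // int(1)
```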

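Note that to allow \w to match Unicode letters/digits, you need to pass the /u modifier to the regex: $found = preg_match("#^$value$#u", $path). This treats the pattern and the input string as Unicode strings and makes shorthand classes like \w Unicode-aware.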
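The effect of the modifier can be sketched as follows. This assumes the path is percent-decoded first (here with rawurldecode, as the question's update does with urldecode), because % is not in the character class. The variable names other than $value and $path are made up for the demo.

```php
<?php
$encodedPath = '/theatre/%CE%B1';      // what var_dump($path) shows in the question
$path = rawurldecode($encodedPath);    // "/theatre/α"

$value = '/theatre/[-\w!.]+';

// Without u, the subject is read byte by byte; in PHP's default "C" locale
// the two bytes of α are not word characters, so this is expected to be 0.
$withoutU = preg_match("#^$value$#", $path);

// With u, \w is Unicode-aware and α counts as a letter.
$withU = preg_match("#^$value$#u", $path);            // 1

// Still percent-encoded: '%' is not in the class, so no match even with u.
$stillEncoded = preg_match("#^$value$#u", $encodedPath); // 0

var_dump($withoutU, $withU, $stillEncoded);
```

Keep in mind that with the u modifier preg_match returns false, not 0, when the subject is not valid UTF-8, so a strict comparison against 1 is the safer check.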

Another thing: you do not need to escape . inside a character class.

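Pattern details:

#...# - regex delimiters
^ - start of string
$value - the contents of the $value variable (since double-quoted strings in PHP allow interpolation)
$ - end of string
#u - the closing delimiter followed by the u modifier, which enables the PCRE_UTF8 and PCRE_UCP options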
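Putting the pieces together, a router loop along these lines might look like the sketch below. The function findHandler and the $routes array shape are hypothetical, since the real Router.php is only partially shown in the question; the route patterns and the "#^$value$#u" interpolation come from the thread.

```php
<?php
/**
 * Hypothetical sketch: returns the handler of the first matching route, or null.
 */
function findHandler(array $routes, string $requestUri): ?string
{
    $rawPath = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($rawPath)) {
        return null; // malformed URI or no path component
    }
    // Turn %CE%B1 back into α before matching.
    $path = rawurldecode($rawPath);

    foreach ($routes as $value => $handler) {
        // $value is interpolated between the anchors; u enables Unicode matching.
        if (preg_match("#^$value$#u", $path) === 1) {
            return $handler;
        }
    }
    return null;
}

// Specific routes come first so the catch-all does not shadow them.
$routes = [
    '/theatre/plays'      => 'showPlays',
    '/theatre/interviews' => 'showInterviews',
    '/theatre/[^/]+'      => 'single_post',
];

var_dump(findHandler($routes, '/theatre/plays'));                           // "showPlays"
var_dump(findHandler($routes, '/theatre/%CE%BA%CF%80%CF%81%CF%87?page=2')); // "single_post" (κπρχ)
var_dump(findHandler($routes, '/cinema'));                                  // NULL
```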

--------------------------------