preg_match and file_get_contents and æ ø å

Question

I have a question about preg_match. Suppose I try to extract content like this: Århus er en by i Danmark means Århus is a city in Denmark

preg_match( "#<div id=[\"']faktaDiv[\"']>(.*?)</div>#si", $webside, $a2 );

echo $a2[1];

Then the output is:

�rhus er en by i Danmark means �rhus is a city in Denmark

How can I fix this? Basically, it needs to allow æ ø å.

Answer 1

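For the regex approach you need the u modifier, which makes PCRE treat both the pattern and the subject string as UTF-8. The i and s you are already using are two other modifiers; for the full list of PHP pattern modifiers see http://php.net/manual/en/reference.pcre.pattern.modifiers.php. Note that with u the subject must be valid UTF-8, otherwise preg_match fails and returns false.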

preg_match( "#<div id=[\"']faktaDiv[\"']>(.*?)</div>#siu", $webside, $a2 );
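As a minimal sketch of how the u modifier behaves, the snippet below runs the same pattern against a UTF-8 string and against an ISO-8859-1 byte string. The helper name extractFakta and both sample strings are made up for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper (not from the original post): returns the div's text, or null.
function extractFakta(string $html): ?string
{
    $pattern = "#<div id=[\"']faktaDiv[\"']>(.*?)</div>#siu";
    if (preg_match($pattern, $html, $m) === 1) {
        return $m[1]; // $m[0] is the full match, $m[1] the first capture group
    }
    return null; // no match, or preg_match() failed (see preg_last_error())
}

// Assumed sample inputs standing in for the result of file_get_contents().
$utf8   = '<div id="faktaDiv">Århus er en by i Danmark</div>';
$latin1 = "<div id=\"faktaDiv\">\xC5rhus er en by i Danmark</div>"; // Å as one ISO-8859-1 byte

var_dump(extractFakta($utf8));   // the captured text
var_dump(extractFakta($latin1)); // null: the subject is not valid UTF-8

// With /u, an invalid UTF-8 subject is reported through preg_last_error().
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```

If the second case is what happens with the real page, the page is probably not served as UTF-8 and would need converting before the u modifier can help.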

However, it looks like you are parsing HTML, so I would use DOMDocument to parse the string instead.

$doc = new DOMDocument();
$doc->loadHTML('<div id="faktaDiv">Test Stuff</div>');
$divs = $doc->getElementsByTagName('div');
foreach ($divs as $div) {
    if ($div->getAttribute('id') == 'faktaDiv') {
        echo $div->nodeValue;
    }
}
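The DOMDocument approach can run into the same æ ø å problem, because loadHTML() generally assumes ISO-8859-1 when the markup does not declare a charset. A rough sketch of one common workaround, assuming the input is UTF-8 and using a made-up sample string, is to prepend a charset declaration before parsing:

```php
<?php
// Assumed sample markup; in the question it would come from file_get_contents().
$html = '<div id="faktaDiv">Århus er en by i Danmark</div>';

// Keep libxml warnings about the fragment out of the output.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// The prepended meta tag is an assumption of this sketch: it tells the parser
// to read the bytes as UTF-8 instead of its ISO-8859-1 default.
$doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);

// Look the div up by its id attribute with XPath.
$xpath = new DOMXPath($doc);
$node  = $xpath->query('//div[@id="faktaDiv"]')->item(0);

// nodeValue is returned as UTF-8.
$fakta = $node !== null ? $node->nodeValue : null;
echo $fakta, "\n";
```

XPath is used here only to avoid the explicit loop; the getElementsByTagName loop from the answer works just as well once the encoding is declared.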

To extract the title you should use a parser in the same way.

$doc = new DOMDocument();
$doc->loadHTML('<title>Test Stuff</title>');
$title = $doc->getElementsByTagName('title')->item(0)->nodeValue;
echo $title;

As far as I know there should be only one title per page. If that is not the case, drop ->item(0)->nodeValue and loop over the returned node list.

PHP demo: https://eval.in/502432

Answer 2

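You can use \X to match any Unicode character as a whole grapheme (much as the dot does for ANSI characters). The page below also covers matching a specific code point, a range of code points, or a Unicode category: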

http://www.regular-expressions.info/unicode.html

To answer your question, I would say that replacing (.*?) with (\X*?) should be enough.

Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, and Ruby 2.0: simply use \X. You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
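The two differences described in that quote can be checked with a short sketch. The sample strings are made up for illustration, the patterns use the u modifier so the subject is read as UTF-8, and the "\u{...}" escape assumes PHP 7 or later.

```php
<?php
// "A" followed by U+030A COMBINING RING ABOVE: two code points, one grapheme.
$decomposed = "A\u{030A}";

// Under /u the dot matches exactly one code point, so it cannot cover both.
$dotGrapheme = preg_match('/^.\z/u', $decomposed);
// \X matches the whole grapheme cluster.
$xGrapheme = preg_match('/^\X\z/u', $decomposed);

// The dot does not match a line break unless the s modifier is set; \X does.
$dotNewline = preg_match('/^.\z/u', "\n");
$xNewline   = preg_match('/^\X\z/u', "\n");

var_dump($dotGrapheme, $xGrapheme, $dotNewline, $xNewline);
```

Since the pattern in the question already uses the s modifier, the line-break difference does not matter there; the grapheme difference only shows up when the text contains combining marks.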
