Regex pattern works on string but not on loaded file content

Question

I want to extract the words between ";" and ":" in an XML file, for example the word "Index" here:

bla bla bla ; Index : bla bla

The file is loaded from its URL using file_get_contents:

$output = file_get_contents("https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Exporter/Base_de_donn%C3%A9es");

preg_match_all('/\;.[a-zA-Z]+.\:/', $output, $matches, PREG_SET_ORDER, 0);
var_dump($matches);

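The regex pattern works fine on the same file content in regex101, and also when I copy the text into a string variable. But the code above doesn't work: it returns only the last match.

What am I doing wrong?

PS: I also tried loading the XML file with DOMDocument. Same result.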

Answer 1

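Here is an approach with a small memory footprint. A few observations first:

The file is big (not huge, but big).
The fact that you are dealing with an XML file isn't very important here, because the text you are looking for follows its own line-based format (defined by the XWiki format standard), independent of the XML format. However, if you absolutely want to use an XML parser here to extract the content of the text tag, I suggest using XMLReader instead of DOMDocument.
The lines you are looking for are always single lines that start with ; (no indentation) and are always immediately followed by a line that starts with :.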
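
If you do want to go the XML parser route, a minimal XMLReader sketch could look like the following. The inline $xml sample and the helper name extractTextNodes are made up for illustration, and the sample only assumes that the wiki markup sits inside a text element; for the real export you would presumably open the URL with $reader->open($url) instead of $reader->XML($xml).

```php
<?php
// Hypothetical sample standing in for the exported file (not the real export).
$xml = <<<XML
<mediawiki>
  <page>
    <title>Demo</title>
    <revision>
      <text>intro line
; Index
: definition of index
</text>
    </revision>
  </page>
</mediawiki>
XML;

// Hypothetical helper: returns the string content of every <text> element.
// XMLReader walks the document node by node instead of building a full tree.
function extractTextNodes(string $xml): array {
    $reader = new XMLReader();
    $reader->XML($xml);
    $texts = [];
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'text') {
            $texts[] = $reader->readString();
        }
    }
    $reader->close();
    return $texts;
}

$texts = extractTextNodes($xml);
echo $texts[0];
```

The extracted string can then be split into lines and filtered with the same ; / : rule described above.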

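Once you have seen this layout in the page source (right-click, view source), you can choose to read the file line by line (instead of loading the whole file with file_get_contents) and use a generator function to select the interesting lines: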

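$url = 'https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Exporter/Base_de_donn%C3%A9es';

$handle = fopen($url, 'rb');
if ($handle === false) {
    exit("Unable to open $url" . PHP_EOL);
}

// Yield each line starting with ';' that is immediately followed by a line starting with ':'.
function filterLines($handle) {
    $temp = false; // candidate ';' line waiting for its ':' line
    // fgets() returns false at end of file, so no separate feof() check is needed.
    while (($line = fgets($handle)) !== false) {
        if ($line[0] === ';') {
            $temp = $line;
            continue;
        }
        if ($line[0] === ':' && $temp !== false) {
            yield $temp;
        }
        $temp = false;
    }
}

foreach (filterLines($handle) as $line) {
    // Extract runs of Latin-script words separated by single spaces.
    if (preg_match_all('~\b\p{Latin}+(?: \p{Latin}+)*\b~u', $line, $matches)) {
        echo implode(', ', $matches[0]), PHP_EOL;
    }
}

fclose($handle);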
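
To try the filtering logic without hitting the network, here is a small self-contained sketch that feeds a few made-up lines through an in-memory stream. The sample text and the function name filterDefinitionTerms are assumptions for illustration only; the rule is the same one described above (a line starting with ; that is immediately followed by a line starting with :).

```php
<?php
// Made-up sample: only "; Index" and "; Clé primaire" are followed by a ':' line.
$sample = "bla bla bla\n; Index\n: bla bla\n; Orphan term\nplain line\n"
        . "; Clé primaire\n: autre définition\n";

// Write the sample to an in-memory stream so fgets() behaves as it would on a file or URL handle.
$handle = fopen('php://memory', 'w+b');
fwrite($handle, $sample);
rewind($handle);

// Hypothetical name; yields ';' lines that are directly followed by a ':' line.
function filterDefinitionTerms($handle) {
    $pending = false;
    while (($line = fgets($handle)) !== false) {
        if ($line[0] === ';') {
            $pending = $line;
            continue;
        }
        if ($line[0] === ':' && $pending !== false) {
            yield $pending;
        }
        $pending = false;
    }
}

$terms = [];
foreach (filterDefinitionTerms($handle) as $line) {
    if (preg_match_all('~\b\p{Latin}+(?: \p{Latin}+)*\b~u', $line, $matches)) {
        $terms[] = implode(', ', $matches[0]);
    }
}
fclose($handle);

print_r($terms); // contains "Index" and "Clé primaire"; "Orphan term" is skipped
```

Because the generator keeps only one pending line at a time, memory use stays flat regardless of how large the input stream is.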

Tags: php, regex, string, file-get-contents, preg-match