PHP - How to extract blocks from a text file by reading it line-by-line

I have an input text file like the following:

BEGIN
#1 
#2 
#3 
#4 
#5 
#6 
1       2015-05-31  2001-11-24  'Name Surname'      ID_1        0 
2       2011-04-01  ?           ?                   ID_2        1 
2       2013-02-24  ?           ?                   ID_3        1 
2       2014-02-28  ?           'Name Surname'      ID_4        2 
END
#7      'value 1'
#8      'value 2'
#9      'value 3'
#10     'value 4'
END

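When there is a BEGIN in the text file, a loop starts there: each line starting with # is a key, and its values are the corresponding column of every following line up to END, producing an array like this: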

Array ( [#1] => Array ( [0] => 1 [1] => 2 [2] => 2 [3] => 2 ) [#2] => Array ( [0] => 2015-05-31 [1] => 2011-04-01 [2] => 2013-02-24 [3] => 2014-02-28 ) [#3] => Array ( [0] => 2001-11-24 [1] => ? [2] => ? [3] => ? ) [#4] => Array ( [0] => 'Name Surname' [1] => ? [2] => ? [3] => 'Name Surname' ) [#5] => Array ( [0] => ID_1 [1] => ID_2 [2] => ID_3 [3] => ID_4 ) [#6] => Array ( [0] => 0 [1] => 1 [2] => 1 [3] => 2 ) )

Otherwise, if a line starting with # is found outside a BEGIN block, its value is the text between single quotes, producing an array like this:

Array ( [#7] => 'value 1' [#8] => 'value 2' [#9] => 'value 3' [#10] => 'value 4' )

This is what I want to get. My current code is the following:

<?php
    $time = microtime();
    $time = explode(' ', $time);
    $time = $time[1] + $time[0];
    $start = $time;

    ini_set("max_execution_time", 300); // 300 seconds = 5 minutes
    ini_set("pcre.backtrack_limit", "100000000"); // default 100k = "100000"
    ini_set("memory_limit", "1024M");

    $txt_path = "./test_2.txt";
    $txt_data = @file_get_contents($txt_path) or die("Could not access file: $txt_path");
    //echo $txt_data;

    /* BEGIN ARRAY FOR LOOP ENTRIES */

    $loop_pattern = "/BEGIN(.*?)END/s";
    preg_match_all($loop_pattern, $txt_data, $matches);
    $loops = $matches[0];
    $loops_count = count($loops);
    //echo("<br><br>".$loops_count."<br><br>");

    foreach ($loops as $key => $value) {
        $value = trim($value);
        $pattern = array("/BEGIN(.*?)/", "/END(.*?)/", "/[[:blank:]]+/");
        $replacement = array("", "", " ");
        $value = preg_replace($pattern, $replacement, $value);
        //echo $value."<br><br>";

        preg_match_all( '/^#\d+/m', $value, $matches );
        $keys = $matches[0];
        //print_r($keys);
        //echo "<br><br>";

        $value = preg_replace( '/^#\d+\s*/m', '', $value );

        $value = str_replace( "\n", " ", $value );

        $pattern = '/'.str_repeat( "('[^']+'|\S+)\s+", count( $keys ) ).'/';

        preg_match_all( $pattern, $value, $matches );
        //print_r($matches);
        //echo "<br><br>";

        $loop_dic = array_combine( $keys, array_slice( $matches, 1 ) );

        print_r( $loop_dic );
        echo("<br><br>");
    }

    /* END ARRAY FOR LOOP ENTRIES */

    /* BEGIN ARRAY FOR NO LOOP ENTRIES */

    $txt_data_without_loops = preg_replace( "/BEGIN(.*?)END/s", "", $txt_data );
    //echo $txt_data_without_loops;

    $pattern = array("/END(.*?)/", "/[[:blank:]]+/");
    $replacement = array("", " ");
    $txt_data_without_loops_clean = preg_replace($pattern, $replacement, $txt_data_without_loops);
    //echo $txt_data_without_loops_clean;
    preg_match_all( '/^#(.*?)\S+/m', $txt_data_without_loops_clean, $matches );
    $keys = $matches[0];
    //print_r($keys);
    $txt_data_without_loops_clean = preg_replace( '/^#(.*?)\S+\s*/m', '', $txt_data_without_loops_clean );
    //print_r($txt_data_without_loops_clean);

    $txt_data_without_loops_clean_no_newline = str_replace( "\n", " ", $txt_data_without_loops_clean );
    //print_r($txt_data_without_loops_clean_no_newline);
    $pattern = '/'.str_repeat( "('[^']+'|\S+)\s+", 1 ).'/';
    preg_match_all( $pattern, $txt_data_without_loops_clean_no_newline, $matches );
    //print_r( $matches[0] );

    $no_loop_dic = array_combine( $keys, $matches[0] );
    print_r( $no_loop_dic );
    echo("<br><br>");

    /* END ARRAY FOR NO LOOP ENTRIES */

    $time = microtime();
    $time = explode(' ', $time);
    $time = $time[1] + $time[0];
    $finish = $time;
    $total_time = round(($finish - $start), 4);
    echo '<br><br><b>Page generated in '.$total_time.' seconds.</b><br><br>';
?>

As a first approach, to get the BEGIN-END loops and the related arrays, I read the whole input file:

$txt_path = "./input.txt";
$txt_data = @file_get_contents($txt_path) or die("<b>Could not access file: $txt_path</b><br><br>");

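This works for small files, but with large input files the browser becomes unresponsive for a while (I'm testing on Firefox), probably because parsing the whole file saturates the RAM (my laptop has 3 GB of RAM).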

I tried the following settings in the PHP file:

ini_set("max_execution_time", 300); // 300 seconds = 5 minutes
ini_set("pcre.backtrack_limit", "100000000"); // default 100k = "100000"
ini_set("memory_limit", "1024M");

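This seems to solve the problem for files that are not too large, while for large files the process completes without errors only when few other resources are in use at the same time... so this is not the best solution.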

Searching the web, I found this page, where I read:

If you're reading files, read them line-by-line instead of reading in the complete file into memory. Look at fgets and SplFileObject::fgets.
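The advice quoted above can be sketched roughly as follows. This is only a minimal illustration: read_lines() is a hypothetical helper name, not something from the linked page, and it assumes lines end in \n or \r\n.

```php
<?php
// Yield the lines of a file one at a time, so that only the current
// line is held in memory instead of the whole file.
function read_lines(string $path): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Could not access file: $path");
    }
    try {
        // fgets() returns false at end of file, which ends the loop.
        while (($line = fgets($handle)) !== false) {
            yield rtrim($line, "\r\n");
        }
    } finally {
        fclose($handle);
    }
}
```

A caller would simply write foreach (read_lines("./test_2.txt") as $line) { ... } and handle each line as it arrives.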

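So I decided to use fgets to read and parse the whole input file. After building an array of all the lines, I need to extract each loop from it and add it to a loops_array, while adding the other no_loop key-value pairs to another array.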

My idea, which seems fast, is to find the index of each BEGIN, like this:

$txt_path = "./test.txt";
$txt_data = @fopen($txt_path, "rb") or die("<b>Could not access file: $txt_path</b><br/><br/>");

$lines = array();
while ( !feof($txt_data) ) {
    $line = fgets($txt_data, 1024);
    //echo($line."<br/><br/>");
    array_push($lines, trim($line));
}

$lines = array_filter($lines, 'strlen'); // drop empty lines only; plain array_filter() would also drop a line that is just "0"
//print_r($lines);
//echo("<br/><br/>");

$begins = array_keys($lines, "BEGIN");
//echo("<b>Begins:</b><br/><br/>");
//print_r($begins);
//echo("<br/><br/>");

But now I need to find the index of the first END after each element of the $begins array... If I do:

$ends = array_keys($lines, "END");
//echo("<b>Ends:</b><br/><br/>");
//print_r($ends);
//echo("<br/><br/>");

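it also picks up the END strings in the no_loop area of the input file, whereas I need only the first END after each BEGIN, so that I can then combine them with: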

$begins_ends = array_combine($begins, $ends);

and use array_slice to extract all the loops, finally adding each $loop to a new array $loops, like this:

$i = 0;
$loops = array();
foreach ($begins_ends as $key => $value) {
    $begin = trim($key);
    $end = trim($value);
    $loop = array_slice( $lines, $begin, ($end - $begin), false );
    $this_loop = array();
    for ($el=$begin; $el < $end+1; $el++) {
        array_push($this_loop, $lines[$el]);
        unset($lines[$el]);
    }
    array_push($loops, $this_loop);
    $loop = array_values($lines);
    //echo("<b>Loops Dictionary $i:</b><br/><br/>");
    //print_r($loop);
    //echo("<br/><br/>");
    $i++;
}
//print_r($loops);
//echo("<br/><br/>");
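One possible way to pair each BEGIN with the first END that follows it is to walk the lines once and remember whether a block is currently open. This is a sketch under assumptions: pair_begin_end() is a hypothetical name, and $lines is assumed to hold trimmed lines, as in the code above. Because it uses the array's own keys, it also works after array_filter() has left gaps in the indices.

```php
<?php
// Return [beginIndex => endIndex] pairs, matching each BEGIN with the
// first END that follows it; END lines outside a block are ignored.
function pair_begin_end(array $lines): array
{
    $pairs = [];
    $open = null; // index of the currently open BEGIN, or null if none
    foreach ($lines as $i => $line) {
        if ($line === 'BEGIN') {
            $open = $i;
        } elseif ($line === 'END' && $open !== null) {
            $pairs[$open] = $i;
            $open = null;
        }
    }
    return $pairs;
}

$sample = ['BEGIN', '#1', '1', 'END', "#7 'value 1'", 'END', 'BEGIN', '#2', '2', 'END'];
print_r(pair_begin_end($sample)); // pairs 0 => 3 and 6 => 9; the END at index 5 is skipped
```

The result has the same shape as $begins_ends, so it could be fed to the foreach loop above in place of the array_combine() call.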

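The problem is to get the correct $ends array, one that ignores the END strings in the no_loop area of the input file, and to obtain the same output as before: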

Array ( [#1] => Array ( [0] => 1 [1] => 2 [2] => 2 [3] => 2 ) [#2] => Array ( [0] => 2015-05-31 [1] => 2011-04-01 [2] => 2013-02-24 [3] => 2014-02-28 ) [#3] => Array ( [0] => 2001-11-24 [1] => ? [2] => ? [3] => ? ) [#4] => Array ( [0] => 'Name Surname' [1] => ? [2] => ? [3] => 'Name Surname' ) [#5] => Array ( [0] => ID_1 [1] => ID_2 [2] => ID_3 [3] => ID_4 ) [#6] => Array ( [0] => 0 [1] => 1 [2] => 1 [3] => 2 ) )

Array ( [#7] => 'value 1' [#8] => 'value 2' [#9] => 'value 3' [#10] => 'value 4' )

in the fastest way and with the lowest memory footprint, so that large files no longer make the browser unresponsive.

Thanks

Simply put, there is no need to use fgets(); fread() will do. The source of this information is here!

As you can see there, file() is very similar to the previously used file_get_contents(), so there should be no difference between the two.

The previous working code can be adapted in this simple way:

  • Contents of the test_2.txt file:

BEGIN
#1 
#2 
#3 
#4 
#5 
#6 
1       2015-05-31  2001-11-24  'Name Surname'      ID_1        0 
2       2011-04-01  ?           ?                   ID_2        1 
2       2013-02-24  ?           ?                   ID_3        1 
2       2014-02-28  ?           'Name Surname'      ID_4        2 
END
#7      'value 1'
#8      'value 2'
#9      'value 3'
#10     'value 4'
END
BEGIN
#11 
#12 
#13 
#14 
#15 
#16 
1       2015-05-31  2001-11-24  'Name Surname'      ID_5        0 
2       2011-04-01  ?           ?                   ID_6        1 
2       2013-02-24  ?           ?                   ID_7        1 
2       2014-02-28  ?           'Name Surname'      ID_8        2 
END
BEGIN
#17 
#18 
#19 
#20 
#21 
#22 
1       2015-05-31  2001-11-24  'Name Surname'      ID_9        0 
2       2011-04-01  ?           ?                   ID_10        1 
2       2013-02-24  ?           ?                   ID_11        1 
2       2014-02-28  ?           'Name Surname'      ID_12        2 
END
  • PHP code:

<?php
$time = microtime();
$time = explode(" ", $time);
$time = $time[1] + $time[0];
$start = $time;

$filename = "./test_2.txt";
$handle = fopen($filename, "rb") or die("<b>Could not access file: $filename</b><br/><br/>");
$contents = fread($handle, filesize($filename));
fclose($handle);

//echo($contents."<br><br>");

$loop_pattern = "/BEGIN(.*?)END/s";
preg_match_all($loop_pattern, $contents, $matches);
$loops = $matches[0];
//print_r($loops);
//echo("<br><br>");
$loops_count = count($loops);
//print_r($loops_count);
//echo "<br><br>";

foreach ($loops as $key => $value) {
    $value = trim($value);
    //echo($value."<br><br>");
    $pattern = array("/[[:blank:]]+/", "/BEGIN(.*)/", "/END(.*)/");
    $replacement = array(" ", "", "");
    $value = preg_replace($pattern, $replacement, $value);
    //echo($value."<br><br>");

    preg_match_all( '/^#\d+/m', $value, $matches );
    $keys = $matches[0];
    //print_r($keys);
    //echo "<br><br>";

    $value = preg_replace( '/^#\d+\s*/m', '', $value );

    $value = str_replace( "\n", " ", $value );

    $pattern = '/'.str_repeat( "('[^']+'|\S+)\s+", count( $keys ) ).'/';
    preg_match_all( $pattern, $value, $matches );
    //print_r($matches);
    //echo "<br><br>";

    $values = array_combine( $keys, array_slice( $matches, 1, count( $keys ), false ) );
    print_r( $values );
    echo "<br><br>";
}

$time = microtime();
$time = explode(" ", $time);
$time = $time[1] + $time[0];
$finish = $time;
$total_time = round(($finish - $start), 4);
echo("<br/><br/><b>Page generated in ".$total_time." seconds.</b><br/><br/>");
?>

I also removed the @, writing:

fopen($filename, "rb") or die("<b>Could not access file: $filename</b><br/><br/>");

instead of the previous:

@fopen($txt_path, "rb") or die("<b>Could not access file: $txt_path</b><br/><br/>");

as suggested here.
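For comparison, if reading the whole file at once ever becomes the bottleneck, the same two arrays could be built in a single pass with fgets(), keeping only the current line and the results in memory. The sketch below is a minimal illustration, not a drop-in replacement: the names parse_blocks() and split_columns() are made up for this example, and it assumes that BEGIN and END always stand alone on their lines and that quoted values never contain a single quote.

```php
<?php
// Split a line into columns; a single-quoted value counts as one column.
function split_columns(string $line): array
{
    preg_match_all("/'[^']*'|\S+/", $line, $m);
    return $m[0];
}

// Parse the file in one pass. Returns [$loops, $pairs]:
//   $loops: one array per BEGIN...END block, key => list of column values
//   $pairs: key => value entries found outside any block
function parse_blocks(string $path): array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Could not access file: $path");
    }
    $loops = [];
    $pairs = [];
    $block = [];
    $keys = null; // keys of the open block, or null when outside a block

    while (($line = fgets($handle)) !== false) {
        $line = trim($line);
        if ($line === '') {
            continue;                      // skip blank lines
        }
        if ($line === 'BEGIN') {
            $keys = [];
            $block = [];
        } elseif ($line === 'END') {
            if ($keys !== null) {          // an END outside a block is ignored
                $loops[] = $block;
                $keys = null;
            }
        } elseif ($line[0] === '#') {
            $cols = split_columns($line);
            if ($keys !== null) {          // header line inside a block
                $keys[] = $cols[0];
                $block[$cols[0]] = [];
            } elseif (isset($cols[1])) {   // "#key 'value'" outside a block
                $pairs[$cols[0]] = $cols[1];
            }
        } elseif ($keys !== null) {        // data row: one value per key
            foreach (split_columns($line) as $i => $col) {
                if (isset($keys[$i])) {
                    $block[$keys[$i]][] = $col;
                }
            }
        }
    }
    fclose($handle);
    return [$loops, $pairs];
}
```

With the test_2.txt shown above, list($loops, $pairs) = parse_blocks("./test_2.txt"); should give three entries in $loops and the #7 to #10 entries in $pairs, with the single quotes kept in the values as in the desired output.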


EDIT 1

Another approach is the following:

$txt_path = "./test_2.txt";
$handle = new SplFileObject($txt_path);

// Loop until we reach the end of the file.
$lines_array = array();
while ( !$handle->eof() ) {
    $line = $handle->fgets();
    //echo($line."<br/><br/>"); // Echo one line from the file.
    array_push($lines_array, trim($line));
}

// Unset the file to call __destruct(), closing the file handle.
$handle = null;

$lines_array = array_filter($lines_array, 'strlen'); // drop empty lines only; plain array_filter() would also drop a line that is just "0"
//print_r($lines_array);
//echo("<br/><br/>");

$lines_joined = implode("\n", $lines_array);
//echo($lines_joined."<br/><br/>");