使用 PHP 读取两个文件,数学计算并写出结果的最有效方法

Most efficient method of using PHP to read two files, math calculate, and write out result

我写了一个 PHP 脚本,它可以打开两个非常大的文件 (>1gb),这两个文件都包含 4 列。该脚本对每个文件中每一行的相应值进行计算,并将结果写出到第三个文件。

我的方法非常慢。我正在使用 SplFileObject 读取原始文件并逐行移动内部指针,如下面的代码所述。

然后逐行写出结果。然而,计算和写出都非常慢(即使我禁用写入,脚本也很慢)。我认为我的 file reading/writing 方法效率很低,我希望得到优化提示。

function generate_adjusted($WGFile, $RFile) {

        // File Read objects
        $WGObj = new SplFileObject($WGFile);
        $RObj = new SplFileObject($RFile);

        // File write object
        $adjHandle = fopen("outputfile.txt", 'w+');

        foreach ($WGObj as $line) {
            // Line 0: ID1 (int), 1: ID2 (int), 2: NSNPs (int), 3: Relationship (real)
            $WGline = explode("\t", $WGObj->current());

            // Seek to the same line of second file
            $RObj->seek($WGObj->key());

            $Rline = explode("\t", $RObj->current());
            $A1 = floatval($WGline[2] * $WGline[3]);
            $A2 = floatval($Rline[2] * $Rline[3]);
            $ANSNP = $WGline[2] - $Rline[2];
            $A3 = round(floatval(($A1 - $A2) / $ANSNP), 3);

            // Construct the adjusted line
            $adjLine = $WGline[0] . "\t" . $WGline[1] . "\t" . $ANSNP . "\t" . $A3 . "\r\n";

            fwrite($adjHandle, $adjLine);
        }
        fclose($adjHandle);     
}

generate_adjusted('inputfile1.txt', 'inputfile2.txt');

第一个也是最好的建议:对其进行基准测试!

不要把你得到的任何建议当作确定的事实(即使是我的)。性能会因您的操作系统、硬件和 PHP 版本而异。

以下应该是一种快速的方法,并直接为您提供了一个微基准测试。请测试并告诉我们。

<?php

$start = microtime(true);

function get_file_handle($file, $mode) {
    $h = fopen(__DIR__ . DIRECTORY_SEPARATOR . $file, "{$mode}b");
    if (!$h) {
        trigger_error("Could not read {$file}.", E_USER_ERROR);
    }

    // Make sure nobody else is reading or writing to our file.
    if (flock($h, LOCK_SH | LOCK_EX) === false) {
        trigger_error("Could not acquire lock for {$file}", E_USER_ERROR);
    }

    return $h;
}

// We only want to read and not write.
$input_handle1 = get_file_handle("input1", "r");
$input_handle2 = get_file_handle("input2", "r");

// We only want to write and not read.
$output_handle = get_file_handle("output", "w");

// Read from both files at the same time the next line.
// NOTE: This only works if lines are always corresponding in both files.
while (($buffer1 = fgets($input_handle1)) !== false && ($buffer2 = fgets($input_handle2)) !== false) {
    $buffer1 = explode("\t", $buffer1);
    $buffer2 = explode("\t", $buffer2);

    // Forget floatval, let PHP do its dynamic casting.
    // NOTE: If precision is important use e.g. bcmath!
    $a1 = $buffer1[2] * $buffer1[3];
    $a2 = $buffer2[2] * $buffer2[3];
    $ansnp = $buffer1[2] - $buffer2[2];
    $a3 = round(($a1 - $a2) / $ansnp, 3);

    if (fwrite($output_handle, "{$buffer1[0]}\t{$buffer1[1]}\t{$ansnp}\t{$a3}\r\n") === false) {
        trigger_error("Could not write result to output file.", E_USER_ERROR);
    }
}

// Release locks on and close all file handles.
foreach (array($input_handle1, $input_handle2, $output_handle) as $delta => $handle) {
    if (flock($handle, LOCK_UN) === false) {
        trigger_error("Could not release lock!", E_USER_ERROR);
    }
    if (fclose($handle) === false) {
        trigger_error("Could not close file handle!", E_USER_ERROR);
    }
}

echo "Finished processing after " , (microtime(true) - $start) , PHP_EOL;

当然这也可以以面向对象的方式完成,也可以有例外等。

行缓冲

// Determines how many lines to buffer between each calculation/write.
$lines_to_buffer = 1000;

while (!feof($input_handle1) && !feof($input_handle2)) {
    $c1 = $c2 = 0;

    // Read lines from first handle, then read files from second handle.
    // NOTE: Reading multiple lines from the same file in a row allows us to make best use of the hard disk, if it isn't
    // an SSD, since we consecutively read from the same location which yields minimum seeks. But also keep in mind that
    // this might not be true if multiple processes are running in parallel, since they might read from different files
    // at the same time.
    foreach (array(1 => $input_handle1, 2 => $input_handle2) as $i => $handle) {
        while (($line = fgets($handle)) !== false) {
            ${"buffer{$i}"}[] = explode("\t", $line);

            // Break if we read enough lines.
            if (++${"c{$i}"} === $lines_to_buffer) {
                break;
            }
        }
    }

    // Validate?
    if ($c1 !== $c2) {
        trigger_error("Lines from input files differ, aborting.", E_USER_ERROR);
    }

    for ($i = 0; $i < $lines_to_buffer; ++$i) {
        $a1 = $buffer1[$i][2] * $buffer1[$i][3];
        $a2 = $buffer2[$i][2] * $buffer2[$i][3];
        $ansnp = $buffer1[$i][2] - $buffer2[$i][2];
        $a3 = round(($a1 - $a2) / $ansnp, 3);
        $result .= "{$buffer1[0]}\t{$buffer1[1]}\t{$ansnp}\t{$a3}\r\n";
    }
    fwrite($output_handle, $result);

    // Reset
    $result = $buffer1 = $buffer2 = null;
}