Extract information from text file using PHP
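Question:
Extract information from a text file using PHP, based on the following structure:
- Date (in the format YYYY-MM-DD)
- Title
- Text: Value
- Text: Value
- Text: Value
Input: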
2015-03-18
Store A
Text 1: 5,00 USD
Text 2: 2015-03-18
Text 3: 2015-03-12
Store B
Text 1: 10,00 USD
Text 2: 2015-03-18
Text 3: 2015-03-12
Store C
Text 1: 15,00 USD
Text 2: 2015-03-18
Text 3: 2015-03-12
2015-03-19
Store D
Text 1: 20,00 USD
Text 2: 2015-03-18
Text 3: 2015-03-12
PHP code (so far):
<?php
// Creates array to store data from textfile
$data = array();
// Opens text file
$text_file = fopen('data.txt', 'r');
// Loops through each line
while ($line = fgets($text_file))
{
    // Checks whether line is a date
    if (preg_match("/^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1])$/", trim($line)))
    {
        $data[$line] = array();
    }
    else
    {
        $data[] = trim($line);
    }
}
// Removes first array key
$data = array_slice($data, 1);
// Prints out full array
echo "<xmp>" . print_r($data, true) . "</xmp>";
?>
HTML code:
<table border="1">
<tr>
<th>Date</th>
<th>Store</th>
<th>Text 1</th>
<th>Text 2</th>
<th>Text 3</th>
</tr>
<tr>
<td>2015-03-18</td>
<td>Store A</td>
<td>5,00 USD</td>
<td>2015-03-18</td>
<td>2015-03-12</td>
</tr>
<tr>
<td></td>
<td>Store B</td>
<td>10,00 USD</td>
<td>2015-03-18</td>
<td>2015-03-12</td>
</tr>
<tr>
<td></td>
<td>Store C</td>
<td>15,00 USD</td>
<td>2015-03-18</td>
<td>2015-03-12</td>
</tr>
<tr>
<td>2015-03-19</td>
<td>Store D</td>
<td>20,00 USD</td>
<td>2015-03-18</td>
<td>2015-03-12</td>
</tr>
</table>
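Expected output: as rendered by the HTML table above.
Questions:
- What is a suitable way to extract and store the different values?
- What is a proper way to print the information as in the example output?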
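Answer:
I am interested in the 'groups' of records in the source file:
- Date group - indicated by a line containing only a date
- Store group - consisting of:
  - store name
  - price
  - a group of dates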
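To make the grouping concrete, here is a minimal sketch of one way those groups could be stored as a nested array keyed by date. The function name `parseGroups` and the array layout are assumptions made for illustration; they are not part of the code further down, which prints rows directly instead of storing them.

```php
<?php
/**
 * Hypothetical helper: turn the file's lines into
 * [date => [ ['store' => name, 'values' => [...]], ... ]].
 */
function parseGroups(array $lines): array
{
    $groups = [];
    $date = null;
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }
        if (preg_match('/^\d{4}-\d{2}-\d{2}$/', $line)) {
            // a line holding only a date starts a new date group
            $date = $line;
            $groups[$date] = [];
        } elseif ($date !== null && preg_match('/^text\b[^:]*:(.*)$/i', $line, $m)) {
            // "Text n: value" belongs to the most recent store, if any
            $last = count($groups[$date]) - 1;
            if ($last >= 0) {
                $groups[$date][$last]['values'][] = trim($m[1]);
            }
        } elseif ($date !== null) {
            // any other line is taken to be a store name
            $groups[$date][] = ['store' => $line, 'values' => []];
        }
    }
    return $groups;
}

$groups = parseGroups([
    '2015-03-18',
    'Store A',
    'Text 1: 5,00 USD',
    'Text 2: 2015-03-18',
    'Text 3: 2015-03-12',
    '2015-03-19',
    'Store D',
    'Text 1: 20,00 USD',
]);
print_r($groups);
```

Storing the data this way separates parsing from printing, at the cost of holding the whole file in memory.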
Added requirement: only print store groups dated on the current date or later. I refer to this as the 'cutoff_date' in the code.
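The cutoff test itself can be a plain string comparison, because 'Y-m-d' strings sort in the same order as the dates they represent. A small sketch, where `isOnOrAfterCutoff` is a hypothetical helper name and the cutoff is fixed for the example rather than taken from today's date:

```php
<?php
// Hypothetical helper: true when the group date is on or after the cutoff.
function isOnOrAfterCutoff(string $groupDate, string $cutoffDate): bool
{
    // Both arguments are assumed to be zero-padded 'Y-m-d' strings,
    // so comparing them as strings orders them as dates.
    return strcmp($groupDate, $cutoffDate) >= 0;
}

$cutoffDate = '2015-03-19'; // fixed here for illustration only

var_dump(isOnOrAfterCutoff('2015-03-18', $cutoffDate)); // bool(false)
var_dump(isOnOrAfterCutoff('2015-03-19', $cutoffDate)); // bool(true)
var_dump(isOnOrAfterCutoff('2015-04-01', $cutoffDate)); // bool(true)
```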
I use the 'read-ahead' technique, so there is always a record in hand to process.
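Stripped of the HTML output, the read-ahead pattern looks roughly like the sketch below. It only counts the store lines under each date. The in-memory sample data and the closures `$next` and `$isDate` are assumptions made so the example runs on its own, and it assumes the first line of the data is a date.

```php
<?php
// Hypothetical sample data held in memory so the sketch needs no file.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "2015-03-18\nStore A\nStore B\n2015-03-19\nStore D\n");
rewind($handle);

// True when the line consists of a date and nothing else.
$isDate = function (string $line): bool {
    return preg_match('/^\d{4}-\d{2}-\d{2}$/', $line) === 1;
};

// Returns the next trimmed line, or null at end of file.
$next = function () use ($handle): ?string {
    $line = fgets($handle);
    return $line === false ? null : trim($line);
};

$counts = [];
$current = $next();                 // read-ahead: prime the first record
while ($current !== null) {
    $date = $current;               // the record in hand starts a date group
    $counts[$date] = 0;
    $current = $next();             // read ahead to the first line of the group
    while ($current !== null && !$isDate($current)) {
        $counts[$date]++;           // consume one store line
        $current = $next();         // always leave the next record in hand
    }
}
fclose($handle);

print_r($counts);
```

Because the next record is always already loaded, each loop condition can test it directly, and the end of a group is detected without having to push a line back.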
I provide functions to help 'identify things'. They are used to make the control logic easier to follow.
Code:
<?php
/**
 * We only need to show store entries on or after a certain date.
 * I call this the 'cutoff_date'.
 *
 * It defaults to today's date. The sample file is dated 2015, so set an
 * earlier value (for example '2015-03-18') to get any rows from it.
 */
$now = new DateTime();
$CUTOFF_DATE = $now->format('Y-m-d');

// output stored in here
$outHtml = '<table border="1">
<tr>
<th>Date</th>
<th>Store</th>
<th>Text 1</th>
<th>Text 2</th>
<th>Text 3</th>
</tr>';

// source - we use 'read-ahead' as it makes life easier
$sourceFile = fopen(__DIR__ . '/Q29121286.txt', 'rb');
if ($sourceFile === false) {
    exit('Cannot open the source file');
}
$currentLine = readNextLine($sourceFile); // read-ahead

while (!empty($currentLine)) { // process until eof...
    // start of a date group...
    $currentGroupDate = $currentLine;
    $currentLine = readNextLine($sourceFile); // read ahead

    // ignore this group if it is earlier than CUTOFF_DATE;
    // 'Y-m-d' strings compare in date order
    while (!empty($currentGroupDate) && $currentGroupDate < $CUTOFF_DATE) { // find next date_group record
        while (!empty($currentLine) && datePosition($currentLine) !== 0) { // read to end of current group
            $currentLine = readNextLine($sourceFile);
        }
        $currentGroupDate = $currentLine;
        $currentLine = readNextLine($sourceFile); // read ahead
    }

    $htmlCurrentDate = $currentGroupDate; // only print the date once

    // display all the rows for this 'date group' -- look for next 'date'
    while (!empty($currentLine) && datePosition($currentLine) !== 0) {
        $html = '<tr>';
        $html .= '<td>'. htmlspecialchars($htmlCurrentDate) .'</td>';
        $htmlCurrentDate = ''; // only display the date once

        $html .= '<td>'. htmlspecialchars($currentLine) .'</td>'; // store
        $currentLine = readNextLine($sourceFile);

        // process the price: keep everything after the first ':'
        $lineParts = explode(':', $currentLine, 2);
        $html .= '<td>'. htmlspecialchars(trim($lineParts[1] ?? '')) .'</td>';
        $currentLine = readNextLine($sourceFile);

        // now process the group of dates - look for a line
        // that starts with 'text' and must contain a date
        while (   !empty($currentLine)
               && isTextLine($currentLine)
               && datePosition($currentLine) >= 1) {
            $lineParts = explode(':', $currentLine, 2); // need the date...
            $html .= '<td>'. htmlspecialchars(trim($lineParts[1] ?? '')) .'</td>';
            $currentLine = readNextLine($sourceFile); // read next
        }

        // end of this group...
        $html .= '</tr>';
        $outHtml .= $html;
    } // end of 'dateGroup'
} // end of data file...

$outHtml .= '</table>';
fclose($sourceFile);

// display output
echo $outHtml;
exit;
/**
 * These routines hide the low-level processing.
 */

/**
 * Return the position of the first date string in the line,
 * or -1 if no date is found.
 *
 * @param string $line
 * @return int
 */
function datePosition($line)
{
    $pos = -1;
    if (preg_match('/\d{4}-\d{2}-\d{2}/', $line, $matches, PREG_OFFSET_CAPTURE) === 1) {
        $pos = $matches[0][1]; // offset of the whole match
    }
    return $pos;
}

/**
 * Return whether the line is a text line, i.e. starts with 'text'
 * (case-insensitive).
 *
 * @param string $text
 * @return bool
 */
function isTextLine($text)
{
    return stripos($text, 'text') === 0;
}

/**
 * Return the trimmed line, or an empty string at eof.
 * Added 'fudge' to not read past the eof: the static flag is shared
 * by all calls, so this only works for a single file per request.
 *
 * @param resource $handle
 * @return string
 */
function readNextLine($handle)
{
    static $isEOF = false;
    if ($isEOF) {
        return '';
    }

    $line = fgets($handle);
    if ($line !== false) {
        $line = trim($line);
    }
    else {
        $isEOF = true;
        $line = '';
    }
    return $line;
}
Output for the supplied file:
| Date | Store | Text 1 | Text 2 | Text 3 |
|------------|---------|-----------|------------|------------|
| 2015-03-18 | Store A | 5,00 USD | 2015-03-18 | 2015-03-12 |
| | Store B | 10,00 USD | 2015-03-18 | 2015-03-12 |
| | Store C | 15,00 USD | 2015-03-18 | 2015-03-12 |
| 2015-03-19 | Store D | 20,00 USD | 2015-03-18 | 2015-03-12 |