Grouping lines with a specific pattern into one line as a csv text file
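I am writing a parser for text data. I am almost done, but the PHP script has to run on a server with PHP 5.3.13, and there is no way to upgrade. So I tried to rewrite the script, but I think I broke it. It does not work at all.
First, here is the source text data I need to parse: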
27 may 15:28 Id: 42 #1 Random Text
Info: 3 Location: Street Guests: 2
(Text header 1) Apple 15
(Text header 2) Milk 2
(Text header 1) Ice cream 4
(Text header 3) Pencil 1
(Text header 1) Box 1
(Text header 2) Cardboard x1
(Text header 3) White x1
(Text header 1) Cube x1
(Text header 1) Phone 1
(Text header 1) Specific text x1
(Text header 1) Symbian x1
Second, here is the desired output, the resulting text file I need:
42 ; 15:28
Apple ; 15 ; NOHANDLE ; NOHANDLE
Milk ; 2 ; NOHANDLE ; NOHANDLE
Ice cream ; 4 ; NOHANDLE ; NOHANDLE
Pencil ; 1 ; NOHANDLE ; NOHANDLE
Box ; 1 ; Cardboard, White, Cube ; NOHANDLE
Phone ; 1 ; Symbian ; Specific text
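NOHANDLE is required because, as you can see, this is a CSV file. For the CSV to work properly, every row needs the same number of columns, so I have to add NOHANDLE whenever there is no "child" string.
Finally, here is the code I am trying to get working: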
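The padding rule described above can be sketched on its own. This is only an illustration: the helper name pad_row and its default width of 4 are assumptions taken from the desired output, not part of the original script. It sticks to syntax that PHP 5.3 accepts.

```php
<?php
// Hypothetical helper (not from the original script): pad a CSV row
// to a fixed number of columns using a filler value.
function pad_row(array $row, $width = 4, $filler = 'NOHANDLE')
{
    // array_pad() appends $filler until the row has $width elements;
    // a row that is already long enough is returned unchanged.
    return array_pad($row, $width, $filler);
}

// Rows modelled on the desired output shown above.
$rows = array(
    pad_row(array('Apple', 15)),
    pad_row(array('Box', 1, 'Cardboard, White, Cube')),
    pad_row(array('Phone', 1, 'Symbian', 'Specific text')),
);

foreach ($rows as $row) {
    echo implode(' ; ', $row), "\n";
}
```

Every row ends up with four columns, so a row with no "child" strings gets two NOHANDLE fillers and a fully populated row gets none.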
<?php
$data = trim(file_get_contents('inbox_file_utf8_clean.txt'));
$all_lines = preg_split("/\r?\n/", $data);
$date_id_line = array_shift($all_lines);
if(!preg_match('/^\d+\s\w+\s(?<time>\d+:\d+)\sId:\s(?<id>\d+).*/', $date_id_line, $matches)) {
    trigger_error('Failed to match ID and timestamp', E_USER_ERROR);
}
$output_data = array(
    'info' => array(
        'id' => $matches['id'],
        'time' => $matches['time']
    ),
    'data' => array()
);
$all_text_headers = array_values(preg_grep('/^\s*\(/', $all_lines));
// The first "Text header" is a parent.
// Count the number of leading whitespaces to determine other parents
preg_match('/^\x20*/', $all_text_headers[0], $leading_space_matches);
$leading_spaces = $leading_space_matches[0];
$num_leading_spaces = strlen($leading_spaces);
$parent_lead = str_repeat(' ', $num_leading_spaces) . '(';
$parent = NULL;
foreach($all_text_headers as $index => $header_line) {
    array($lead, $item_value) = explode( ") ", $header_line);
    array($topic, $topic_count) = array_map('trim',
        preg_split('/\s{2,}/', $item_value, -1, PREG_SPLIT_NO_EMPTY)
    );
    $topic_count = (int) $topic_count;
    if($is_parent = ($parent === NULL || strpos($lead, $parent_lead) === 0)) {
        $parent = $topic;
    }
    // This only goes one level deep
    if($is_parent) {
        $output_data['data'][$parent] = array(
            'values' => array(),
            'count' => $topic_count
        );
    } else {
        $output_data['data'][$parent]['values'][] = $topic;
    }
};
$csv_delimiter = ';';
$handle = fopen('output_file.csv', 'wb');
fputcsv($handle, array_values($output_data['info']), $csv_delimiter);
foreach($output_data['data'] as $key => $values) {
    $row = [
        $key,
        $values['count'],
        implode(', ', $values['values']) ?: 'NOHANDLE',
        'NOHANDLE'
    ];
    fputcsv($handle, $row, $csv_delimiter);
}
fclose($handle);
?>
Now I am stuck. I get this error:
Parse error: syntax error, unexpected '=' in index.php on line 29
You are right, you have to use array() instead of [ ].
And the line that triggers the error,
array($lead, $item_value) = explode( ") ", $header_line);
must be written like this:
list($lead, $item_value) = explode(') ', $header_line);
On the next line you also have to use list().
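To make the difference concrete, here is a minimal sketch of the two constructs on a single line of input. list() unpacks an array into variables, while array() can only build one, and the short [ ] literal was only added in PHP 5.4. The sample line is modelled on the source data; the two spaces before the count are an assumption, since the parser splits on runs of two or more whitespace characters.

```php
<?php
// Sample line modelled on the source data. The double space before
// the count is an assumption so that the \s{2,} split finds it.
$header_line = '(Text header 1) Apple  15';

// list() assigns the elements of the exploded array to variables.
// Writing array(...) on the left-hand side is a parse error.
list($lead, $item_value) = explode(') ', $header_line);
list($topic, $topic_count) = array_map('trim',
    preg_split('/\s{2,}/', $item_value, -1, PREG_SPLIT_NO_EMPTY)
);

// array() is the only array literal syntax available before PHP 5.4.
$row = array($topic, (int) $topic_count);

echo $lead, "\n";
echo implode(' ; ', $row), "\n";
```

The same two substitutions, list() for destructuring and array() for literals, are all the original script needs in order to parse on PHP 5.3.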
Here is the code with all the corrections applied:
<?php
$data = trim(file_get_contents('inbox_file_utf8_clean.txt'));
$all_lines = preg_split("/\r?\n/", $data);
$date_id_line = array_shift($all_lines);
if(!preg_match('/^\d+\s\w+\s(?<time>\d+:\d+)\sId:\s(?<id>\d+).*/', $date_id_line, $matches)) {
    trigger_error('Failed to match ID and timestamp', E_USER_ERROR);
}
$output_data = array(
    'info' => array(
        'id' => $matches['id'],
        'time' => $matches['time']
    ),
    'data' => array()
);
$all_text_headers = array_values(preg_grep('/^\s*\(/', $all_lines));
// The first "Text header" is a parent.
// Count the number of leading whitespaces to determine other parents
preg_match('/^\x20*/', $all_text_headers[0], $leading_space_matches);
$leading_spaces = $leading_space_matches[0];
$num_leading_spaces = strlen($leading_spaces);
$parent_lead = str_repeat(' ', $num_leading_spaces) . '(';
$parent = NULL;
foreach($all_text_headers as $index => $header_line) {
    list($lead, $item_value) = explode(') ', $header_line);
    list($topic, $topic_count) = array_map('trim',
        preg_split('/\s{2,}/', $item_value, -1, PREG_SPLIT_NO_EMPTY)
    );
    $topic_count = (int) $topic_count;
    if($is_parent = ($parent === NULL || strpos($lead, $parent_lead) === 0)) {
        $parent = $topic;
    }
    // This only goes one level deep
    if($is_parent) {
        $output_data['data'][$parent] = array(
            'values' => array(),
            'count' => $topic_count
        );
    } else {
        $output_data['data'][$parent]['values'][] = $topic;
    }
}
$csv_delimiter = ';';
$handle = fopen('output_file.csv', 'wb');
fputcsv($handle, array_values($output_data['info']), $csv_delimiter);
foreach($output_data['data'] as $key => $values) {
    $row = array(
        $key,
        $values['count'],
        implode(', ', $values['values']) ?: 'NOHANDLE',
        'NOHANDLE'
    );
    fputcsv($handle, $row, $csv_delimiter);
}
fclose($handle);
?>