Perl: How to split paragraph keeping last sentences into another array?
I am trying to split the <Description> text by Bit number and put each part into the element for that Bit number. Here is the file I am parsing:
<Register>
<Name>abc</Name>
<Abstract></Abstract>
<Description>Bit 6 random description
Bit 5 msg octet 2
Bit 4-1
Bit 0 msg octet 4
These registers containpart of the Upstream Message.
They should be written only after the cleared by hardware.
</Description>
<Field>
<Name>qwe</Name>
<Description></Description>
<BitFieldOffset>6</BitFieldOffset>
<Size>1</Size>
<AccessMode>Read/Write</AccessMode>
</Field>
<Field>
<Name>qwe</Name>
<Description></Description>
<BitFieldOffset>5</BitFieldOffset>
<Size>1</Size>
<AccessMode>Read/Write</AccessMode>
</Field>
<Field>
....
</Field>
</Register>
<Register>
<Name>xyz</Name>
<Abstract></Abstract>
<Description>Bit 3 msg octet 1
Bit 2 msg octet 2
Bit 1 msg octet 3
Bit 0 msg octet 4
These registers.
They should be written only after the cleared by hardware.
</Description>
<Field>
....
</Field>
<Field>
....
</Field>
</Register>
Expected output:
<Register>
<long_description>
These registers containpart of the Upstream Message.
They should be written only after the cleared by hardware.
</long_description>
<bit_field position="6" width=" 1">
<long_description>
<p> random description</p>
</long_description>
<bit_field position="5" width=" 1">
<long_description>
<p>...</p>
</long_description>
<bit_field position="1" width=" 4">
<long_description>
<p>...</p>
</long_description>
</Register>
<Register>
.
.
.
</Register>
I am using the XML::Twig package to parse this file, but I am stuck on the splitting step.
foreach my $register ( $twig->get_xpath('//Register') ) # get each <Register>
{
my $reg_description= $register->first_child('Description')->text;
.
.
.
foreach my $xml_field ($register->get_xpath('Field'))
{
.
.
my @matched = split ('Bit\s+[0-9]', $reg_description);
.
.
}
}
I don't know how to create the <bit_field> elements accordingly and keep the text that does not belong to any Bit in the <long_description> of the <Register>. Can anyone help?
Edit:
A Bit in <Description> can span multiple lines. For example, in the sample below, the description of Bit 10-9 runs until Bit 8 begins:
<Description>Bit 11 GOOF
Bit 10-9 Clk Selection:
00 : 8 MHz
01 : 4 MHz
10 : 2 MHz
11 : 1 MHz
Bit 8 Clk Enable : 1 = Enable CLK
</Description>
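If I understand you correctly, you can walk through the whole text block line by line.
Use a regular expression to check whether a line matches the pattern for a bit, and capture the relevant parts. Collect the bits one by one in an array of hashes, where each hash stores the details of one bit.
Buffer the lines that don't contain the bit pattern. If a later line contains the bit pattern, the buffer must belong to the most recent bit, so append it there. All remaining lines must be part of the overall description. Note: this cannot tell additional lines of the last bit's description apart from the overall description. If the last bit has such lines, they end up at the start of the overall description. (But you said that does not occur in your data.)
Proof of concept: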
#!/usr/bin/perl
use strict;
use warnings;
my $description_in = 'Bit 6 random description
Bla bla additional line bla bla
bla bla
Bit 5 msg octet 2
Empty line below
Bla bla set to gain instant world domination bla bla
Bit 4-1
Bit 0 msg octet 4
These registers containpart of the Upstream Message.
They should be written only after the cleared by hardware.
Empty line above
Bla bla bla...';
my @bits = ();
my $description_overall = '';
my $line_buffer = '';
foreach my $line (split("\n", $description_in)) {
# if line
# begins with optional white spaces
# followed by "Bit"
# followed by at least one white space
# followed by at least one digit (we capture the digits)
# followed by an optional sequence of optional white spaces, "-", optional white spaces and at least one digit (we capture the digits)
# followed by an optional sequence of at least one white space and any characters (we capture the characters)
# followed by the end of the line
if ($line =~ m/^\s*Bit\s+(\d+)(?:\s*-\s*(\d+))?(?:\s+(.*?))?$/) {
my ($position_begin, $position_end, $description) = ($1, $2, $3);
my $width;
# if there already are bits we've processed
if (scalar(@bits)) {
# the lines possibly buffered belong to the bit before the current one, so append them to its description
$bits[$#bits]->{description} .= (length($bits[$#bits]->{description}) ? "\n" : '') . $line_buffer;
# and reset the line buffer to collect the additional lines of the current bit;
$line_buffer = '';
}
# $position_end is defined only if it was a "Bit n-m"
# otherwise set it to $position_begin
$position_end = defined($position_end) ? $position_end : $position_begin;
$width = abs($position_end - $position_begin) + 1;
# set description to the empty string if not defined (i.e. no description was found)
$description = defined($description) ? $description : '';
# push a ref to a new hash with the keys position, description and width into the list of bits
push(@bits, { position => (sort({$a <=> $b} ($position_begin, $position_end)))[0], # always take the lower position
description => $description,
width => $width });
}
else {
# it's not a bit pattern, so just buffer the line
$line_buffer .= (length($line_buffer) ? "\n" : '') . $line;
}
}
# anything still in the buffer must belong to the overall description
$description_overall .= $line_buffer;
print("<Register>\n <long_description>\n$description_overall\n </long_description>\n");
foreach my $bit (@bits) {
print(" <bit_field position=\"$bit->{position}\" width=\"$bit->{width}\">\n <long_description>\n$bit->{description}\n </long_description>\n </bit_field>\n")
}
print("</Register>\n");
Prints:
<Register>
<long_description>
These registers containpart of the Upstream Message.
They should be written only after the cleared by hardware.
Empty line above
Bla bla bla...
</long_description>
<bit_field position="6" width="1">
<long_description>
random description
Bla bla additional line bla bla
bla bla
</long_description>
</bit_field>
<bit_field position="5" width="1">
<long_description>
msg octet 2
Empty line below
Bla bla set to gain instant world domination bla bla
</long_description>
</bit_field>
<bit_field position="1" width="4">
<long_description>
</long_description>
</bit_field>
<bit_field position="0" width="1">
<long_description>
msg octet 4
</long_description>
</bit_field>
</Register>
I wrote it as a standalone script so that I could test it. You will have to adapt it to your script.
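One possible way to do that adaptation is to wrap the loop in a sub that can be called once per <Register>. The following is only a sketch: the name parse_description and its return shape are made up for illustration, and it assumes that lines appearing before the first bit belong to the overall description.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: parses the text of one
# <Description> and returns (\@bits, $overall_description).
sub parse_description {
    my ($text) = @_;
    my (@bits, @overall, @buffer);

    foreach my $line (split /\n/, $text) {
        if ($line =~ m/^\s*Bit\s+(\d+)(?:\s*-\s*(\d+))?(?:\s+(.*?))?$/) {
            my ($begin, $end, $description) = ($1, $2, $3);
            $end         = $begin unless defined $end;          # plain "Bit n"
            $description = ''     unless defined $description;  # no text on the line

            # Buffered lines continue the previous bit. Assumption: lines seen
            # before any bit belong to the overall description.
            if (@bits) {
                my $previous = $bits[-1];
                $previous->{description} = join "\n",
                    (length $previous->{description} ? $previous->{description} : ()),
                    @buffer;
            }
            else {
                push @overall, @buffer;
            }
            @buffer = ();

            push @bits, {
                position    => ($begin < $end ? $begin : $end),  # lower bit of the range
                width       => abs($end - $begin) + 1,
                description => $description,
            };
        }
        else {
            # not a bit line, so just buffer it
            push @buffer, $line;
        }
    }

    # Whatever follows the last bit line is treated as overall description.
    push @overall, @buffer;
    return (\@bits, join("\n", @overall));
}

# Inside the question's loop this could be called as
#   my ($bits, $overall) = parse_description($register->first_child('Description')->text);
my ($bits, $overall) = parse_description("Bit 6 random description\nBit 4-1\nThese registers.");
printf "position=%s width=%s\n", $_->{position}, $_->{width} for @$bits;
print "$overall\n";
```

Returning plain data keeps the parsing separate from however the <bit_field> elements are built afterwards.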
Perhaps the overall description could be post-processed to get rid of those long runs of whitespace.
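Such post-processing could look roughly like the sketch below. The sub name squeeze_whitespace is made up, and it assumes that line breaks should be kept while runs of spaces and tabs are collapsed and each line is trimmed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collapses horizontal whitespace but keeps line breaks.
sub squeeze_whitespace {
    my ($text) = @_;
    $text =~ s/[ \t]+/ /g;      # runs of spaces/tabs become a single space
    $text =~ s/^ +| +$//mg;     # trim the start and end of every line
    return $text;
}

print squeeze_whitespace("   These registers   contain\n\t  part of the message.  "), "\n";
```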
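At first I tried continued matching (while ($x =~ m/^...$/gc)), but somehow it swallowed the line endings, so that only every other line matched. Lookarounds, meant to keep the line endings out of the actual match, did not work either (Perl said it was not implemented; I guess I have to check the Perl on this machine?), so splitting into lines explicitly is a workaround.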
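The exact pattern used in that attempt is not shown, so the following is only a guess at the cause: \s also matches "\n", so under /m and /g an optional \s+ group can step over a line break and take the next line with it. The sketch below compares the pattern from the script with a variant that uses [ \t] instead; the sample text is taken from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = join "\n",
    'Bit 6 random description',
    'Bit 5 msg octet 2',
    'Bit 4-1',
    'Bit 0 msg octet 4',
    'These registers.';

# \s also matches "\n": after "Bit 4-1" the optional \s+ group can step over
# the line break and capture the following line as the description.
my @with_s;
while ($text =~ m/^\s*Bit\s+(\d+)(?:\s*-\s*(\d+))?(?:\s+(.*?))?$/mg) {
    push @with_s, $1;
}

# [ \t] cannot cross a line break, so every match stays on one line.
my @per_line;
while ($text =~ m/^[ \t]*Bit[ \t]+(\d+)(?:[ \t]*-[ \t]*(\d+))?(?:[ \t]+(.*))?$/mg) {
    push @per_line, $1;
}

print "bits found with \\s:    @with_s\n";
print "bits found with [ \\t]: @per_line\n";
```

On this sample the first loop never reports Bit 0, because that line has already been consumed as the description of Bit 4-1.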
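It could probably be shortened with some grep()s, map()s and the like, but I think the verbose version shows the ideas better, so I did not even look into that.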
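For comparison, a shorter variant might look like the sketch below. It is a different technique from the loop above: it splits the text in front of every bit line with a zero-width lookahead and builds the hashes with map. The name parse_compact is made up, and it uses [ \t] rather than \s so that nothing matches across a line break.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical compact variant; returns (\@bits, $overall_description).
sub parse_compact {
    my ($text) = @_;

    # Split in front of every line that starts a bit entry.
    my @chunks = split /^(?=[ \t]*Bit[ \t]+\d)/m, $text;

    # A first chunk that is not a bit entry is text preceding all bits.
    my @overall = (@chunks && $chunks[0] !~ /^[ \t]*Bit[ \t]+\d/)
                ? split(/\n/, shift @chunks)
                : ();

    # Lines following the last bit line are treated as overall description.
    if (@chunks) {
        my ($head, @tail) = split /\n/, pop @chunks;
        push @chunks,  $head;
        push @overall, @tail;
    }

    my @bits = map {
        my ($head, @extra) = split /\n/, $_;
        my ($begin, $end, $desc) =
            $head =~ /^[ \t]*Bit[ \t]+(\d+)(?:[ \t]*-[ \t]*(\d+))?(?:[ \t]+(.*))?$/;
        $end = $begin unless defined $end;
        +{
            position    => ($begin < $end ? $begin : $end),
            width       => abs($end - $begin) + 1,
            description => join("\n", (defined $desc && length $desc ? $desc : ()), @extra),
        };
    } @chunks;

    return (\@bits, join("\n", @overall));
}

my ($bits, $overall) = parse_compact("Bit 11 GOOF\nBit 10-9 Clk Selection:\n00 : 8 MHz\nBit 8 Clk Enable");
print "$_->{position}/$_->{width}: $_->{description}\n" for @$bits;
```

It is more compact, but the special handling of the last chunk is arguably harder to follow than the explicit buffer.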