PHP preg_split or preg_match sentences but keep punctuation in Array
I want to break a paragraph into sentences and then into 'exploded' strings, but I need to keep the punctuation as elements of the array.
Sample text:
$meta = 'I am looking to break this paragraph into chunks.
I have researched, tried and tested various combinations; however, I cannot
seem to make it work. Would anyone help me figure this out?
I thank you in advance...';
The desired output would be:
Array ( [0] =>
Array ( [0] => I [1] => am [2] => looking [3] => to [4] => break [5] => [6] => this [7] => paragraph [8] => into [9] => chunks [10] => . )
[1] =>
Array ( [0] => I [2] => have [3] => researched [4] => , [5] => tried [......
......] [5] => figure [6] => this [7] => out [8] => ? )
[3] =>
Array ( [0] => I [1] => thank [2] => you [3] => in [4] => advance [5] => ... )
)
I have tried using:
$s = preg_split('/\s*[!?.]\s*/u', $meta, -1, PREG_SPLIT_NO_EMPTY);
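to split the sentences, but in doing so the punctuation disappears.
I would appreciate any help building a two-level array that keeps the punctuation.
You can use preg_match_all to do what you want: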
$meta = 'I am looking to break this paragraph into chunks.
I have researched, tried and tested various combinations; however, I cannot
seem to make it work. Would anyone help me figure this out?
I thank you in advance...';
preg_match_all('/(\w+|[.;?,]+)/', $meta, $m);
print_r($m);
Explanation:
/ : regex delimiter
( : begin group 1
\w+ : 1 or more word characters (alphanumeric or underscore) <=> [a-zA-Z0-9_]
| : OR
[.;?,]+ : 1 or more punctuation characters
) : end of group 1
/ : regex delimiter
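This matches every word and every run of punctuation and stores each one in group 1. Because the group spans the whole pattern, $m[0] (the full matches) and $m[1] (group 1) hold the same values.
If you want it to be Unicode-compatible, you can use \p{L} for any letter and \p{P} for any punctuation character, together with the u modifier so that the pattern and subject are treated as UTF-8:
/(\p{L}+|\p{P}+)/u
Output: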
Array
(
[0] => Array
(
[0] => I
[1] => am
[2] => looking
[3] => to
[4] => break
[5] => this
[6] => paragraph
[7] => into
[8] => chunks
[9] => .
[10] => I
[11] => have
[12] => researched
[13] => ,
[14] => tried
[15] => and
[16] => tested
[17] => various
[18] => combinations
[19] => ;
[20] => however
[21] => ,
[22] => I
[23] => cannot
[24] => seem
[25] => to
[26] => make
[27] => it
[28] => work
[29] => .
[30] => Would
[31] => anyone
[32] => help
[33] => me
[34] => figure
[35] => this
[36] => out
[37] => ?
[38] => I
[39] => thank
[40] => you
[41] => in
[42] => advance
[43] => ...
)
[1] => Array
(
[0] => I
[1] => am
[2] => looking
[3] => to
[4] => break
[5] => this
[6] => paragraph
[7] => into
[8] => chunks
[9] => .
[10] => I
[11] => have
[12] => researched
[13] => ,
[14] => tried
[15] => and
[16] => tested
[17] => various
[18] => combinations
[19] => ;
[20] => however
[21] => ,
[22] => I
[23] => cannot
[24] => seem
[25] => to
[26] => make
[27] => it
[28] => work
[29] => .
[30] => Would
[31] => anyone
[32] => help
[33] => me
[34] => figure
[35] => this
[36] => out
[37] => ?
[38] => I
[39] => thank
[40] => you
[41] => in
[42] => advance
[43] => ...
)
)
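The output above is one flat list of tokens, whereas the question asks for one array per sentence. Below is a minimal sketch of one way to get there, assuming a sentence ends with `.`, `!` or `?` followed by whitespace. The name `tokenize_sentences` is a hypothetical helper, not a PHP built-in, and the pattern is only one reasonable choice.

```php
<?php
// Hypothetical helper: split a paragraph into sentences, then split each
// sentence into word and punctuation tokens.
function tokenize_sentences(string $text): array
{
    // Split on whitespace that follows sentence-ending punctuation.
    // The lookbehind is zero-width, so the punctuation stays with its sentence.
    $sentences = preg_split('/(?<=[.!?])\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $result = [];
    foreach ($sentences as $sentence) {
        // A token is either a run of letters/digits/underscores
        // or a run of punctuation characters.
        preg_match_all('/[\p{L}\p{N}_]+|\p{P}+/u', $sentence, $m);
        $result[] = $m[0]; // no capturing group needed; $m[0] holds the full matches
    }
    return $result;
}

$meta = 'I am looking to break this paragraph into chunks.
I have researched, tried and tested various combinations; however, I cannot
seem to make it work. Would anyone help me figure this out?
I thank you in advance...';

print_r(tokenize_sentences($meta));
```

For the sample text this should give four inner arrays, each ending in its own punctuation token. The split rule is deliberately naive: abbreviations such as "e.g. this" would be cut into separate sentences, and a contraction like "don't" would come out as three tokens.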