Regex to "normalize" usage of SPACE after . , : chars (and some exceptions)
I need to normalize some texts with regard to the . , : symbols (no space before and one space after).
The regex I've come up with is this:
$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $variation['DESCRIPTION']);
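The problem is that this matches four cases that shouldn't be touched:
- Any decimal number, such as 5.5
- Any thousands separator, such as 4,500
- The "fixed" Greek phrase ό,τι
- The ellipsis ...
The ellipsis is basically a special case of its own, which I think should perhaps be handled in a separate preg_replace. What I mean is that the three dots should be treated as one unit, so some text ... should indeed be matched and converted to some text... rather than some text. . .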
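For the numeric exceptions in particular, I know this can be done with some negative lookahead/lookbehind, but unfortunately I couldn't combine them with my current pattern.
This is a fiddle for you to check (the cases that shouldn't be matched are on lines 2, 3, and 4).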
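To make the problem concrete, here is a minimal sketch that runs the pattern above against three of the troublesome inputs. The wrapper name naiveNormalize() and the sample strings are assumptions made for illustration only, and the replacement is assumed to be '$1 ' so that the captured punctuation mark is kept.

```php
<?php
// naiveNormalize() is a hypothetical wrapper around the pattern from the question;
// '$1 ' puts the captured punctuation mark back, followed by one space.
function naiveNormalize(string $text): string
{
    return preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $text);
}

foreach (['Side length 50.5 cm', 'Value 4,500', 'some text ...'] as $sample) {
    // Print each result in brackets so trailing spaces stay visible.
    echo '[', naiveNormalize($sample), ']', PHP_EOL;
}
```

The decimal number and the thousands separator are split apart, and the ellipsis is broken into three separately spaced dots, which are exactly the cases described above.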
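Edit: Both solutions posted below work fine, but they end up adding a space after the final full stop of the description. That's not a big deal, because earlier in my code I strip <br />s and spaces from the beginning and end of the description, so I moved this preg_replace before that one...
So, the final code I ended up using is this: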
$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])(?!(?<=\d.)\d)(?!(?<=ό,)τι)\s*#ui', '$1 ', $variation['DESCRIPTION']);
$variation['DESCRIPTION'] = preg_replace('#^\s*(<br />)*\s*|\s*(<br />)*\s*$#', '', $variation['DESCRIPTION']);
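The order of the two calls matters: the punctuation pass may leave a space after the last full stop, and the trimming pass then removes it. A minimal sketch of that two-step flow follows, assuming '$1 ' as the replacement; the function name normalizeDescription() and the sample string are hypothetical.

```php
<?php
// normalizeDescription() is a hypothetical helper name, not from the original post.
function normalizeDescription(string $description): string
{
    // Step 1: no space before . , : and one space after, skipping a mark
    // that sits between two digits (5.5, 4,500) and the phrase ό,τι.
    $description = preg_replace(
        '#\s*([:,.])(?!(?<=\d.)\d)(?!(?<=ό,)τι)\s*#ui',
        '$1 ',
        $description
    );

    // Step 2: strip whitespace and <br /> tags from both ends, which also
    // removes the space that step 1 adds after a final full stop.
    return preg_replace('#^\s*(<br />)*\s*|\s*(<br />)*\s*$#', '', $description);
}

echo normalizeDescription('Size :50.5 cm ,price 4,500 .'), PHP_EOL;
// Size: 50.5 cm, price 4,500.
```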
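So the only thing left is to change this code so that it handles the ellipsis the way I described above.
Any help with this last requirement would be much appreciated! TIA
You can add two lookaheads that contain lookbehinds: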
\s*(\.{2,}|[:,.](?!(?<=ό,)τι)(?!(?<=\d.)\d))(?!\s*<br\s*/>)\s*
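See the regex demo. Note that I also added \s* to the last lookahead and swapped it with the consuming \s*, so the match fails if <br/> appears after any zero or more whitespaces that follow a :, a , or a . character.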
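Details
\s* - zero or more whitespaces
(\.{2,}|[:,.]) - Group 1: two or more dots, or a :, , or .
(?!(?<=ό,)τι) - fails the match if the next two characters are τι and they are preceded by ό,
(?!(?<=\d.)\d) - fails the match if the next character is a digit that is preceded by a digit plus any one character (note that . is enough here because [:,.] has already matched the allowed/required character; at this point we only need to "skip" that matched character)
(?!\s*<br\s*/>) - a negative lookahead that fails the match if there are zero or more whitespaces, <br, zero or more whitespaces, and /> immediately to the right of the current location
\s* - zero or more whitespaces.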
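As a rough illustration of how the pieces listed above behave together, the pattern can be applied with preg_replace. The sample string is made up, the u flag is assumed so that the Greek letters are read as UTF-8 characters, and '$1 ' is assumed as the replacement.

```php
<?php
// Pattern from the answer above, wrapped in ~ delimiters with the u flag (an assumption).
$pattern = '~\s*(\.{2,}|[:,.](?!(?<=ό,)τι)(?!(?<=\d.)\d))(?!\s*<br\s*/>)\s*~u';

// Made-up sample text, not from the original post.
$text = 'Cardigan :Rib cuffs ,50.5 cm wide ...and more.<br />ό,τι';
$result = preg_replace($pattern, '$1 ', $text);

echo $result, PHP_EOL;
// Cardigan: Rib cuffs, 50.5 cm wide... and more.<br />ό,τι
```

The colon, the comma, and the ellipsis are normalized, while the decimal point in 50.5, the full stop directly before <br />, and the comma inside ό,τι are left alone.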
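If Wiktor's lookaround-heavy pattern is too hard for you to conceptualize/maintain/adapt, then perhaps a match-and-ignore technique will be easier for you. Admittedly, Wiktor's pattern is optimized for performance.
Pattern: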
~                            #starting pattern delimiter
\s*                          #zero or more whitespaces
(?:                          #start non-capturing group #1
  (?:                        #start non-capturing group #2
    \.\d+                    #match float expression not requiring leading digits
    |                        #or
    \d{1,3}(?:,\d{3})+       #match number containing thousands separators
    |                        #or
    ό,τι                     #match literal greek phrase
    |                        #or
    <br\s*/>                 #match html break tag
  )                          #end non-capturing group #2
  (*SKIP)(*FAIL)             #discard anything matched by group #2
  |                          #or
  (                          #start capture group #1
    \.{3}                    #match three dots as ellipsis
    |                        #or
    [:,.]                    #match literal colon, comma, or dot
  )                          #end capture group #1
)                            #end non-capturing group #1
\s*                          #zero or more whitespaces
~                            #ending pattern delimiter
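If you want to extend your pattern with more disqualifying rules, just add another pipe and a subpattern that matches the unwanted substring.
To make sure that three qualifying dots are matched as an ellipsis, match them before checking for a single character.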
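As a sketch of what such an extension might look like, the pattern below adds one more alternative to the skip list. The rule itself (\d{1,2}:\d{2}, meant to protect clock times such as 10:30) is a hypothetical example and not part of the original answer; the sample text is made up, and the u flag and the '$1 ' replacement are assumptions.

```php
<?php
// Hypothetical extra rule: \d{1,2}:\d{2} is added to the skip list so that
// clock times such as 10:30 keep their colon untouched.
$pattern = '~\s*(?:(?:\d{1,2}:\d{2}|\.\d+|\d{1,3}(?:,\d{3})+|ό,τι|<br\s*/>)(*SKIP)(*FAIL)|(\.{3}|[:,.]))\s*~u';

// Made-up sample text, not from the original post.
$sample = 'Opens at 10:30 ,closes at 18:00 .Price :4,500';
$result = preg_replace($pattern, '$1 ', $sample);

echo $result, PHP_EOL;
// Opens at 10:30, closes at 18:00. Price: 4,500
```

Anything matched by an alternative placed before (*SKIP)(*FAIL) is consumed and discarded, so the colon inside each time never reaches the [:,.] branch.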
Code: (Demo)
$text = <<<TEXT
Composition:80% Polyamide, 15% Elastane, 5% Wool.
Side length 50.5 cm <---- THIS SHOULDN'T BE MATCHED
Value 4,500 <---- THIS SHOULDN'T BE MATCHED EITHER
What about ,234,567.89?
Or....1mm one tenth of a millimeter?
ό,τι<---- THIS IS A FIXED PHRASE IN GREEK AND THEREFORE SHOULDN'T BE MATCHED
Comfort and timeless design characterize the Puma Smash V2 made of suede leather. They can be worn all the time ,being a unique choice for those who want to stand out .Made of rubber.<br />- Softfoam floor<br />- Binding with laces
Specs:<br />• Something<br /><br />• Something else<br />• One more
Children's Form Champion<br /><br />Children's set that will give a comfortable feeling for endless hours of play.<br />It consists of a cardigan and trousers ,made of soft fabric and have rib cuffs and legs for a better fit.<br /><br />• Normal fit<br /><br />• Cardigan :Rib cuffs, zippers throughout length, high neck, Champion logo <br /> <br />• Pants: Elastic waist with drawstring, ribbed legs, Champion logo. Don't worry,there'll be ...more!
TEXT;
echo preg_replace(
    '~\s*(?:(?:\.\d+|\d{1,3}(?:,\d{3})+|ό,τι|<br\s*/>)(*SKIP)(*FAIL)|(\.{3}|[:,.]))\s*~',
    '$1 ',
    $text
);
Output:
Composition: 80% Polyamide, 15% Elastane, 5% Wool. Side length 50.5 cm <---- THIS SHOULDN'T BE MATCHED
Value 4,500 <---- THIS SHOULDN'T BE MATCHED EITHER
What about ,234,567.89?
Or... .1mm one tenth of a millimeter?
ό,τι<---- THIS IS A FIXED PHRASE IN GREEK AND THEREFORE SHOULDN'T BE MATCHED
Comfort and timeless design characterize the Puma Smash V2 made of suede leather. They can be worn all the time, being a unique choice for those who want to stand out. Made of rubber. <br />- Softfoam floor<br />- Binding with laces
Specs: <br />• Something<br /><br />• Something else<br />• One more
Children's Form Champion<br /><br />Children's set that will give a comfortable feeling for endless hours of play. <br />It consists of a cardigan and trousers, made of soft fabric and have rib cuffs and legs for a better fit. <br /><br />• Normal fit<br /><br />• Cardigan: Rib cuffs, zippers throughout length, high neck, Champion logo <br /> <br />• Pants: Elastic waist with drawstring, ribbed legs, Champion logo. Don't worry, there'll be ...more!