Regex to "normalize" usage of SPACE after . , : chars (and some exceptions)

I need to normalize the spacing around the . , : characters (no space before and one space after).

The regex I came up with is this:

$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $variation['DESCRIPTION']);

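The problem is that this matches four cases that shouldn't be touched:

For the number exceptions in particular, I know this can be done with some negative lookaheads/lookbehinds, but unfortunately I couldn't combine them into my current pattern.

This is a fiddle for you to check (the cases that shouldn't be matched are on lines 2, 3 and 4).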
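To make the number problem concrete, here is a minimal sketch. The sample string is made up for illustration, and the replacement '$1 ' (captured character plus one space) is an assumption about what the pattern is paired with:

```php
<?php
// Hypothetical sample input, not taken from the real product data.
$sample = 'Side length 50.5 cm ,Value 4,500';

// First attempt: every . , : gets "no space before, one space after",
// with no exception for digits on both sides.
$broken = preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $sample);

echo $broken, PHP_EOL; // Side length 50. 5 cm, Value 4, 500
```

The comma after "cm" is fixed as intended, but the decimal point and the thousands separator are split as well, which is exactly what the exceptions need to prevent.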

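Edit: Both solutions posted below work fine, but they end up adding a space after the last period of the description. That's not a big problem, because earlier in my code I already strip <br />s and spaces from the start and end of the description, so I moved this preg_replace before that step...

So, the final code I ended up using is this: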

$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])(?!(?<=\d.)\d)(?!(?<=ό,)τι)\s*#ui', '$1 ', $variation['DESCRIPTION']);
$variation['DESCRIPTION'] = preg_replace('#^\s*(<br />)*\s*|\s*(<br />)*\s*$#', '', $variation['DESCRIPTION']);
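As a rough illustration of why the order of these two calls matters, the sketch below runs both steps on a made-up description (the input string and the variable name are assumptions for the example only). The first call leaves a trailing space after the final period, and the second call removes it along with the leading <br />:

```php
<?php
// Hypothetical sample description, made up for this illustration.
$desc = '<br />Soft fabric ,rib cuffs.Fits 50.5 cm.';

// Step 1: normalize spacing; '$1 ' puts the captured character back plus one space.
$desc = preg_replace('#\s*([:,.])(?!(?<=\d.)\d)(?!(?<=ό,)τι)\s*#ui', '$1 ', $desc);
// At this point the string ends with "cm. " (note the trailing space).

// Step 2: strip <br />s and whitespace from both ends.
$desc = preg_replace('#^\s*(<br />)*\s*|\s*(<br />)*\s*$#', '', $desc);

echo $desc, PHP_EOL; // Soft fabric, rib cuffs. Fits 50.5 cm.
```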

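So the only thing left to do is change this code so that it handles ellipses the way I described above.

Any help with this last requirement would be appreciated! TIA

You can add two lookaheads that contain lookbehinds: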

\s*(\.{2,}|[:,.](?!(?<=ό,)τι)(?!(?<=\d.)\d))(?!\s*<br\s*/>)\s*

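See the regex demo. Note that I also added \s* to the last lookahead and swapped it with the consuming \s*, so the match fails if a <br/> follows any zero or more whitespaces after a : , or . character.

Details

  • \s* - zero or more whitespaces
  • (\.{2,}|[:,.]) - Group 1: two or more dots, or one of : , .
  • (?!(?<=ό,)τι) - fail the match if the next two characters are τι and they are preceded by ό,
  • (?!(?<=\d.)\d) - fail the match if the next character is a digit that is preceded by a digit plus any one character (note that . is enough here because [:,.] has already matched the allowed/required character; we only need to "skip" that matched character)
  • (?!\s*<br\s*/>) - a negative lookahead that fails the match if there are zero or more whitespaces, <br, zero or more whitespaces, and /> immediately to the right of the current location
  • \s* - zero or more whitespaces.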
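
For reference, here is a minimal sketch of how this pattern could be plugged into preg_replace. The function name normalizePunctuation, the ~ delimiter, the u flag, the '$1 ' replacement and the sample strings are all assumptions made for the example:

```php
<?php
// Hypothetical helper name; the pattern itself is the one explained above.
function normalizePunctuation(string $text): string
{
    // 'u' treats pattern and subject as UTF-8 so the Greek letters are whole characters.
    $pattern = '~\s*(\.{2,}|[:,.](?!(?<=ό,)τι)(?!(?<=\d.)\d))(?!\s*<br\s*/>)\s*~u';

    // '$1 ' re-emits the captured punctuation followed by exactly one space.
    return preg_replace($pattern, '$1 ', $text);
}

echo normalizePunctuation('Composition:80% Wool ,20% Silk.Length 50.5 cm ,value 4,500 and ό,τι'), PHP_EOL;
// Composition: 80% Wool, 20% Silk. Length 50.5 cm, value 4,500 and ό,τι

echo normalizePunctuation('Specs:<br />First ...more'), PHP_EOL;
// Specs:<br />First... more
```

The decimal, the thousands separator, the Greek phrase and the colon directly before <br /> are left alone, while the run of dots is treated as a single unit.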

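If Wiktor's lookaround-heavy pattern is too hard for you to conceptualize/maintain/adapt, then perhaps a match-and-skip technique will be easier for you. Admittedly, Wiktor's pattern is optimized for performance.

Pattern: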

~                        #starting pattern delimiter 
\s*                      #zero or more whitespaces
(?:                      #start non-capturing group #1
  (?:                    #start non-capturing group #2
    \.\d+                #match float expression not requiring leading digits
    |                    #or
    \d{1,3}(?:,\d{3})+   #match number containing thousands separators
    |                    #or
    ό,τι                 #match literal greek phrase
    |                    #or
    <br\s*/>             #match html break tag
  )                      #end non-capturing group #2
  (*SKIP)(*FAIL)         #discard anything matched by group #2
  |                      #or
  (                      #start capture group #1
    \.{3}                #match three dots as ellipsis
    |                    #or
    [:,.]                #match literal colon, comma, or dot
  )                      #end capture group #1
)                        #end non-capturing group #1
\s*                      #zero or more whitespaces
~                        #ending pattern delimiter

If you want to extend the pattern with more disqualifying rules, just add another pipe followed by a subpattern that matches the unwanted substring.
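
As a sketch of what such an extension might look like, the example below adds one more branch that protects clock times such as 09:30. That rule, the function name normalizeWithSkips, the u flag and the sample string are all invented for illustration and are not part of the original requirements:

```php
<?php
// Hypothetical helper; the \d{1,2}:\d{2} branch is an invented extra rule
// (protect clock times such as 09:30), added only to show the mechanism.
function normalizeWithSkips(string $text): string
{
    $pattern = '~\s*(?:(?:\.\d+|\d{1,3}(?:,\d{3})+|\d{1,2}:\d{2}|ό,τι|<br\s*/>)(*SKIP)(*FAIL)|(\.{3}|[:,.]))\s*~u';

    // '$1 ' re-emits the captured punctuation followed by one space.
    return preg_replace($pattern, '$1 ', $text);
}

echo normalizeWithSkips('Opens at 09:30 ,closes at 17:00.Price:4,500'), PHP_EOL;
// Opens at 09:30, closes at 17:00. Price: 4,500
```

Each protected substring is consumed by the first group and then thrown away by (*SKIP)(*FAIL), so the colon inside the times never reaches the capturing branch.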

To make sure that three qualifying dots are matched as an ellipsis, list that alternative before the single-character check.
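
A small comparison shows why the order matters. The input string and variable names are made up, and the patterns are trimmed down to just the capturing part so the difference is easy to see:

```php
<?php
$sample = 'Wait...more'; // made-up input

// Ellipsis alternative first: the three dots are consumed as one unit.
$ellipsisFirst = preg_replace('~\s*(\.{3}|[:,.])\s*~', '$1 ', $sample);

// Single-character class first: each dot matches on its own.
$singleFirst = preg_replace('~\s*([:,.]|\.{3})\s*~', '$1 ', $sample);

echo $ellipsisFirst, PHP_EOL; // Wait... more
echo $singleFirst, PHP_EOL;   // Wait. . . more
```

Alternation is tried left to right, so once [:,.] succeeds on the first dot the \.{3} branch is never attempted.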

Code: (Demo)

$text = <<<TEXT
Composition:80% Polyamide,   15% Elastane, 5% Wool.
Side length 50.5 cm <---- THIS SHOULDN'T BE MATCHED
Value 4,500 <---- THIS SHOULDN'T BE MATCHED EITHER

What about ,234,567.89?
Or....1mm one tenth of a millimeter?

ό,τι<---- THIS IS A FIXED PHRASE IN GREEK AND THEREFORE SHOULDN'T BE MATCHED
Comfort and timeless design characterize the Puma Smash V2 made of suede leather. They can be worn all the time ,being a unique choice for those who want to stand out .Made of rubber.<br />- Softfoam floor<br />- Binding with laces

Specs:<br />&bull; Something<br /><br />&bull; Something else<br />&bull; One more

Children's Form Champion<br /><br />Children's set that will give a comfortable feeling for endless hours of play.<br />It consists of a cardigan and trousers ,made of soft fabric and have rib cuffs and legs for a better fit.<br /><br />&bull; Normal fit<br /><br />&bull; Cardigan  :Rib cuffs, zippers throughout length, high neck, Champion logo <br /> <br />&bull; Pants: Elastic waist with drawstring, ribbed legs, Champion logo. Don't worry,there'll be ...more!
TEXT;

echo preg_replace(
         '~\s*(?:(?:\.\d+|\d{1,3}(?:,\d{3})+|ό,τι|<br\s*/>)(*SKIP)(*FAIL)|(\.{3}|[:,.]))\s*~',
         '$1 ',
         $text
     );

Output:

Composition: 80% Polyamide, 15% Elastane, 5% Wool. Side length 50.5 cm <---- THIS SHOULDN'T BE MATCHED
Value 4,500 <---- THIS SHOULDN'T BE MATCHED EITHER

What about ,234,567.89?
Or... .1mm one tenth of a millimeter?

ό,τι<---- THIS IS A FIXED PHRASE IN GREEK AND THEREFORE SHOULDN'T BE MATCHED
Comfort and timeless design characterize the Puma Smash V2 made of suede leather. They can be worn all the time, being a unique choice for those who want to stand out. Made of rubber. <br />- Softfoam floor<br />- Binding with laces

Specs: <br />&bull; Something<br /><br />&bull; Something else<br />&bull; One more

Children's Form Champion<br /><br />Children's set that will give a comfortable feeling for endless hours of play. <br />It consists of a cardigan and trousers, made of soft fabric and have rib cuffs and legs for a better fit. <br /><br />&bull; Normal fit<br /><br />&bull; Cardigan: Rib cuffs, zippers throughout length, high neck, Champion logo <br /> <br />&bull; Pants: Elastic waist with drawstring, ribbed legs, Champion logo. Don't worry, there'll be ...more!