Regex for last name: how to improve efficiency and length and only allow 50 characters

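I have this regular expression to handle last names with hyphens or apostrophes, but it is really long and I am not sure how to fix it easily without doing more research. I have already spent a couple of hours on it, and although it works, I would like to clean it up a bit. I am including it in a JSON schema. I would also like to limit the total number of characters to 50. I know you can do that with {1,50}, but I do not know how to apply it to a complex regular expression. This is what I currently have: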

^[a-zA-Z]+((-[a-zA-Z]+)*('[a-zA-Z]+)*|('[a-zA-Z]+)*(-[a-zA-Z]+)*)$

And my test data:

5               -- should fail
foster              -- should match
foster steve            -- should match (EDIT: should not match)
foster-morrison             -- should match
hello               -- should match
*RKER(($(#$)#$#L$KLK#$*     -- should fail
dfkfsdkfskdfjksjfksfksjfskjfksjfskfksjfksfksfskfd   -- should match
jkddkdkdkdkdkdkdkdkfdkd-ffddfdfdfdgggfgfgfgggggggg  -- should match
jkddkdkdkdkdkdkdkdkdkdkdkskdldkfdlkfdfkdfkdlkdkdkd-ffddfdfdfdgggfgfgfgggggggd   -- should fail
dkfkerksf------aaa-----     -- should fail
test---me           -- should fail
foster-mo           -- should match
f-morrison          -- should match
griffith-joiner         -- should match
test-               -- should fail
-dkd                -- should fail
d'andre             -- should match
d'andre-jordan          -- should match
jordan-d'andre          -- should match

You could use:

^(?=.{1,50}$)[a-zA-Z]+(?:'[a-zA-Z]+)?(?:[- ][a-zA-Z]+(?:'[a-zA-Z]+)?)?$

Regex demo.

If you want to support multiple (but not consecutive) hyphens/spaces, you can replace the last ? with *
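To make the difference between the two quantifiers concrete, here is a small Python sketch. The names `single` and `repeated` are mine, and it assumes Python's re module treats this pattern the same way as the flavor used in the demo.

```python
import re

# Pattern from above: at most one extra part after a hyphen or space.
single = re.compile(
    r"^(?=.{1,50}$)[a-zA-Z]+(?:'[a-zA-Z]+)?(?:[- ][a-zA-Z]+(?:'[a-zA-Z]+)?)?$"
)
# Same pattern with the last ? replaced by *: any number of extra parts.
repeated = re.compile(
    r"^(?=.{1,50}$)[a-zA-Z]+(?:'[a-zA-Z]+)?(?:[- ][a-zA-Z]+(?:'[a-zA-Z]+)?)*$"
)

for name in ["foster-morrison", "smith-jones-brown", "test---me"]:
    # Prints the name, then whether each variant accepts it.
    print(name, bool(single.match(name)), bool(repeated.match(name)))
```

Only the `*` variant accepts a name with two hyphenated parts after the first one, and neither accepts consecutive hyphens.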

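The reason your current pattern does not match all the examples is the alternation |. It matches one of two shapes (where chars is a-zA-Z):

chars-chars'chars
chars'chars-chars

That covers d'andre-jordan and jordan-d'andre, but it does not take a space into account, and it only matches uppercase or lowercase letters.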
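The effect of the alternation can be checked with a short Python sketch. The helper name `matches_original` and the name d'andre-jordan-o'neil are made up for illustration; the pattern is the one from the question, unchanged.

```python
import re

# The pattern from the question: hyphen parts then apostrophe parts,
# OR apostrophe parts then hyphen parts, but never interleaved.
original = re.compile(
    r"^[a-zA-Z]+((-[a-zA-Z]+)*('[a-zA-Z]+)*|('[a-zA-Z]+)*(-[a-zA-Z]+)*)$"
)

def matches_original(name: str) -> bool:
    """Return True if the question's pattern accepts the name."""
    return original.match(name) is not None

for name in ["d'andre-jordan", "jordan-d'andre", "d'andre-jordan-o'neil", "foster steve"]:
    print(name, matches_original(name))
```

The third name mixes apostrophe, hyphen, apostrophe, which neither branch of the alternation allows, and the fourth fails because the pattern has no space in it.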

To assert a length of 1 to 50 characters, you can use a positive lookahead (?=.{1,50}$)
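As a minimal Python sketch of the lookahead on its own: the body of the pattern is reduced to letters only so the length check is the only thing being exercised. The name `length_limited` is mine.

```python
import re

# The lookahead checks the total length from the start of the string
# without consuming anything; the rest then validates the characters.
length_limited = re.compile(r"^(?=.{1,50}$)[a-zA-Z]+$")

print(bool(length_limited.match("a" * 50)))  # exactly 50 letters
print(bool(length_limited.match("a" * 51)))  # one letter too many
```

Because the lookahead consumes nothing, it can be placed in front of any pattern, however complex, without changing what that pattern matches.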

Instead of the alternation, you can use a repeating pattern in which ' cannot occur twice in a row, and match a hyphen or space in between.

^(?=.{1,50}$)[a-zA-Z]+(?:'[a-zA-Z]+)*(?:[- ][a-zA-Z]+(?:'[a-zA-Z]+)*)*$
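A quick way to check a pattern like this is to run it over the question's test data. The sketch below uses a subset of that data (leaving out the long names whose outcome depends on their exact length). The pattern is spelled out in full in the code, with + inside every apostrophe group so that a part like d'andre can follow a hyphen; `cases` and `failing_cases` are names I made up.

```python
import re

# Repeating pattern: letters, optional 'letters groups, then any number of
# hyphen/space-separated parts of the same shape.
pattern = re.compile(
    r"^(?=.{1,50}$)[a-zA-Z]+(?:'[a-zA-Z]+)*(?:[- ][a-zA-Z]+(?:'[a-zA-Z]+)*)*$"
)

# Subset of the question's test data: name -> should it match?
cases = {
    "5": False,
    "foster": True,
    "foster-morrison": True,
    "*RKER(($(#$)#$#L$KLK#$*": False,
    "dkfkerksf------aaa-----": False,
    "test---me": False,
    "test-": False,
    "-dkd": False,
    "d'andre": True,
    "d'andre-jordan": True,
    "jordan-d'andre": True,
}

def failing_cases(expected: dict) -> list:
    """Return the names whose actual result differs from the expected one."""
    return [n for n, want in expected.items() if bool(pattern.match(n)) != want]

print(failing_cases(cases))
```

Note that [- ] also accepts a space, so foster steve matches this pattern; if the edited requirement (no spaces) applies, the space would have to be dropped from that character class.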

Regex demo

As an alternative, per request, my first suggested pattern was:

^[a-zA-Z']+(?:[- ][a-zA-Z']+)*$

Regex demo

(?=^.{1,50}$)^[a-zA-Z]+([ \-'][a-zA-Z]+)*?$

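A cleaned-up regex that handles all the test cases (plus a few extra ones, like test'''me) and asserts the 50-character limit. Note that this also matches longer, more complex last names such as jordan-d'andre-joe-b'bob; if that is not the behavior you want, feel free to let me know.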

How does it work?

The regex is in 3 main chunks:
(?=^.{1,50}$)
             ^[a-zA-Z]+
                       ([ \-'][a-zA-Z]+)*?$


First chunk breakdown:
(?=^.{1,50}$)

(?=         )    - positive lookahead, asserts that the following holds true
   ^       $     - ensure that between the start and end of the line...
    .            - ...any character...
     {1,50}      - ...exists, and there's between 1 and 50 of the "any character" token


Second chunk breakdown:
^[a-zA-Z]+

^             - assert that this begins at the start of the line
 [a-zA-Z]     - match any letter
         +    - get as many as you can, but be sure to get at least one


Third chunk breakdown:
([ \-'][a-zA-Z]+)*?$

                   $    - assert this happens at the end of the line
(               )*?     - match this entire group zero or more times, but only as much as is necessary. 
                          - in specific, *? lets you match between zero and unlimited times (*) as few times as possible (?).
                            this is because for names WITHOUT spaces, apostrophes, or hyphens, this section of 
                            the regex can be discarded (hence, zero or more times) which leaves behind only
                            the first and second parts of the regex - character count and name. however, in names that include many 
                            iterations and combinations of spaces, hyphens, and apostrophes, the regex can and will grow as needed,
                            continuing to match them as long as it doesn't hit the end of the line.
                          - note that without the $, this never matches, and will always miss any name with spaces, apostrophes, or hyphens.
 [ \-']                 - match a space, a literal hyphen, or an apostrophe, once - no back-to-back symbols
       [a-zA-Z]+        - match one or more letters

Try it here!
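For readers who prefer to check the breakdown in code, here is a small Python sketch. It assumes Python's re module handles this pattern like the tester linked above; the names `pattern` and `without_anchor` are mine.

```python
import re

# The full pattern: length lookahead, first run of letters, then a lazy
# group of "one separator followed by letters".
pattern = re.compile(r"(?=^.{1,50}$)^[a-zA-Z]+([ \-'][a-zA-Z]+)*?$")

for name in ["foster", "jordan-d'andre-joe-b'bob", "test'''me", "test-", "a" * 51]:
    print(repr(name[:30]), bool(pattern.match(name)))

# Without the trailing $, the lazy group is satisfied with zero repetitions,
# so only the first run of letters is consumed.
without_anchor = re.compile(r"^[a-zA-Z]+([ \-'][a-zA-Z]+)*?")
print(without_anchor.match("foster-morrison").group(0))
```

The last line shows why the $ matters: it is what forces the lazy group to keep expanding until the whole name has been consumed.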


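I believe I saw in an earlier edit that you were interested in cleaning up the regex, but I am not sure whether that included efficiency. Either way, if you are going to count characters, it is important to be aware of the possibility of catastrophic backtracking. This article describes catastrophic backtracking, where a regex retraces its own steps over and over, trying to find a match that does not exist. I actually ran into it while working on this regex: if you use a lookbehind instead of a lookahead for the character limit, the regex runs so slowly that many debuggers will simply refuse to run it.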

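Always search as precisely as possible for what you are looking for. I do not see them in your example (good for you!), but if you are concerned about efficiency, wildcards can be rough on your regex. The . is a powerful tool, but it pays to be more precise; more wildcards usually mean more backtracking.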

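Let's look at a case where this can be a real problem. Say we have a string, abaabababbabababaabababb. It consists of one or more a followed by one or more b, and this pattern repeats exactly ten times. Valid strings include: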

abababababababababab
aaaaaaaaaabababababababababab
aabbaabbaabbaabbaabbaabbaabbaabbaabbaabb

We know that "one or more a" is a+, "one or more b" is b+, the "repeating pattern" is (a+b+), and "10 times" is (a+b+){10}. Cool! This regex matches all 3 strings in 96 steps.
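The precise pattern is easy to try out in Python. The sketch below builds strings of the same shapes as the examples programmatically rather than copying them, and uses fullmatch so the whole string has to fit the pattern. Python does not report a step count, so the 96 steps figure is not reproduced here.

```python
import re

# One or more a, then one or more b, repeated exactly ten times.
precise = re.compile(r"(a+b+){10}")

valid = [
    "ab" * 10,             # shortest possible valid string
    "a" * 9 + "ab" * 10,   # a long first run of a
    "aabb" * 10,           # two of each letter per repetition
]
print([bool(precise.fullmatch(s)) for s in valid])

# Only nine repetitions: the whole string cannot fit the pattern.
print(bool(precise.fullmatch("ab" * 9)))
```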

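But... what if we want to extend this so that any two characters work, not just a and b? It is tempting: (.+.+){10} looks innocent, right? Nope. It cannot handle even the shortest valid string without catastrophic backtracking.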

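Once it reaches the end of the string and finds "no match", the last . gives up one character to the previous . and checks whether that works (it does not). They do this again and again, all the way down the chain, and the number of combinations to try grows exponentially as your string gets longer. Even when a valid match exists, it can take the system seconds or even minutes to find it; regex101 simply refuses to try. If you remove one letter, however, the string is short enough for the site to consider, and it shows that more than 1.5 million steps are needed to determine that there is no match. God help you if you deploy this against an invalid string a hundred characters long.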
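The growth can be observed with a rough Python timing sketch. Python's re module is also a backtracking engine, but its timings will not correspond to regex101's step counts, and the exact numbers depend on the machine, so none are quoted here. The helper `time_failed_match` is a name I made up, and the inputs are kept short on purpose so the loop finishes quickly.

```python
import re
import time

wildcard = re.compile(r"(.+.+){10}")

def time_failed_match(length: int) -> float:
    """Seconds needed to conclude that a too-short string cannot match."""
    text = "a" * length                 # the pattern needs at least 20 characters
    start = time.perf_counter()
    result = wildcard.fullmatch(text)
    elapsed = time.perf_counter() - start
    assert result is None               # fewer than 20 characters never match
    return elapsed

# Each extra pair of characters multiplies the number of ways the two
# wildcards can split the input, so the time climbs quickly.
for length in range(10, 19, 2):
    print(length, f"{time_failed_match(length):.4f}s")
```

Replacing the wildcards with specific characters, as in (a+b+){10}, removes most of the ambiguity about where each part can end, which is why the precise version stays fast.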

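Wildcards combined with unbounded quantifiers can cause serious backtracking problems. The more precise you are, the better off you will be. Good luck!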

I would do it the simple way:

(?m)^(?:(?<!^)(?:(?!||)(['])|(?!||)([-])|(?!||)([ ]))(?!$)|[a-zA-Z]){1,50}$

https://regex101.com/r/TiRqeZ/1

 (?m)
 ^ 
 (?:
      (?<! ^ )
      (?:
           (?!  |  |  )
           ( ['] )                       # (1)
        |  (?!  |  |  )
           ( [-] )                       # (2)
        |  (?!  |  |  )
           ( [ ] )                       # (3)
      )
      (?! $ )
   |  
      [a-zA-Z] 
 ){1,50}
 $