Concatenate the term using substitute method via regex

Question

Summary of problem: I have written a generic regex to capture two groups from a sentence. I then need to concatenate the third term of the second group onto the first group. I use the word "and" in the regex as the separator between the two groups. For example:

Input = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'

Output = 'Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.'

The regex I have tried:

import re
string_ = "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin." 
regex_pattern = re.compile(r"\b([A-Za-z]*-\d+\s*|[A-Za-z]+\s*)\s+(and\s*[A-Za-z]*-\d+\s*[A-Za-z]*|and\s*[A-Za-z]+\s*[A-Za-z]+)?")
print(regex_pattern.findall(string_))
print(regex_pattern.sub(lambda x: x.group(1) + x.group(2)[2], string_))

The regex is able to capture the groups, but the line with the substitute method raises TypeError: 'NoneType' object is not subscriptable. Any suggestion or help with the problem above would be appreciated.

Answer 1

Split solution

Although this is not a regex solution, it does work:

from string import punctuation

x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
x = x.split()
for idx, word in enumerate(x):
    if word == "and":
        # strip punctuation or we will get skin. instead of skin
        x[idx] = x[idx + 2].strip(punctuation) + " and"
print(' '.join(x))

The output is:

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

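This solution avoids inserting directly into the list, because that would cause index problems while you iterate. Instead, we replace the first "and" in the list with "synthesis and" and the second "and" with "skin and", then rejoin the split string.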
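
The snippet above raises IndexError if an "and" has fewer than two words after it. A minimal sketch of the same idea as a function with a bounds check follows; the name distribute_shared_word is hypothetical and not part of the original answer.

```python
from string import punctuation


def distribute_shared_word(sentence):
    """Copy the second word after each "and" in front of that "and"."""
    words = sentence.split()
    for idx, word in enumerate(words):
        # Only rewrite an "and" that has at least two words after it.
        if word == "and" and idx + 2 < len(words):
            # Strip punctuation so "skin." is copied as "skin".
            shared = words[idx + 2].strip(punctuation)
            # Replace in place: the list length never changes while iterating.
            words[idx] = f"{shared} and"
    return " ".join(words)


sentence = ("Since, the genetic cells of SAC-1 and RbC-27 synthesis was not "
            "caused by WbC-2 of acnes in human face and animals skin.")
print(distribute_shared_word(sentence))
```

Note that split() followed by ' '.join() collapses any run of whitespace into a single space, which may or may not be what you want.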

Regex solution

If you insist on a regex solution, I suggest using re.findall with a pattern that contains a single "and", since this generalizes better to the problem:

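from string import punctuation
import re

# x must be the original sentence string, not the list produced by x.split() above
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
# raw string so that \s reaches the regex engine as written
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")
result = ''.join([f"{match[0]} {match[2].strip(punctuation)} and {match[1]} {match[2]}" for match in pattern.findall(x)])
print(result)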

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

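Again we use strip(punctuation) because skin. is captured: we do not want to lose the punctuation at the end of the sentence, but we do want to drop it from the copy inserted inside the sentence.

Here is our pattern: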

(.*?)\sand\s(.*?)\s([^\s]+)

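(.*?)\s: captures everything before "and"; the trailing \s matches the space in front of "and" without capturing it
\s(.*?)\s: captures the word immediately after "and"
([^\s]+): captures everything that is not whitespace up to the next whitespace (i.e. the second word after "and"). This ensures we capture the punctuation as well.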
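
To see what each group holds, here is a small sketch that prints the tuples returned by findall for the example sentence. The variable names sentence and matches are illustrative only.

```python
import re

sentence = ("Since, the genetic cells of SAC-1 and RbC-27 synthesis was not "
            "caused by WbC-2 of acnes in human face and animals skin.")
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")

# findall returns one tuple per match:
# (text before "and", first word after "and", second word after "and")
matches = pattern.findall(sentence)
for before, first, second in matches:
    print(repr(before), repr(first), repr(second))
```

The second tuple's first element begins with a space (' was not caused ...'), because the second match starts right after "synthesis". That leading space is why joining the rebuilt pieces with ''.join still yields correctly spaced text.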

Answer 2

You do not need to import punctuation; a single regex will do:

import re
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
pattern = re.compile(r"(.*?)\s+and\s+(\S+)\s+(\S+)\b([_\W]*)", re.DOTALL)
result = ''.join([f"{a} {c} and {b} {c}{d}" for a,b,c,d in pattern.findall(x)])
print(result)

Result: Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

See Python proof.

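Use re.DOTALL to allow the dot to match newlines.
Use the \b word boundary at the end to split off the punctuation, and capture it into a separate group with ([_\W]*).
Use \s+ to trim any amount of whitespace from the result.
[^\s] is the same as \S, so shorten it.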

See regex proof.
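
Since the question asks specifically about the substitute method, here is a hedged sketch of the same idea with re.sub and backreferences. It is not from either answer: the function name distribute_with_sub and the shortened pattern are assumptions. The TypeError in the question occurs because the optional second group is None whenever no "and" follows the first group; x.group(2)[2] would in any case index the third character of the string, not the third word. In the pattern below both groups are mandatory, so neither can be None.

```python
import re


def distribute_with_sub(sentence):
    # Group 1: first word after "and"; group 2: second word after "and".
    # \S+ first grabs "skin." and then backtracks to "skin" so that \b can
    # match, which leaves the trailing punctuation outside the match.
    pattern = re.compile(r"\s+and\s+(\S+)\s+(\S+)\b")
    # Put a copy of group 2 in front of "and", then keep both words after it.
    return pattern.sub(r" \2 and \1 \2", sentence)


sentence = ("Since, the genetic cells of SAC-1 and RbC-27 synthesis was not "
            "caused by WbC-2 of acnes in human face and animals skin.")
print(distribute_with_sub(sentence))
```

Text that is not part of a match is left untouched by re.sub, so there is no need to capture the prefix or the trailing punctuation and reassemble them by hand.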

Explanation

--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    .*?                      any character (0 or more times (matching
                             the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  and                      'and'
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (                        group and capture to \4:
--------------------------------------------------------------------------------
    [_\W]*                   any character of: '_', non-word
                             characters (all but a-z, A-Z, 0-9, _) (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \4
