Concatenate a term using the substitute method via regex

Summary of the problem: I have written a generic regex to capture two groups from a sentence. I then need to concatenate the 3rd term of the 2nd group to the 1st group. I use the word and in the regex as a separator to split the sentence into the two groups. For example:

Input = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'

Output = 'Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.'

The regex I have tried:

import re
string_ = "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin." 
regex_pattern = re.compile(r"\b([A-Za-z]*-\d+\s*|[A-Za-z]+\s*)\s+(and\s*[A-Za-z]*-\d+\s*[A-Za-z]*|and\s*[A-Za-z]+\s*[A-Za-z]+)?")
print(regex_pattern.findall(string_))
print(regex_pattern.sub(lambda x: x.group(1) + x.group(2)[2], string_))

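The regex captures the groups, but the line that calls the substitute method raises TypeError: 'NoneType' object is not subscriptable. Any suggestions or help with the problem above would be much appreciated.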

Split solution

Although this is not a regex solution, it does work:

from string import punctuation

x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
x = x.split()
for idx, word in enumerate(x):
    if word == "and":
        # strip punctuation or we will get skin. instead of skin
        x[idx] = x[idx + 2].strip(punctuation) + " and"
print(' '.join(x))

Output:

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

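This solution avoids inserting directly into the list, because that would shift the indices while you are iterating. Instead, we replace the first "and" in the list with "synthesis and" and the second "and" with "skin and", then re-join the split string.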
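The loop above assumes that every "and" is followed by at least two more words. As a minimal sketch (the function name distribute_after_and is hypothetical, not part of the original answer), the same idea can be written so that it builds a new list and skips any "and" that sits too close to the end of the sentence:

```python
from string import punctuation


def distribute_after_and(sentence: str) -> str:
    """Copy the second word after each "and" in front of that "and"."""
    words = sentence.split()
    out = []
    for idx, word in enumerate(words):
        # Only act when two more words follow; otherwise keep "and" as-is.
        if word == "and" and idx + 2 < len(words):
            # Strip punctuation so we insert "skin" rather than "skin."
            out.append(words[idx + 2].strip(punctuation))
        out.append(word)
    return " ".join(out)


x = ('Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused '
     'by WbC-2 of acnes in human face and animals skin.')
print(distribute_after_and(x))
```

Because the result goes into a separate list, the indices of the list being iterated never change.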

Regex solution

If you insist on a regex solution, I suggest using re.findall with a pattern that contains a single and, since that generalizes better for this problem:

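from string import punctuation
import re

# x must be the original sentence string, not the list produced by split() above
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
# raw string, so that \s reaches the regex engine unchanged
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")
result = ''.join([f"{match[0]} {match[2].strip(punctuation)} and {match[1]} {match[2]}" for match in pattern.findall(x)])
print(result)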

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

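We use strip(punctuation) again because skin. is captured with its full stop: we do not want to lose the punctuation at the end of the sentence, but we do want to drop it from the copy inserted inside the sentence.

Here is our pattern:

(.*?)\sand\s(.*?)\s([^\s]+)
  1. (.*?)\s: captures everything before "and"; the trailing \s matches the space in front of "and" without capturing it
  2. \s(.*?)\s: captures the word immediately after "and"
  3. ([^\s]+): captures anything that is not a space, up to the next space (i.e. the second word after "and"). This ensures we also capture the punctuation.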
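To make the three groups concrete, here is a small sketch that prints what re.findall returns for the example sentence. The variable names before, first and second are illustrative only:

```python
import re

x = ('Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused '
     'by WbC-2 of acnes in human face and animals skin.')
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")

# findall returns one 3-tuple per "and" found in the sentence.
for before, first, second in pattern.findall(x):
    print(repr(before), repr(first), repr(second))
# 'Since, the genetic cells of SAC-1' 'RbC-27' 'synthesis'
# ' was not caused by WbC-2 of acnes in human face' 'animals' 'skin.'
```

Note that the second tuple starts with a space and ends with 'skin.', which is why the join uses an empty separator and why the inserted copy is stripped of punctuation.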

You do not need to import punctuation; a single regex is enough:

import re
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
pattern = re.compile(r"(.*?)\s+and\s+(\S+)\s+(\S+)\b([_\W]*)", re.DOTALL)
result = ''.join([f"{a} {c} and {b} {c}{d}" for a,b,c,d in pattern.findall(x)])
print(result)

Result:

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

See the Python proof.

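Use re.DOTALL to let the dot match line break characters.
Strip the punctuation with the \b word boundary at the end, and capture it in a separate group with ([_\W]*).
Use \s+ to trim any amount of whitespace characters in the result.
[^\s] is the same as \S, so shorten it.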
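The same building blocks also work with re.sub, which is what the question originally attempted. The TypeError there comes from the trailing ? that makes group 2 optional: for every match where it does not participate, x.group(2) is None (and [2] would pick a single character, not the third term, in any case). The following is only a sketch under the assumption that every "and" is followed by two whitespace-separated terms; the names AND_PAIR and repeat_shared_term are hypothetical:

```python
import re

# Both groups are mandatory, so neither can be None inside the replacement.
# The trailing \b makes the second \S+ give back any closing punctuation.
AND_PAIR = re.compile(r"\s+and\s+(\S+)\s+(\S+)\b")


def repeat_shared_term(sentence: str) -> str:
    # \1 is the word right after "and", \2 is the shared term that follows it.
    return AND_PAIR.sub(r" \2 and \1 \2", sentence)


x = ('Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused '
     'by WbC-2 of acnes in human face and animals skin.')
print(repeat_shared_term(x))
```

Unlike joining the tuples from findall, sub leaves every character outside the matches untouched, so any text after the last "and" phrase is kept.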

regex proof

Explanation

--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    .*?                      any character (0 or more times (matching
                             the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  and                      'and'
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (                        group and capture to \4:
--------------------------------------------------------------------------------
    [_\W]*                   any character of: '_', non-word
                             characters (all but a-z, A-Z, 0-9, _) (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \4
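The breakdown above can also be kept next to the code itself. As a sketch, the same pattern is written here with re.VERBOSE so that each part carries its own comment; the variable names are illustrative:

```python
import re

pattern = re.compile(r"""
    (.*?)       # group 1: everything before "and", as little as possible
    \s+and\s+   # the word "and" surrounded by whitespace
    (\S+)       # group 2: first term after "and"
    \s+
    (\S+)       # group 3: second term after "and"
    \b          # word boundary: pushes closing punctuation out of group 3
    ([_\W]*)    # group 4: trailing punctuation and whitespace
""", re.DOTALL | re.VERBOSE)

x = ('Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused '
     'by WbC-2 of acnes in human face and animals skin.')
parts = pattern.findall(x)
result = ''.join(f"{a} {c} and {b} {c}{d}" for a, b, c, d in parts)
print(result)
```

For the example sentence, group 4 holds a single space in the first match and the final full stop in the second, which is how the joined result keeps its spacing and sentence-ending punctuation.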