Concatenate the term using substitute method via regex
Summary of problem:
I have written a generic regex to capture two groups from a sentence. I then need to append the third term of the second group to the first group. I use the word "and" in the regex as the separator between the two groups. For example:
Input = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
Output = 'Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.'
The regex I have tried:
import re
string_ = "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin."
regex_pattern = re.compile(r"\b([A-Za-z]*-\d+\s*|[A-Za-z]+\s*)\s+(and\s*[A-Za-z]*-\d+\s*[A-Za-z]*|and\s*[A-Za-z]+\s*[A-Za-z]+)?")
print(regex_pattern.findall(string_))
print(regex_pattern.sub(lambda x: x.group(1) + x.group(2)[2], string_))
The regex captures the groups, but the line that calls the substitute method raises TypeError: 'NoneType' object is not subscriptable. Any suggestions or help with this problem would be greatly appreciated.
Split solution
Although this is not a regex solution, it does work:
from string import punctuation
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
x = x.split()
for idx, word in enumerate(x):
    if word == "and":
        # strip punctuation or we will get skin. instead of skin
        x[idx] = x[idx + 2].strip(punctuation) + " and"
print(' '.join(x))
The output is:
Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.
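This solution avoids inserting directly into the list, since that would cause index problems while you iterate. Instead, we replace the first "and" in the list with "synthesis and" and the second "and" with "skin and", then rejoin the split string.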
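The same idea can be wrapped in a function that builds a new list instead of overwriting entries in place. This is only a sketch: the name distribute_shared_term is hypothetical, and the bounds check (leave "and" untouched when fewer than two words follow it) is an added assumption, not part of the original answer.

```python
from string import punctuation


def distribute_shared_term(sentence):
    """Copy the second word after each 'and' in front of that 'and'."""
    words = sentence.split()
    result = []
    for idx, word in enumerate(words):
        # Assumption: skip an 'and' that has fewer than two words after it.
        if word == "and" and idx + 2 < len(words):
            # Strip punctuation so we insert "skin", not "skin."
            result.append(words[idx + 2].strip(punctuation))
        result.append(word)
    return " ".join(result)


x = ('Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused '
     'by WbC-2 of acnes in human face and animals skin.')
print(distribute_shared_term(x))
```

Because the function appends to a fresh list, the indices of the words being read never shift during iteration.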
Regex solution
If you insist on a regex solution, I suggest using re.findall with a pattern that contains a single "and", since that generalizes better to the problem:
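from string import punctuation
import re
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
# raw string, so that \s reaches the regex engine unchanged
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")
result = ''.join([f"{match[0]} {match[2].strip(punctuation)} and {match[1]} {match[2]}" for match in pattern.findall(x)])
print(result)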
Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.
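We use strip(punctuation) again because "skin." is captured: we do not want to lose the punctuation at the end of the sentence, but we do want to drop it where the word is repeated mid-sentence.
This is our pattern:
(.*?)\sand\s(.*?)\s([^\s]+)
(.*?)\s: captures everything before "and"; the trailing \s matches the space in front of "and" without capturing it
\s(.*?)\s: captures the word immediately after "and"
([^\s]+): captures everything that is not whitespace up to the next whitespace (that is, the second word after "and"). This ensures we also capture the punctuation.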
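Since the question asks about the substitute method specifically, the same pattern can also be used with sub and a callback. The TypeError in the question most likely comes from the trailing ? on the second group: when an optional group takes no part in a match, group(2) returns None, and None cannot be indexed. Also, group(2)[2] would select the third character rather than the third term. The sketch below makes every group mandatory; the helper names distribute and rewrite are hypothetical.

```python
import re
from string import punctuation

# An optional group that does not participate in a match is reported as None.
unmatched = re.search(r"(a)(b)?", "a").group(2)  # None

PATTERN = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")


def distribute(match):
    # All three groups are mandatory here, so none of them can be None.
    before, first, shared = match.groups()
    return f"{before} {shared.strip(punctuation)} and {first} {shared}"


def rewrite(text):
    # sub() replaces each match and leaves unmatched text in place.
    return PATTERN.sub(distribute, text)


x = ('Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused '
     'by WbC-2 of acnes in human face and animals skin.')
print(rewrite(x))
```

Unlike joining the findall results, sub keeps any text that follows the last match.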
You do not need to import punctuation; a single regex is enough:
import re
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
pattern = re.compile(r"(.*?)\s+and\s+(\S+)\s+(\S+)\b([_\W]*)", re.DOTALL)
result = ''.join([f"{a} {c} and {b} {c}{d}" for a,b,c,d in pattern.findall(x)])
print(result)
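Result: Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.
See the Python proof.
Use re.DOTALL to allow the dot to match newline characters.
Use \b at the end to split off the punctuation, and capture it in a separate group with ([_\W]*).
Use \s+ to trim any amount of whitespace from the results.
[^\s] is the same as \S, so it is shortened.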
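With four plain capture groups, the rewrite can also be expressed as a replacement template passed to sub, which is closer to the "substitute method" the question asks for. This is a sketch rather than part of the original answer; rewrite_with_template is a hypothetical name.

```python
import re

PATTERN = re.compile(r"(.*?)\s+and\s+(\S+)\s+(\S+)\b([_\W]*)", re.DOTALL)


def rewrite_with_template(text):
    # \1 = text before 'and', \2 = first word after it,
    # \3 = shared term, \4 = trailing punctuation and whitespace
    return PATTERN.sub(r"\1 \3 and \2 \3\4", text)


x = ('Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused '
     'by WbC-2 of acnes in human face and animals skin.')
print(rewrite_with_template(x))
```

One design difference: ''.join over findall only keeps matched spans, so a sentence that continues after the last match would lose its tail, whereas sub leaves unmatched text untouched.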
Explanation
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    .*?                      any character (0 or more times (matching
                             the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  and                      'and'
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (                        group and capture to \4:
--------------------------------------------------------------------------------
    [_\W]*                   any character of: '_', non-word
                             characters (all but a-z, A-Z, 0-9, _) (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \4
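The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is a sketch of the same expression under that assumption; the name VERBOSE_PATTERN is hypothetical.

```python
import re

VERBOSE_PATTERN = re.compile(
    r"""
    (.*?)       # group 1: as few characters as possible before the separator
    \s+and\s+   # the word 'and' surrounded by whitespace
    (\S+)       # group 2: first word after 'and'
    \s+         # whitespace between the two words
    (\S+)       # group 3: second word after 'and' (the shared term)
    \b          # word boundary, so trailing punctuation is split off
    ([_\W]*)    # group 4: trailing punctuation and whitespace
    """,
    re.DOTALL | re.VERBOSE,
)

x = ('Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused '
     'by WbC-2 of acnes in human face and animals skin.')
for groups in VERBOSE_PATTERN.findall(x):
    print(groups)
```

Each tuple printed holds the four groups of one match, in the order described in the table.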