Match a variant name string in a sentence with regex in python3

Question

I want to match c.[1210-12t[7];1408a>g] in the following text with a regular expression in python3:

It was frequently associated with the c.[1210-12t[7];1408a>g] (t7-p.val470) allele and this cftr genetic background could not explain the putative pathogenicity of this variant.

However, in my scenario I only know the prefix of the desired word, c.[1210-12t[7]. So I tried the regex pattern c\.\[1210-12t\[7\].*\b but it matched half of the sentence:

c.[1210-12t[7];1408a>g] (t7-p.val470) allele and this cftr genetic background could not explain the putative pathogenicity of this variant.

Could you help me fix my regex? Thanks!

Answer 1

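The .* is the culprit: it is greedy, so it matches as much as it possibly can. Since the text you want to capture contains no spaces, you can use the non-greedy form .*? to capture everything up to the earliest whitespace character (\s) or the end of the string ($).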

c\..*?\s

Or, if it can be the last part of the sentence:

c\..*?(?:\s|$)

Or, if you want a capture group, so that the trailing whitespace is left out of the captured text:

(c\..*?)(?:\s|$)

Sample run:
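
A minimal sketch of such a run, assuming the sentence from the question is stored in a variable named text (the variable names here are illustrative, not from the original post):

```python
import re

# The sentence from the question.
text = (
    "It was frequently associated with the c.[1210-12t[7];1408a>g] "
    "(t7-p.val470) allele and this cftr genetic background could not "
    "explain the putative pathogenicity of this variant."
)

# Non-greedy match from the first "c." up to the earliest whitespace
# character or the end of the string.
match = re.search(r"(c\..*?)(?:\s|$)", text)

# Group 1 holds the variant name; group 0 would also include the
# whitespace character that ended the match.
variant = match.group(1) if match else None
print(variant)
```

Reading group(1) rather than group(0) keeps the terminating space out of the result.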

In the last pattern:

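The outer (...) - capture group
c - matches the letter "c"
\. - matches a literal period
.*? - matches any characters (except a newline) non-greedily
The outer (?:...) - non-capturing group
\s|$ - matches a whitespace character or the end of the string. Because the preceding pattern is non-greedy, this matches the earliest whitespace character.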
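
Since the question starts from a known prefix rather than just "c.", one possible variation is to escape that prefix with re.escape and extend it with \S*, which consumes non-whitespace characters up to the next space or the end of the string. This is only a sketch: find_variant is a hypothetical helper name, and it assumes the variant name is delimited by whitespace.

```python
import re


def find_variant(prefix, sentence):
    """Return the whitespace-delimited token starting with prefix, or None.

    find_variant is a hypothetical helper, not part of the original answer.
    """
    # re.escape neutralises regex metacharacters such as "." and "[" in
    # the prefix; \S* then extends the match over the rest of the token.
    pattern = re.escape(prefix) + r"\S*"
    match = re.search(pattern, sentence)
    return match.group(0) if match else None


sentence = (
    "It was frequently associated with the c.[1210-12t[7];1408a>g] "
    "(t7-p.val470) allele."
)
print(find_variant("c.[1210-12t[7]", sentence))
```

Note that \S* also swallows any punctuation attached directly to the token, such as a trailing comma, so that case would need extra handling.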


regex

python-3.x

python-3.8