Searching for specific phrase pattern within lines. python

Question

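I have worked out some rules that I need to search for in files. These rules are essentially phrases containing an unknown number of words. For example,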

mutant...causes(...)GS

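This is a phrase that I want to search for in my files. The ... means there should be a few words here (i.e. in this gap), and (...) means there may or may not be words in this gap. GS here is a fixed string variable that I know.

Basically, I came up with these rules by going through many such files, and they tell me whether a particular file does what I need.

The problem is that a gap can contain any (small) number of words. A new line can even begin inside one of the gaps. Hence, I cannot do exact string matching.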

Some example text -

!Series_summary "To better understand how the expression of a *mutant gene that causes ALS* can perturb the normal phenotype of astrocytes, and to identify genes that may

Here GS is ALS (defined), and the starred text should be found as a positive match for the rule mutant...causes(...)GS

!Series_overall_design "The analysis includes 9 samples of genomic DNA from isolated splenic CD11c+ dendritic cells (>95% pure) per group. The two groups are neonates born to mothers with *induced allergy to ovalbumin*, and normal control neonates. All neonates are genetically and environmentally identical, and allergen-naive."

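Here GS is ovalbumin (defined), and the starred text should be found as a positive match for the rule induced...to GS

I am a beginner at programming in Python, so any help would be great!!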

Answer 1

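The following should help you get started. It reads your file and uses a Python regular expression to display all possible matches, which will help you check that it matches all the correct lines:

import re

with open('input.txt', 'r') as f_input:
    data = f_input.read()

# re.S lets '.' match newlines, so a gap may span a line break.
# 'GS' is matched literally here; replace it with your actual string.
print(re.findall(r'(mutant\s.*?\scauses.*?GS)', data, re.S))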
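Since GS stands for a string you already know, one way to plug it in is to build the pattern from a variable. This is only a minimal sketch: the names `text` and `gs` are illustrative, the sample line is taken from the question with the asterisks removed, and `re.escape` is used on the assumption that a GS value might contain regex metacharacters.

```python
import re

# Sample line from the question (asterisks removed).
text = ('!Series_summary "To better understand how the expression of a '
        'mutant gene that causes ALS can perturb the normal phenotype of astrocytes')

gs = 'ALS'  # assumed value of the GS variable for this file

# re.escape guards against regex metacharacters in the GS value.
pattern = r'mutant\s.*?\scauses.*?' + re.escape(gs)
matches = re.findall(pattern, text, re.S)
print(matches)
```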

To then search for just one match, change findall to search:

import re

with open('input.txt', 'r') as f_input:
    data = f_input.read()

# search stops at the first match instead of collecting all of them
if re.search(r'(mutant\s.*?\scauses.*?GS)', data, re.S):
    print('found')

To do this for many such files, you could adapt it as follows:

import re
import glob

# '*.*' picks up every file with an extension in the current directory
for filename in glob.glob('*.*'):
    with open(filename, 'r') as f_input:
        data = f_input.read()
    if re.search(r'mutant\s.*?\scauses.*?GS', data, re.S):
        print("'{}' matches".format(filename))
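If you have several rules, it may be easier to translate the rule notation into a regular expression automatically. The sketch below is one possible way to do that, not a definitive implementation. The helper `rule_to_regex` and its `max_words` limit are hypothetical. It assumes that `...` means one to `max_words` words, that `(...)` means zero to `max_words` words, that `GS` is a placeholder for the known string, and that rules are written without spaces around the gap markers. Unlike `.*?`, the bounded gaps keep a match from stretching across a whole file.

```python
import re

def rule_to_regex(rule, gs, max_words=5):
    """Translate a rule such as 'mutant...causes(...)GS' into a compiled regex.

    Hypothetical helper: the rule syntax and max_words limit are assumptions.
    """
    parts = []
    # Split the rule into gap markers, whitespace and literal words.
    for token in re.split(r'(\(\.\.\.\)|\.\.\.|\s+)', rule):
        if not token:
            continue
        if token == '...':        # at least one word in the gap
            parts.append(r'\s+(?:\S+\s+){1,%d}' % max_words)
        elif token == '(...)':    # zero or more words in the gap
            parts.append(r'\s+(?:\S+\s+){0,%d}' % max_words)
        elif token.isspace():     # plain space between two rule words
            parts.append(r'\s+')
        elif token == 'GS':       # placeholder for the known string
            parts.append(re.escape(gs))
        else:                     # ordinary literal word
            parts.append(re.escape(token))
    # The lookarounds stop the match from starting or ending mid-word.
    return re.compile(r'(?<!\w)' + ''.join(parts) + r'(?!\w)')

# Text based on the question's samples; the first gap contains a line break.
summary = ('how the expression of a mutant gene that\n'
           'causes ALS can perturb the normal phenotype')
design = ('neonates born to mothers with induced allergy to ovalbumin, '
          'and normal control neonates')

m1 = rule_to_regex('mutant...causes(...)GS', 'ALS').search(summary)
m2 = rule_to_regex('induced...to GS', 'ovalbumin').search(design)
print(repr(m1.group(0)) if m1 else 'no match')
print(repr(m2.group(0)) if m2 else 'no match')
```

Because every gap is written with `\s+`, a line break inside a gap is treated like any other whitespace, so no extra flag such as re.S is needed here.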
