Python regex to print text from one pattern to another, but only if a specific string exists in between

Question

So I have a file like this:

<html>
    <div>
        <h1>HOiihilasdl</h1>
    </div>
    <script src=https://example.com/file.js></script>
    <script>
        blabla
        blabla
        blabla
        blabla
        blabla
    </script>
    <script src=https://example.com/file.js></script>
    <script>
        blabla
        blabla
        cow
        blabla
        blabla
    </script>
</html>

I want to print everything from <script> to </script>, but only when the word cow appears between them (I want to do this with a Python regex).

The output would look like this:

    <script>
        blabla
        blabla
        cow
        blabla
        blabla
    </script>

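I have searched through a lot of answers, but none of them solved my problem.

I would also like to know whether it is possible to return just my "script" block if the word "cow" exists between <script> and </script>.

I am using Python 3.10.4.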

Answer 1

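I'm not entirely sure what you are trying to do here. If you only want to solve the scenario spelled out in your question, the solution could look like the one below: iterate through each line of the file and keep track of opening/closing tags. Whenever you encounter an opening tag, you start storing lines. If a pattern such as "cow" has not been found by the time you reach the next closing tag, the stored lines are discarded and the search restarts at the next opening tag.

Note: the solution below does not work for nested tags, but it can easily be adapted.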

def find_pattern(file, pattern):
    lines = []
    start = False
    found_pattern = False

    with open(file, 'r') as f:
        # Iterate through the lines in the file
        for line in f:
            # Remove the trailing newline character
            line = line.rstrip("\n")

            # Remove the leading whitespace
            stripped_line = line.lstrip()

            # An opening tag such as <script> starts a new block. Lines that also
            # contain a closing tag (e.g. <script src=...></script>) are skipped.
            if not start and stripped_line.startswith("<") and "</" not in line:
                start = True

            # Only lines inside a tracked block are stored and searched
            if start:
                lines.append(line)

                # Remember whether the pattern occurs inside the current block
                if pattern in line:
                    found_pattern = True

            # On a closing tag there are two possibilities: if the pattern was
            # found, stop; otherwise discard the block and keep searching.
            if stripped_line.startswith("</"):
                if found_pattern:
                    break
                lines = []
                start = False

    # Print the block only if it actually contained the pattern
    if found_pattern:
        for line in lines:
            print(line)

find_pattern(file="t.txt", pattern="cow")

Output:
    <script>
        blabla
        blabla
        cow
        blabla
        blabla
    </script>
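
Since the question asks specifically for a regex, here is a minimal sketch of that route as well. It is not part of the line-by-line solution above: the names SCRIPT_RE, find_script_blocks and sample are made up for illustration, and it assumes the <script> elements are not nested and that a plain substring check for the word is good enough. The regex pulls out each <script ...>...</script> element, and a simple filter keeps only those whose body contains the word.

```python
import re

# Matches one <script ...>...</script> element; group 1 captures the body.
# re.DOTALL lets "." span newlines, and ".*?" stops at the first </script>.
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)


def find_script_blocks(html, word):
    """Return every <script> element whose body contains `word`."""
    return [m.group(0) for m in SCRIPT_RE.finditer(html) if word in m.group(1)]


# Shortened version of the file from the question (hypothetical sample data)
sample = """<html>
    <script src=https://example.com/file.js></script>
    <script>
        blabla
    </script>
    <script>
        blabla
        cow
    </script>
</html>"""

for block in find_script_blocks(sample, "cow"):
    print(block)
```

Because the function returns the matching blocks as a list of strings, it also covers the "just return my script" part of the question. For anything beyond simple, well-formed input like this, a real HTML parser is usually a safer choice than a regex.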
