Data Extraction using Regex

I have the following data in a text file, "file.txt":

Recipes & Menus
Expert Advice
Ingredients
Holidays & Events
Community
Video
SUMMER COOKING
Lentil and Brown Rice Soup
Gourmet January 1991
3.5/4
reviews (83)
90%
make it again
Some soups genuinely do inspire a devotion akin to love, and this is one of
them. In the cold of winter, when Gourmet editors ponder the matter of what soup
Cook
Reviews (83)
YIELD: Makes about 14 cups, serving 6 to 8
Ingredients
5 cups chicken broth
1 1/2 cups lentils, picked over and rinsed
1 cup brown rice
a 32- to 35-ounce can tomatoes, drained, reserving the juice, and chopped
3 carrots, halved lengthwise and cut crosswise into 1/4-inch pieces
1 onion, chopped
1 stalk of celery, chopped
3 garlic cloves, minced
1/2 teaspoon crumbled dried basil
1/2 teaspoon crumbled dried orégano
1/4 teaspoon crumbled dried thyme
1 bay leaf
1/2 cup minced fresh parsley leaves
2 tablespoons cider vinegar, or to taste
Preparation
In a heavy kettle combine the broth, 3 cups water, the lentils, the rice, the tomatoes with the reserved juice,

I want to extract the data between Ingredients and Preparation.
I wrote the following regular expression for this:

(?s).*?Ingredients(.*?)Preparation.*

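However, it extracts the data between the Ingredients on line 3 of file.txt and Preparation, not the data between the later Ingredients heading (the one just before the ingredient list) and Preparation.
What changes should I make to my regex to fix this?
Thanks in advance!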

(?s).*?[*]{2}Ingredients[*]{2}(.*?)[*]{2}Preparation[*]{2}.*

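[*]{2} tells the regex engine that you want one character from the class (here only a literal *) exactly twice ({2}). The pattern therefore assumes the headings appear in the file as **Ingredients** and **Preparation**.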

I prefer character classes to escaping; I find them more readable than this:

(?s).*?\*{2}Ingredients\*{2}(.*?)\*{2}Preparation\*{2}.*

Depending on the language you are using, you may also need to escape the backslashes.
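
As a rough illustration of both spellings, here is a minimal Python sketch. The sample text is a hypothetical excerpt and assumes the headings really are wrapped in two literal asterisks; Python and the variable names are my choice, not something stated in the question.

```python
import re

# Hypothetical excerpt; assumes the headings are literally wrapped in "**".
text = (
    "Ingredients\n"
    "Community\n"
    "**Ingredients**\n"
    "5 cups chicken broth\n"
    "1 cup brown rice\n"
    "**Preparation**\n"
    "In a heavy kettle combine the broth\n"
)

# A raw string (r"...") passes the backslashes to the regex engine unchanged.
# In a normal string literal each one would have to be written as "\\*".
escaped = re.compile(r"(?s).*?\*{2}Ingredients\*{2}(.*?)\*{2}Preparation\*{2}.*")

# The same pattern using a character class instead of a backslash escape.
bracketed = re.compile(r"(?s).*?[*]{2}Ingredients[*]{2}(.*?)[*]{2}Preparation[*]{2}.*")

ingredients_escaped = escaped.match(text).group(1).strip()
ingredients_bracketed = bracketed.match(text).group(1).strip()

print(ingredients_escaped)
```

Both patterns skip the bare Ingredients on the first line because it is not surrounded by asterisks.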

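You can use a lookahead to check that each line is not Ingredients. This way the test is performed only at the start of each line (instead of at every character):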

(?m)^Ingredients\R((?:(?!Ingredients$).*\R)+?)Preparation$ 

Pattern details:

(?m)             # switch on the multiline mode (^ and $ match the limit of the line)
^Ingredients\R   # "Ingredients" at the start of the line followed by a new line
(   # capture group 1
    (?:          # open a non-capturing group
        (?!Ingredients$) # negative lookahead to check that the line is not "Ingredients"
        .*\R             # the line
    )+? # repeat until "Preparation"
)
Preparation$

Note: since you didn't say which regex engine you are using, \R may not be supported. In that case, replace it with \r?\n
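
For example, Python's re module does not support \R, so a sketch in that language has to use the \r?\n substitution. The sample text below is a shortened, hypothetical excerpt of file.txt, and the variable names are illustrative only.

```python
import re

# Shortened, hypothetical excerpt of file.txt with two "Ingredients" lines.
text = (
    "Recipes & Menus\n"
    "Expert Advice\n"
    "Ingredients\n"
    "Holidays & Events\n"
    "Reviews (83)\n"
    "Ingredients\n"
    "5 cups chicken broth\n"
    "1 cup brown rice\n"
    "Preparation\n"
    "In a heavy kettle combine the broth\n"
)

# Same pattern as above, with \R replaced by \r?\n.
line_pattern = re.compile(
    r"(?m)^Ingredients\r?\n((?:(?!Ingredients$).*\r?\n)+?)Preparation$"
)

match = line_pattern.search(text)
# Group 1 holds the lines between the two headings, one item per line.
ingredient_lines = match.group(1).splitlines() if match else []

print(ingredient_lines)
```

The attempt that starts at the first Ingredients line fails as soon as it reaches the second one, so the search moves on and only the block directly above Preparation is captured.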

You can use the lazy quantifier .*? for the second .* while keeping the first one greedy:

(?s).*\bIngredients\b(.*?)\bPreparation\b

Or you can use a tempered greedy token, in which case you don't need the leading .*:

(?s)\bIngredients\b(?:(?!\b(?:Ingredients|Preparation)\b).)*\bPreparation\b
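
A minimal Python sketch of the tempered greedy token follows. Note that I have wrapped the repeated token in a capture group so the text between the headings can be read back; that group is my addition and is not part of the pattern above. The sample text is a hypothetical excerpt.

```python
import re

# Shortened, hypothetical excerpt of file.txt with two "Ingredients" lines.
text = (
    "Recipes & Menus\n"
    "Ingredients\n"
    "Holidays & Events\n"
    "Ingredients\n"
    "5 cups chicken broth\n"
    "1 cup brown rice\n"
    "Preparation\n"
    "In a heavy kettle combine the broth\n"
)

# Each "." is consumed only if neither keyword starts at that position,
# so the match cannot run across another "Ingredients" or "Preparation".
tempered = re.compile(
    r"(?s)\bIngredients\b((?:(?!\b(?:Ingredients|Preparation)\b).)*)\bPreparation\b"
)

match = tempered.search(text)
between = match.group(1).strip() if match else ""

print(between)
```

The trade-off is that the lookahead runs at every character, which is more work than the line-based approach, but it does not depend on the headings sitting alone on their own lines.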

Try making your first .* greedy. It will consume every Ingredients up to the last one before Preparation:

(?s).*Ingredients(.*?)Preparation.*

Demo: https://regex101.com/r/mQ5eK5/1
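
To see the difference side by side, here is a small Python sketch that runs the original lazy pattern and the greedy variant on the same text. The text is a shortened, hypothetical excerpt of file.txt and the variable names are illustrative.

```python
import re

# Shortened, hypothetical excerpt of file.txt with two "Ingredients" lines.
text = (
    "Recipes & Menus\n"
    "Ingredients\n"
    "Holidays & Events\n"
    "Ingredients\n"
    "5 cups chicken broth\n"
    "1 cup brown rice\n"
    "Preparation\n"
    "In a heavy kettle combine the broth\n"
)

# Lazy leading .*? stops at the FIRST "Ingredients" (the navigation entry).
lazy_first = re.match(r"(?s).*?Ingredients(.*?)Preparation.*", text)

# Greedy leading .* backtracks from the end of the text, so it settles on the
# LAST "Ingredients" that still has a "Preparation" after it.
greedy_first = re.match(r"(?s).*Ingredients(.*?)Preparation.*", text)

lazy_capture = lazy_first.group(1)
greedy_capture = greedy_first.group(1)

print(repr(lazy_capture))
print(repr(greedy_capture))
```

One caveat: if the file contained a further Ingredients ... Preparation pair later on, the greedy version would pick that later pair instead.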