Data Extraction using Regex

Question

I have data in a text file, "file.txt":

Recipes & Menus
Expert Advice
Ingredients
Holidays & Events
Community
Video
SUMMER COOKING
Lentil and Brown Rice Soup
Gourmet January 1991
3.5/4
reviews (83)
90%
make it again
Some soups genuinely do inspire a devotion akin to love, and this is one of
them. In the cold of winter, when Gourmet editors ponder the matter of what soup
Cook
Reviews (83)
YIELD: Makes about 14 cups, serving 6 to 8
Ingredients
5 cups chicken broth
1 1/2 cups lentils, picked over and rinsed
1 cup brown rice
a 32- to 35-ounce can tomatoes, drained, reserving the juice, and chopped
3 carrots, halved lengthwise and cut crosswise into 1/4-inch pieces
1 onion, chopped
1 stalk of celery, chopped
3 garlic cloves, minced
1/2 teaspoon crumbled dried basil
1/2 teaspoon crumbled dried orégano
1/4 teaspoon crumbled dried thyme
1 bay leaf
1/2 cup minced fresh parsley leaves
2 tablespoons cider vinegar, or to taste
Preparation
In a heavy kettle combine the broth, 3 cups water, the lentils, the rice, the tomatoes with the reserved juice,

I want to extract the data between Ingredients and Preparation.
I wrote the following regex for this:

(?s).*?Ingredients(.*?)Preparation.*

But it extracts the data between the Ingredients on line 3 of file.txt and Preparation, not the data between the italicized Ingredients and Preparation. What changes should I make to my regex to fix this?
Thanks in advance!

Answer 1

(?s).*?[*]{2}Ingredients[*]{2}(.*?)[*]{2}Preparation[*]{2}.*

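[*]{2} tells the regex that you want one character from the list (here just a single *) twice, {2}. This assumes the headings you want are written in the file as **Ingredients** and **Preparation**.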

I prefer using character classes rather than escaping; I find them more readable than this:

(?s).*?\*{2}Ingredients\*{2}(.*?)\*{2}Preparation\*{2}.*

Depending on the language you are using, you may also need to escape the backslashes.
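As a rough illustration, here is how both variants might look in Python. This is only a sketch: the question does not say which language is in use, and SAMPLE is a trimmed, hypothetical stand-in for file.txt that assumes the headings are literally wrapped in two asterisks. The extract helper is a name I made up for the example. Raw strings (r"...") are used so the backslashes do not need to be doubled.

```python
import re

# Trimmed, hypothetical stand-in for file.txt. Assumption: the wanted
# headings are literally written as **Ingredients** and **Preparation**.
SAMPLE = """Expert Advice
Ingredients
Holidays & Events
**Ingredients**
5 cups chicken broth
1 cup brown rice
**Preparation**
In a heavy kettle combine the broth
"""

# Character-class variant and escaped variant; raw strings keep the
# backslashes intact without doubling them.
CHAR_CLASS = r"(?s).*?[*]{2}Ingredients[*]{2}(.*?)[*]{2}Preparation[*]{2}.*"
ESCAPED = r"(?s).*?\*{2}Ingredients\*{2}(.*?)\*{2}Preparation\*{2}.*"


def extract(pattern, text):
    """Return the captured block without surrounding whitespace, or None."""
    match = re.match(pattern, text)
    return match.group(1).strip() if match else None


ingredients = extract(CHAR_CLASS, SAMPLE)
print(ingredients)
```

Both patterns should capture the same block here, because the plain Ingredients on the second line has no asterisks around it and is skipped.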

Answer 2

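You can use a lookahead to check that each line is not Ingredients. This way you limit the number of tests to the start of each line only (instead of testing at every character):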

(?m)^Ingredients\R((?:(?!Ingredients$).*\R)+?)Preparation$


Pattern details:

(?m)             # switch on multiline mode (^ and $ match at the start and end of each line)
^Ingredients\R   # "Ingredients" at the start of the line followed by a new line
(   # capture group 1
    (?:          # open a non-capturing group
        (?!Ingredients$) # negative lookahead to check that the line is not "Ingredients"
        .*\R             # the line
    )+? # repeat until "Preparation"
)
Preparation$

Note: since you didn't say which regex engine you are using, \R may not be supported. In that case, replace it with \r?\n.
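For example, Python's built-in re module is one engine that does not accept \R, so a sketch there would use the \r?\n replacement. SAMPLE below is a trimmed, hypothetical stand-in for file.txt, and the name LINE_PATTERN is only for illustration.

```python
import re

# Trimmed, hypothetical stand-in for file.txt (plain headings, no markup).
SAMPLE = (
    "Expert Advice\n"
    "Ingredients\n"
    "Holidays & Events\n"
    "Ingredients\n"
    "5 cups chicken broth\n"
    "1 cup brown rice\n"
    "Preparation\n"
    "In a heavy kettle combine the broth\n"
)

# Same pattern as above with \R replaced by \r?\n, since Python's re
# module has no \R escape.
LINE_PATTERN = re.compile(
    r"(?m)^Ingredients\r?\n((?:(?!Ingredients$).*\r?\n)+?)Preparation$"
)

match = LINE_PATTERN.search(SAMPLE)
# Group 1 holds the whole lines between the two headings.
lines = match.group(1).splitlines() if match else []
print(lines)
```

The attempt starting at the first Ingredients line fails as soon as the lookahead meets the second Ingredients line, so the search moves on and succeeds from the second heading.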

Answer 3

You can use the lazy quantifier .*? for the capture, together with a greedy .* in front of it:

(?s).*\bIngredients\b(.*?)\bPreparation\b


Or you can use a tempered greedy token, in which case you don't need the first .*:

(?s)\bIngredients\b(?:(?!\b(?:Ingredients|Preparation)\b).)*\bPreparation\b

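A minimal Python sketch of the tempered greedy token, assuming a trimmed, hypothetical stand-in for file.txt. Note one change from the pattern above: I wrapped the tempered token in a capture group so the block can be read from group 1. That group is my addition for the example, not part of the pattern as posted.

```python
import re

# Trimmed, hypothetical stand-in for file.txt (plain headings, no markup).
SAMPLE = (
    "Expert Advice\n"
    "Ingredients\n"
    "Holidays & Events\n"
    "Ingredients\n"
    "5 cups chicken broth\n"
    "1 cup brown rice\n"
    "Preparation\n"
    "In a heavy kettle combine the broth\n"
)

# The tempered token (?:(?!...).)* consumes one character at a time, but
# only while neither heading word starts at the current position.
# The outer capture group is added here for illustration.
TEMPERED = re.compile(
    r"(?s)\bIngredients\b"
    r"((?:(?!\b(?:Ingredients|Preparation)\b).)*)"
    r"\bPreparation\b"
)

match = TEMPERED.search(SAMPLE)
block = match.group(1).strip() if match else None
print(block)
```

Because the token refuses to step over another Ingredients, the match can only start at the last Ingredients before Preparation.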

Answer 4

Try making your first .* greedy. It will eat up all the Ingredients up to the last one before Preparation:

(?s).*Ingredients(.*?)Preparation.*

Demo: https://regex101.com/r/mQ5eK5/1
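To see the difference side by side, here is a small Python sketch comparing the lazy and greedy prefixes. SAMPLE is a trimmed, hypothetical stand-in for file.txt, and the variable names are only for illustration.

```python
import re

# Trimmed, hypothetical stand-in for file.txt (plain headings, no markup).
SAMPLE = (
    "Expert Advice\n"
    "Ingredients\n"
    "Holidays & Events\n"
    "Ingredients\n"
    "5 cups chicken broth\n"
    "1 cup brown rice\n"
    "Preparation\n"
    "In a heavy kettle combine the broth\n"
)

LAZY = r"(?s).*?Ingredients(.*?)Preparation.*"   # pattern from the question
GREEDY = r"(?s).*Ingredients(.*?)Preparation.*"  # first .* made greedy

# The lazy prefix stops at the first "Ingredients" (the navigation entry),
# so the capture also contains the menu line and the second heading.
lazy_capture = re.match(LAZY, SAMPLE).group(1)

# The greedy prefix runs to the end, then backtracks to the last
# "Ingredients" that is still followed by "Preparation".
greedy_capture = re.match(GREEDY, SAMPLE).group(1)

print(repr(lazy_capture))
print(repr(greedy_capture))
```

One caveat with the greedy prefix: if the text holds several recipes, it keeps only the last Ingredients block that is followed by a Preparation.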
