Data Extraction using Regex
I have data in a text file, "file.txt":
Recipes & Menus
Expert Advice
Ingredients
Holidays & Events
Community
Video
SUMMER COOKING
Lentil and Brown Rice Soup
Gourmet January 1991
3.5/4
reviews (83)
90%
make it again
Some soups genuinely do inspire a devotion akin to love, and this is one of
them. In the cold of winter, when Gourmet editors ponder the matter of what soup
Cook
Reviews (83)
YIELD: Makes about 14 cups, serving 6 to 8
Ingredients
5 cups chicken broth
1 1/2 cups lentils, picked over and rinsed
1 cup brown rice
a 32- to 35-ounce can tomatoes, drained, reserving the juice, and chopped
3 carrots, halved lengthwise and cut crosswise into 1/4-inch pieces
1 onion, chopped
1 stalk of celery, chopped
3 garlic cloves, minced
1/2 teaspoon crumbled dried basil
1/2 teaspoon crumbled dried orégano
1/4 teaspoon crumbled dried thyme
1 bay leaf
1/2 cup minced fresh parsley leaves
2 tablespoons cider vinegar, or to taste
Preparation
In a heavy kettle combine the broth, 3 cups water, the lentils, the rice, the tomatoes with the reserved juice,
I want to extract the data between Ingredients and Preparation.
I wrote the following regex for this:
(?s).*?Ingredients(.*?)Preparation.*
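However, it extracts the data between the Ingredients on line 3 of file.txt and Preparation, not the data between the italicized Ingredients (the heading directly above the ingredient list) and Preparation.
What changes should I make to my regex to fix this?
Thanks in advance!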
(?s).*?[*]{2}Ingredients[*]{2}(.*?)[*]{2}Preparation[*]{2}.*
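[*]{2} tells the regex that you want one character from the list (here just a single *) twice ({2}).
I prefer character classes to escaping because I find them more readable than this: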
(?s).*?\*{2}Ingredients\*{2}(.*?)\*{2}Preparation\*{2}.*
Depending on the language you use, you may also need to escape the backslashes.
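As a rough illustration of the two spellings above, here is a minimal Python sketch. It assumes the recipe headings are literally wrapped in `**` in the file (the sample text below is a shortened, hypothetical excerpt) and uses raw strings so the backslashes need no extra escaping.

```python
import re

# Hypothetical excerpt: assumes the headings are literally written as "**...**".
text = (
    "Expert Advice\n"
    "Ingredients\n"
    "Community\n"
    "**Ingredients**\n"
    "5 cups chicken broth\n"
    "1 cup brown rice\n"
    "**Preparation**\n"
    "In a heavy kettle combine the broth\n"
)

# Character-class spelling: [*] matches a literal asterisk.
char_class = re.compile(
    r"(?s).*?[*]{2}Ingredients[*]{2}(.*?)[*]{2}Preparation[*]{2}.*"
)
# Escaped spelling: a raw string keeps \* intact, so no double backslash is needed.
escaped = re.compile(
    r"(?s).*?\*{2}Ingredients\*{2}(.*?)\*{2}Preparation\*{2}.*"
)

section_a = char_class.match(text).group(1).strip()
section_b = escaped.match(text).group(1).strip()
print(section_a)
```

Both patterns skip the plain Ingredients menu entry because it is not surrounded by asterisks.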
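You can use a lookahead to check that each line is not Ingredients. This way the lookahead is tested only at the start of each line (rather than at every character):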
(?m)^Ingredients\R((?:(?!Ingredients$).*\R)+?)Preparation$
Pattern details:
(?m) # switch on the multiline mode (^ and $ match the limit of the line)
^Ingredients\R # "Ingredients" at the start of the line followed by a new line
( # capture group 1
(?: # open a non-capturing group
(?!Ingredients$) # negative lookahead to check that the line is not "Ingredients"
.*\R # the line
)+? # repeat until "Preparation"
)
Preparation$
Note: since you didn't say which regex engine you are using, \R may not be supported. In that case, replace it with \r?\n.
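Python's built-in `re` module is one engine without `\R`, so the sketch below applies the suggested `\r?\n` replacement. The sample text is a shortened, hypothetical stand-in for file.txt, and the variable names are only illustrative.

```python
import re

# Shortened stand-in for file.txt: a menu entry "Ingredients" precedes the real heading.
text = (
    "Recipes & Menus\n"
    "Expert Advice\n"
    "Ingredients\n"
    "Holidays & Events\n"
    "Community\n"
    "Ingredients\n"
    "5 cups chicken broth\n"
    "1 cup brown rice\n"
    "Preparation\n"
    "In a heavy kettle combine the broth\n"
)

# Same pattern as above, with \R replaced by \r?\n.
pattern = re.compile(
    r"(?m)^Ingredients\r?\n((?:(?!Ingredients$).*\r?\n)+?)Preparation$"
)

match = pattern.search(text)
# Group 1 holds the lines between the last "Ingredients" heading and "Preparation".
ingredients = match.group(1).splitlines()
print(ingredients)
```

The attempt starting at the first Ingredients fails as soon as the lookahead meets the second Ingredients line, so the search moves on and succeeds from the second heading.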
You can use the lazy quantifier .*? for the second .* only, leaving the first one greedy:
(?s).*\bIngredients\b(.*?)\bPreparation\b
Or you can use a tempered greedy token, in which case you don't need the first .*:
(?s)\bIngredients\b(?:(?!\b(?:Ingredients|Preparation)\b).)*\bPreparation\b
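A small Python sketch of the tempered greedy token follows. The capture group around the repeated token is an addition for convenience (the pattern above has none), and the sample text is a shortened, hypothetical version of file.txt.

```python
import re

# Shortened stand-in for file.txt.
text = (
    "Recipes & Menus\n"
    "Expert Advice\n"
    "Ingredients\n"
    "Holidays & Events\n"
    "Community\n"
    "Ingredients\n"
    "5 cups chicken broth\n"
    "1 cup brown rice\n"
    "Preparation\n"
    "In a heavy kettle combine the broth\n"
)

# (?:(?!...).)* consumes one character at a time, but only while neither
# keyword starts at the current position. The outer parentheses are added
# here so the section can be read from group 1.
pattern = re.compile(
    r"(?s)\bIngredients\b((?:(?!\b(?:Ingredients|Preparation)\b).)*)\bPreparation\b"
)

match = pattern.search(text)
section = match.group(1).strip()
print(section)
```

Because the token refuses to step over another Ingredients, a match can only begin at the last Ingredients before Preparation.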
Try making your first .* greedy. It will eat every Ingredients up to the last one before Preparation:
(?s).*Ingredients(.*?)Preparation.*
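To see the difference between the lazy and the greedy leading .*, here is a minimal Python comparison. The text is a shortened, hypothetical stand-in for file.txt, and the variable names are illustrative only.

```python
import re

# Shortened stand-in for file.txt.
text = (
    "Recipes & Menus\n"
    "Expert Advice\n"
    "Ingredients\n"
    "Holidays & Events\n"
    "Community\n"
    "Ingredients\n"
    "5 cups chicken broth\n"
    "1 cup brown rice\n"
    "Preparation\n"
    "In a heavy kettle combine the broth\n"
)

# Original pattern: the lazy .*? stops at the FIRST "Ingredients".
lazy_first = re.compile(r"(?s).*?Ingredients(.*?)Preparation.*")
# Suggested fix: the greedy .* backtracks to the LAST "Ingredients"
# that is still followed by "Preparation".
greedy_first = re.compile(r"(?s).*Ingredients(.*?)Preparation.*")

lazy_section = lazy_first.match(text).group(1)
greedy_section = greedy_first.match(text).group(1)

print(repr(lazy_section))    # includes the menu lines and the second heading
print(repr(greedy_section))  # only the ingredient lines
```

Note that the greedy version relies on no later Ingredients ... Preparation pair appearing further down in the file.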