How to extract regex query until a specific word?

Question

I am trying to extract certain data from LookML (a specific markup language). If this is the sample code:

explore: explore_name {}
explore: explore_name1 {
  label: "name"
  join: view_name {
      relationship: many_to_one
      type: inner
      sql_on: ${activity_type.activity_name}=${activity_type.activity_name} ;;
  }
}
explore: explore_name3 {}

Then I would receive a list that looks like this:

explore: character_balance {}

label: "name"
join: activity_type {
  relationship: many_to_one
  type: inner
  sql_on: ${activity_type.activity_name}=${activity_type.activity_name} ;;
}

explore: explore_name4 {}

Essentially, I start a match at "explore" and end it when another "explore" is found, at which point the next match begins.

Here is what I had before. It matches everything up to the first ; and works very well: 'explore:\s[^;]*'. However, it stops at the ';', assuming there is one.

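How would I change this so that it matches everything between one 'explore' and the next 'explore'? Simply replacing the ';' in my regex with 'explore' makes it stop as soon as it finds any letter in [e,x,p,l,o,r,e], because a negated character class excludes single characters rather than a word. That is not the behavior I want. Removing the square brackets and the ^ ends up breaking everything, because the pattern can then no longer match across multiple lines.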

What should I do here?

Answer 1

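You can use a non-greedy match with a lookahead assertion that checks for either another explore: or the end of the string. For the match to span several lines, . must also be allowed to match newlines (re.DOTALL in Python). Try: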

'explore:.*?(?=explore:|$)'
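
As a minimal sketch of this approach, the snippet below runs the lookahead pattern against the sample from the question. Two details are worth stating explicitly: re.DOTALL is passed so that . can cross newlines, and the lookahead here includes the colon (explore:), since names such as explore_name would otherwise end a match early. The variable names text, pattern and blocks are only illustrative.

```python
import re

text = '''explore: explore_name {}
explore: explore_name1 {
  label: "name"
  join: view_name {
      relationship: many_to_one
      type: inner
      sql_on: ${activity_type.activity_name}=${activity_type.activity_name} ;;
  }
}
explore: explore_name3 {}'''

# re.DOTALL lets "." match newlines. re.MULTILINE is deliberately not set,
# so "$" only matches at the end of the whole string.
pattern = re.compile(r'explore:.*?(?=explore:|$)', re.DOTALL)

# Each match runs up to (but not including) the next "explore:",
# so trailing newlines are stripped afterwards.
blocks = [m.strip() for m in pattern.findall(text)]

for block in blocks:
    print(repr(block))
```

This still splits in the wrong place if explore: ever appears inside a quoted string or a comment.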

Answer 2

While this is doable with regex, you should use a parser that understands the format, because a regex solution will be very fragile.

That said, here is a regex solution with DOTALL mode enabled (in which . matches any character, including newlines):

re.findall(r'explore:.*?\}', text, re.DOTALL)

explore: matches literally
.*?\} matches non-greedily up to the next }

Example:

In [1253]: text = '''explore: character_balance {}
      ...: explore: tower_ends {
      ...:   label: "Tower Results"
      ...:   join: activity_type {
      ...:       relationship: many_to_one
      ...:       type: inner
      ...:       sql_on: ${activity_type.activity_name}=${wba_fact_activity.activity_name} ;;
      ...:   }
      ...: }
      ...: explore: seven11_core_session_start {}'''

In [1254]: re.findall(r'explore:.*?\}', text, re.DOTALL)
Out[1254]: 
['explore: character_balance {}',
 'explore: tower_ends {\n  label: "Tower Results"\n  join: activity_type {\n      relationship: many_to_one\n      type: inner\n      sql_on: ${activity_type.activity_name}',
 'explore: seven11_core_session_start {}']
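
Note that the second match in the output above is cut off at the first }, which belongs to ${activity_type.activity_name} rather than to the explore block. As a rough sketch of the "use something that understands the format" idea, the hypothetical helper below (extract_explores is not part of any library) tracks brace depth instead. It assumes braces are balanced and that neither braces nor the text explore: occur inside quoted strings; a real LookML parser would not need those assumptions.

```python
def extract_explores(text):
    """Return every top-level `explore:` block by counting brace depth.

    Assumptions (not guaranteed by LookML): braces are balanced, and no
    brace or `explore:` appears inside a quoted string or comment.
    """
    blocks = []
    pos = 0
    while True:
        start = text.find('explore:', pos)
        if start == -1:
            break
        open_brace = text.find('{', start)
        if open_brace == -1:
            break
        depth = 0
        for i in range(open_brace, len(text)):
            if text[i] == '{':
                depth += 1
            elif text[i] == '}':
                depth -= 1
                if depth == 0:
                    # Matching close of the explore's own opening brace.
                    blocks.append(text[start:i + 1])
                    pos = i + 1
                    break
        else:
            # Ran off the end without closing: unbalanced input, stop.
            break
    return blocks


text = '''explore: character_balance {}
explore: tower_ends {
  label: "Tower Results"
  join: activity_type {
      relationship: many_to_one
      type: inner
      sql_on: ${activity_type.activity_name}=${wba_fact_activity.activity_name} ;;
  }
}
explore: seven11_core_session_start {}'''

blocks = extract_explores(text)
for block in blocks:
    print(block)
    print('---')
```

Unlike the non-greedy pattern, this returns the tower_ends block complete with its nested join and both ${...} references.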

Answer 3

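A naive approach is to match up to the next "explore" word. But if, for whatever reason, a string value contains this word, you will get wrong results. The same problem occurs if you try to stop at a closing curly bracket when the text contains nested brackets.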

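That is why I suggest a more precise description of the syntax, one that takes quoted strings and nested curly brackets into account. Since the re module doesn't have the recursion feature (needed to handle nested structures), I will use the pypi/regex module instead: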

import regex

pat = r'''(?xms)
    \b explore:
    [^\S\r\n]* # optional horizontal whitespaces
    [^\n{]* # possible content of the same line
    # followed by two possibilities
    (?: # the content stops at the end of the line with a ;
        ; [^\S\r\n]* $
      | # or it contains curly brackets and spreads over eventually multiple lines
        ( # group 1
            {
                [^{}"]*+ # all that isn't curly brackets nor double quotes
                (?:
                    " [^"\\]*+ (?: \\. [^"\\]* )*+ " # contents between quotes, allowing backslash escapes
                    [^{}"]*

                  |
                    (?1) # nested curly brackets, recursion in the group 1
                    [^{}"]*
                )*+
            }
        )
    )'''

results = [x.group(0) for x in regex.finditer(pat, yourstring)]


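To be more rigorous, you could add support for single-quoted strings, and use the (*SKIP)(*FAIL) construct to prevent the "explore:" at the start of the pattern from matching inside a string.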

python

regex

regex-lookarounds

looker