使用 Python RegEx re.findall 解析文本

Parsing Text with Python RegEx re.findall

我有一个很长的字符串,需要分组解析,但需要多加控制

import re

RAW_Data = "Name Multiple Words Testing With 1234 Numbers and this stuff* ((Bla Bla Bla (Bla Bla) A40 & A41)) Name Multiple Words Testing With 3456 Numbers and this stuff2* ((Bla Bla Bla (Bla Bla) A42 & A43)) Name Multiple Words Testing With 78910 Numbers and this stuff3* ((Bla Bla Bla (Bla Bla) A44 & A45)) Name Multiple Words Testing With 1234 Numbers and this stuff4* ((Bla Bla Bla (Bla Bla) A46 & A47)) Name Multiple Words Testing With 1234 Numbers and this stuff5* ((Bla Bla Bla (Bla Bla) A48 & A49)) Name Multiple Words Testing With 1234 Numbers and this stuff6* ((Bla Bla Bla (Bla Bla) A50 & A51)) Name Multiple Words Testing With 1234 Numbers and this stuff7* ((Bla Bla Bla (Bla Bla) A52 & A53)) Name Multiple Words Testing With 1234 Numbers and this stuff8* ((Bla Bla Bla (Bla Bla) A54 & A55)) Name Multiple Words Testing With 1234 Numbers and this stuff9* ((Bla Bla Bla (Bla Bla) A56 & A57)) Name Multiple Words Testing With 1234 Numbers and this stuff10* ((Bla Bla Bla (Bla Bla) A58 & A59)) Name Multiple Words Testing With 1234 Numbers and this stuff11* ((Bla Bla Bla (Bla Bla) A60 & A61)) Name Multiple Words Testing With 1234 Numbers and this stuff12* ((Bla Bla Bla (Bla Bla) A62 & A63)) Name Multiple Words Testing With 1234 Numbers and this stuff13* ((Bla Bla Bla (Bla Bla) A64 & A65)) Name Multiple Words Testing With 1234 Numbers and this stuff14* ((Bla Bla Bla (Bla Bla) A66 & A67)) Name Multiple Words Testing With 1234 Numbers and this stuff15* ((Bla Bla Bla (Bla Bla) A68 & A69)) Name Multiple Words Testing With 1234 Numbers and this stuff16*"

fromnode = re.findall('(.*?)(?=\*\s)', RAW_Data)

print fromnode

del fromnode
del RAW_Data

结果是:'Name Multiple Words Testing With 1234 Numbers and this stuff','','((Bla Bla Bla (Bla Bla) A40 & A41)) 用 3456 个数字和这个东西进行命名多个单词测试2' 。 .......等等。

我似乎无法只捕获像 "Name Multiple Words Testing With 3456 Numbers and this stuff" 这样的字符串而忽略像“((Bla Bla Bla (Bla Bla) A40 & A41))”这样的所有字符串。任何帮助将不胜感激。

你可以拆分

r'\*\s*\({2}.*?\){2}\s*'

模式(see demo)匹配:

  • \* - 文字星号
  • \s* - 零个或多个空格
  • \({2} - 恰好 2 个左括号
  • .*? - 除换行符以外的零个或多个字符(注意:如果需要匹配多行,请添加 re.S 标志)尽可能少,直到第一行
  • \){2} - 双右括号
  • \s* - 0+ 空格。

另外:same, but unrolled (thus, a bit more efficient) regex

\*\s*\({2}[^)]*(?:\)(?!\))[^)]*)*\){2}\s*

IDEONE demo:

import re
p = re.compile(r'\*\s*\({2}.*?\){2}\s*')
test_str = "Name Multiple Words Testing With 1234 Numbers and this stuff* ((Bla Bla Bla (Bla Bla) A40 & A41)) Name Multiple Words Testing With 3456 Numbers and this stuff2* ((Bla Bla Bla (Bla Bla) A42 & A43)) Name Multiple Words Testing With 78910 Numbers and this stuff3* ((Bla Bla Bla (Bla Bla) A44 & A45)) Name Multiple Words Testing With 1234 Numbers and this stuff4* ((Bla Bla Bla (Bla Bla) A46 & A47)) Name Multiple Words Testing With 1234 Numbers and this stuff5* ((Bla Bla Bla (Bla Bla) A48 & A49)) Name Multiple Words Testing With 1234 Numbers and this stuff6* ((Bla Bla Bla (Bla Bla) A50 & A51)) Name Multiple Words Testing With 1234 Numbers and this stuff7* ((Bla Bla Bla (Bla Bla) A52 & A53)) Name Multiple Words Testing With 1234 Numbers and this stuff8* ((Bla Bla Bla (Bla Bla) A54 & A55)) Name Multiple Words Testing With 1234 Numbers and this stuff9* ((Bla Bla Bla (Bla Bla) A56 & A57)) Name Multiple Words Testing With 1234 Numbers and this stuff10* ((Bla Bla Bla (Bla Bla) A58 & A59)) Name Multiple Words Testing With 1234 Numbers and this stuff11* ((Bla Bla Bla (Bla Bla) A60 & A61)) Name Multiple Words Testing With 1234 Numbers and this stuff12* ((Bla Bla Bla (Bla Bla) A62 & A63)) Name Multiple Words Testing With 1234 Numbers and this stuff13* ((Bla Bla Bla (Bla Bla) A64 & A65)) Name Multiple Words Testing With 1234 Numbers and this stuff14* ((Bla Bla Bla (Bla Bla) A66 & A67)) Name Multiple Words Testing With 1234 Numbers and this stuff15* ((Bla Bla Bla (Bla Bla) A68 & A69)) Name Multiple Words Testing With 1234 Numbers and this stuff16*"
print(re.split(p, test_str))

更新

用于 re.findall 的正则表达式:

(?:\*\s*\(\([^)]*(?:\)(?!\))[^)]*)*\)\))?\s*([^*]*(?:\*(?!\s*\(\()[^*]*)*)\s*

regex demo

被它的样子吓坏了?它只是简单得多的 (?:\*\s*\(\(.*?\)\))?\s*(.*?(?=\*\s*(?:\(\(|$))).

的展开版本

参见IDEONE demo