Parsing a string with nested quotes
I need to parse a string that looks like this:
"prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"
I want to get the following list:
['field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10']
I have tried regular expressions, but they fail on the substring that contains the pseudo-SQL statement.
How can I get that list?
As someone pointed out, your string is malformed, so I used this (triple-quoted, because the string contains a stray double quote):
mystr = """prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"""
found = [a.replace("'", '').replace(',', '') for a in mystr.split(' ') if "'" in a]
which returns:
['field1',
'',
'field2',
'field3',
'select',
'2017)',
'(((literal1',
'literal2',
'literal3',
'literal4',
'literal5',
'literal6',
'literal7)',
'(literal8)',
'literal9)',
'',
'field5',
'field6',
'field7',
'field8',
'field9',
'',
'field10']
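If you know what the SQL string is supposed to look like, here is a simple approach.
We match the SQL string and split the rest of the input into a start string and an end string.
Then we match the simpler field pattern: we build a list from the matches in the start string, append the SQL match, and then append the fields found in the end string.
import re
# mystring is the input string from the question
sqlmatch = 'select .* LIMIT 0'
fieldmatch = r"'(|\w+)'"  # an empty field or a run of word characters, in single quotes
match = re.search(sqlmatch, mystring)
startstring = mystring[:match.start()]
sql = mystring[match.start():match.end()]
endstring = mystring[match.end():]
result = []
for found in re.findall(fieldmatch, startstring):
    result.append(found)
result.append(sql)
for found in re.findall(fieldmatch, endstring):
    result.append(found)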
The resulting list looks like this:
['field1',
'',
'field2',
'field3',
'select ... where (column1 = \'2017\') and (((\'literal1\', \'literal2\', \'literal3\', \'literal4\', \'literal5\', \'literal6\', \'literal7\') OVERLAPS column2 Or (\'literal8\') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = \'literal9\') LIMIT 0',
'field5',
'field6',
'field7',
'field8',
'field9',
'',
'field10']
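The snippet above raises an AttributeError when the SQL pattern is not found, because re.search then returns None. A minimal sketch of the same idea wrapped in a function with that case handled follows. The function name split_around_sql, the LIMIT \d+ generalization, and the short sample string are assumptions made for illustration; the field pattern '(\w*)' is just an equivalent spelling of '(|\w+)'.

```python
import re


def split_around_sql(text, sql_pattern=r"select .* LIMIT \d+", field_pattern=r"'(\w*)'"):
    """Return the simple fields before the SQL, the SQL itself, then the fields after it."""
    match = re.search(sql_pattern, text)
    if match is None:
        # No SQL-like substring: every quoted token is a plain field.
        return re.findall(field_pattern, text)
    before = re.findall(field_pattern, text[:match.start()])
    after = re.findall(field_pattern, text[match.end():])
    return before + [match.group(0)] + after


# Hypothetical, shortened input with the same shape as the question's string.
sample = "prefix 'a', '', 'select x where y = '1' LIMIT 0 ', 'b'"
print(split_around_sql(sample))
# ['a', '', "select x where y = '1' LIMIT 0", 'b']
```

Like the original, this only works when the SQL reliably starts with select and ends with a LIMIT clause; the greedy .* extends to the last LIMIT in the input.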
If the number of fields is fixed, you can do this:
def splitter(string):
    strip_chars = "\"' "
    string = string[len('prefix '):]  # remove the prefix
    left_parts = string.split(',', 4)  # only split up to 4 times
    for i in left_parts[:-1]:
        yield i.strip(strip_chars)  # return what we've found so far
    right_parts = left_parts[-1].rsplit(',', 7)  # only split up to 7 times starting from the right
    for i in right_parts:
        yield i.strip(strip_chars)  # return the rest
mystr = """prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"""
result = list(splitter(mystr))
print(repr(result))
# result:
[
'field1',
'',
'field2',
'field3',
'select ... where (column1 = \'2017\') and (((\'literal1\', \'literal2\', \'literal3\', \'literal4\', \'literal5\', \'literal6\', \'literal7\') OVERLAPS column2 Or (\'literal8\') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = \'literal9\') LIMIT 0',
'field5',
'field6',
'field7',
'field8',
'field9',
'',
'field10'
]
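The commas that actually separate fields sit at an even quote level. So, by changing those commas to \n characters, you can apply a simple .split("\n") to the string to get the field values. Then you only need to clean up the field values by removing leading/trailing spaces and quotes.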
from itertools import accumulate
string = "prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"
prefix,data = string.split(" ",1) # remove prefix
quoteLevels = accumulate( c == "'" for c in data ) # compute quote levels for each character
fieldData = "".join([ "\n" if c=="," and q%2 == 0 else c for c,q in zip(data,quoteLevels) ]) # comma to \n at even quote levels
fields = [ f.strip().strip("'") for f in fieldData.split("'\n '") ] # split and clean content
for i,field in enumerate(fields): print(i,field)
This will print:
0 field1
1
2 field2
3 field3
4 select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0
5 field5
6 field6
7 field7
8 field8
9 field9
10
11 field10
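The quote-parity idea can also be written as an explicit loop, which may be easier to follow than the accumulate/zip version. This is a sketch only: the function name split_on_even_quotes and the shortened sample string are hypothetical, and it assumes, as the answer does, that every comma inside the SQL field falls after an odd number of quote characters.

```python
def split_on_even_quotes(data, quote="'", sep=","):
    """Split on separators that appear after an even number of quote characters."""
    fields, current, quotes_seen = [], [], 0
    for ch in data:
        if ch == quote:
            quotes_seen += 1
        if ch == sep and quotes_seen % 2 == 0:
            # Even parity: this comma is a real field separator.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))

    cleaned = []
    for field in fields:
        field = field.strip()
        # Remove exactly one wrapping quote per side, keeping inner quotes intact.
        if len(field) >= 2 and field[0] == quote and field[-1] == quote:
            field = field[1:-1]
        cleaned.append(field)
    return cleaned


# Hypothetical, shortened input (prefix already removed).
sample = "'a', '', 'select x where (y = '1') and z in ('p', 'q') ', 'b'"
print(split_on_even_quotes(sample))
# ['a', '', "select x where (y = '1') and z in ('p', 'q') ", 'b']
```

The parity trick is a heuristic rather than a real parser: a comma inside a nested literal, such as 'x, y' within the SQL, would sit at an even level and be treated as a field separator.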
Since the number of fields is fixed and the non-SQL fields contain no embedded quotes, there is a simple three-line solution:
prefix, other = string.partition(' ')[::2]
fields = other.strip('\'').split('\', \'')
fields[4:-7] = ['\', \''.join(fields[4:-7])]  # rejoin with the same delimiter so the SQL keeps its commas and quotes
print(fields)
Output:
['field1', '', 'field2', 'field3', "select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0 ", 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10']
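The fixed-count idea can be sketched with the counts as parameters. When the middle pieces are glued back together, joining with the same delimiter that was used for splitting keeps the commas and quotes inside the SQL. The function name split_fixed, its parameters, and the shortened sample are hypothetical, and the sketch assumes the input starts with a one-word prefix followed by a space.

```python
def split_fixed(text, n_before=4, n_after=7, delim="', '"):
    """Split on delim, then rejoin everything between the fixed head and tail fields."""
    _, _, body = text.partition(" ")  # drop the one-word prefix
    parts = body.strip("'").split(delim)
    if len(parts) < n_before + n_after + 1:
        raise ValueError("fewer fields than n_before + n_after + 1")
    end = len(parts) - n_after
    # Rejoin with the original delimiter so the SQL text is restored unchanged.
    middle = delim.join(parts[n_before:end])
    return parts[:n_before] + [middle] + parts[end:]


# Hypothetical, shortened input: one field before the SQL and one after it.
sample = "prefix 'a', 'select x in ('p', 'q') ', 'b'"
print(split_fixed(sample, n_before=1, n_after=1))
# ['a', "select x in ('p', 'q') ", 'b']
```

With the defaults (4 fields before, 7 after) this matches the layout of the string in the question.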