Handling different escaping sequences?
I am using ANTLR with the Presto grammar to parse SQL queries.
This is the original string definition I used to parse queries:
STRING
    : '\'' ( '\\' .
           | ~[\\']     // match anything other than \ and '
           | '\'\''     // match ''
           )*
      '\''
    ;
This worked for most queries until I came across queries with different escaping rules. For example:
select
table1(replace(replace(some_col,'\'',''),'\"' ,'')) as features
from table1
So I modified my string definition, and it now looks like this:
STRING
    : '\'' ( '\\' .
           | '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}? // match \\ followed by any char
           | ~[\\']     // match anything other than \ and '
           | '\'\''     // match ''
           )*
      '\''
    ;
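However, this does not work for the query I mentioned above. The lexer picks up
'\'',''),'
as a single string.
The predicate returns true for this query.
Any idea how I can handle this query as well?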
Thanks,
Neil
In the end I solved it. This is the expression I used:
STRING
    : '\'' ( '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}?
           | '\\' (~[\\] | . {!HelperUtils.isNeedSpecialEscaping(this)}?)
           | ~[\\']     // match anything other than \ and '
           | '\'\''     // match ''
           )*
      '\''
    ;
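To see why the same query text can be split into string literals in two different ways, here is a minimal hand-written sketch. It is not the ANTLR rule itself: the class `SqlStringScanner` and the flag `backslashEscapes` are hypothetical names introduced for illustration, and the flag only loosely plays the role of the `HelperUtils.isNeedSpecialEscaping` predicate. With the flag off, a backslash is an ordinary character and only `''` escapes a quote. With the flag on, a backslash and the character after it are consumed as a pair.

```java
import java.util.ArrayList;
import java.util.List;

class SqlStringScanner {
    /** Returns the index just past the literal that opens at start, or -1 if it never closes. */
    static int scanString(String s, int start, boolean backslashEscapes) {
        int i = start + 1; // skip the opening quote
        while (i < s.length()) {
            char c = s.charAt(i);
            if (backslashEscapes && c == '\\' && i + 1 < s.length()) {
                i += 2; // backslash plus the escaped character
            } else if (c == '\'') {
                if (i + 1 < s.length() && s.charAt(i + 1) == '\'') {
                    i += 2; // doubled quote '' stands for one quote
                } else {
                    return i + 1; // closing quote
                }
            } else {
                i++; // ordinary character
            }
        }
        return -1;
    }

    /** Collects every string literal found in s; stops at an unterminated literal. */
    static List<String> literals(String s, boolean backslashEscapes) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '\'') {
                int end = scanString(s, i, backslashEscapes);
                if (end < 0) break;
                out.add(s.substring(i, end));
                i = end;
            } else {
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sql = "replace(replace(some_col,'\\'',''),'\\\"' ,'')";
        System.out.println(literals(sql, true));  // ['\'', '', '\"', '']
        System.out.println(literals(sql, false)); // ['\'',''),']
    }
}
```

With the flag off, the first literal swallows `'\'',''),'` in one piece and the rest of the line is left with an unterminated quote, which is the symptom described in the question.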
grammar Question;
sql
@init {System.out.println("Question last update 2352");}
: replace+ EOF
;
replace
: REPLACE '(' expr ')'
;
expr
: ( replace | ID ) ',' STRING ',' STRING
;
REPLACE : 'replace' DIGIT? ;
ID : [a-zA-Z0-9_]+ ;
DIGIT : [0-9] ;
STRING : '\'' '\\\'' '\'' // '\''
| '\'' '\'\'' '\'' // ''''
| '\'' ~[\']* '\'\'' ~[\']* '\'' // 'it is 8 o''clock'
| '\'' .*? '\'' ;
NL : '\r'? '\n' -> channel(HIDDEN) ;
WS : [ \t]+ -> channel(HIDDEN) ;
File input.txt
(with no further examples given, I can only guess):
replace1(replace(some_col,'\'',''),'\"' ,'')
replace2(some_col,'''','')
replace3(some_col,'abc\tdef\tghi','xyz')
replace4(some_col,'abc\ndef','xyz')
replace5(some_col,'it is 8 o''clock','8')
Execution:
$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Question*.java
$ grun Question sql -tokens input.txt
[@0,0:7='replace1',<REPLACE>,1:0]
[@1,8:8='(',<'('>,1:8]
[@2,9:15='replace',<REPLACE>,1:9]
[@3,16:16='(',<'('>,1:16]
[@4,17:24='some_col',<ID>,1:17]
[@5,25:25=',',<','>,1:25]
[@6,26:30=''\''',<STRING>,1:26]
[@7,31:31=',',<','>,1:31]
[@8,32:33='''',<STRING>,1:32]
[@9,34:34=')',<')'>,1:34]
[@10,35:35=',',<','>,1:35]
[@11,36:39=''\"'',<STRING>,1:36]
[@12,40:40=' ',<WS>,channel=1,1:40]
[@13,41:41=',',<','>,1:41]
[@14,42:43='''',<STRING>,1:42]
[@15,44:44=')',<')'>,1:44]
[@16,45:45='\n',<NL>,channel=1,1:45]
[@17,46:53='replace2',<REPLACE>,2:0]
[@18,54:54='(',<'('>,2:8]
[@19,55:62='some_col',<ID>,2:9]
[@20,63:63=',',<','>,2:17]
[@21,64:67='''''',<STRING>,2:18]
[@22,68:68=',',<','>,2:22]
[@23,69:70='''',<STRING>,2:23]
[@24,71:71=')',<')'>,2:25]
[@25,72:72='\n',<NL>,channel=1,2:26]
[@26,73:80='replace3',<REPLACE>,3:0]
[@27,81:81='(',<'('>,3:8]
[@28,82:89='some_col',<ID>,3:9]
[@29,90:90=',',<','>,3:17]
[@30,91:105=''abc\tdef\tghi'',<STRING>,3:18]
[@31,106:106=',',<','>,3:33]
[@32,107:111=''xyz'',<STRING>,3:34]
[@33,112:112=')',<')'>,3:39]
[@34,113:113='\n',<NL>,channel=1,3:40]
[@35,114:121='replace4',<REPLACE>,4:0]
[@36,122:122='(',<'('>,4:8]
[@37,123:130='some_col',<ID>,4:9]
[@38,131:131=',',<','>,4:17]
[@39,132:141=''abc\ndef'',<STRING>,4:18]
[@40,142:142=',',<','>,4:28]
[@41,143:147=''xyz'',<STRING>,4:29]
[@42,148:148=')',<')'>,4:34]
[@43,149:149='\n',<NL>,channel=1,4:35]
[@44,150:157='replace5',<REPLACE>,5:0]
[@45,158:158='(',<'('>,5:8]
[@46,159:166='some_col',<ID>,5:9]
[@47,167:167=',',<','>,5:17]
[@48,168:185=''it is 8 o''clock'',<STRING>,5:18]
[@49,186:186=',',<','>,5:36]
[@50,187:189=''8'',<STRING>,5:37]
[@51,190:190=')',<')'>,5:40]
[@52,191:191='\n',<NL>,channel=1,5:41]
[@53,192:191='<EOF>',<EOF>,6:0]
Question last update 2352
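The four alternatives of the STRING rule in the grammar above can also be tried out without generating a lexer. The sketch below is an approximation only: it transcribes the alternatives into a `java.util.regex` pattern, and `StringRuleRegex` and `strings` are hypothetical names. Regex alternation tries its branches from left to right, which resembles but is not identical to the way ANTLR resolves the rule, so treat it as a quick experiment rather than a substitute for the grun run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StringRuleRegex {
    // One regex branch per alternative of the STRING rule, in the same order.
    static final Pattern STRING = Pattern.compile(
          "'\\\\''"            // '\''
        + "|''''"              // ''''
        + "|'[^']*''[^']*'"    // 'it is 8 o''clock'
        + "|'.*?'");           // anything else up to the next quote

    /** Returns the string literals found in one input line, in order. */
    static List<String> strings(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = STRING.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(strings("replace2(some_col,'''','')"));
        System.out.println(strings("replace5(some_col,'it is 8 o''clock','8')"));
    }
}
```

The third branch allows only a single `''` pair inside a literal, just like the third alternative of the rule, so inputs with several doubled quotes are a useful thing to test before relying on it.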