Why does my Prolog S-expression tokenizer fail on its base case?
To learn some Prolog (I'm using GNU Prolog) and explore its parsing abilities, I started by writing a Lisp (or S-expression, to be precise) tokenizer, which, given a set of tokens like ['(', 'f', 'o', 'o', ')'], should produce ['(', 'foo', ')']. It is not working as expected, which is why I'm here! I think my thought process shows through in my pseudocode:
tokenize([current | rest], buffer, tokens):
    if current is '(' or ')',
        Tokenize the rest,
        And the output will be the current token buffer,
        Plus the parenthesis and the rest.
    if current is ' ',
        Tokenize the rest with a clean buffer,
        And the output will be the buffer plus the rest.
    if the tail is empty,
        The output will be a one-element list containing the buffer.
    otherwise,
        Add the current character to the buffer,
        And the output will be the rest tokenized, with a bigger buffer.
I translated it into Prolog like this:
tokenize([Char | Chars], Buffer, Tokens) :-
    ((Char = '(' ; Char = ')') ->
        tokenize(Chars, '', Tail_Tokens),
        Tokens is [Buffer, Char | Tail_Tokens];
    Char = ' ' ->
        tokenize(Chars, '', Tail_Tokens),
        Tokens is [Buffer | Tail_Tokens];
    Chars = [] -> Tokens is [Buffer];
    atom_concat(Buffer, Char, New_Buffer),
    tokenize(Chars, New_Buffer, Tokens)).
print_tokens([]) :- write('.').
print_tokens([T | N]) :- write(T), write(', '), print_tokens(N).
main :-
    % tokenize(['(', 'f', 'o', 'o', '(', 'b', 'a', 'r', ')', 'b', 'a', 'z', ')'], '', Tokens),
    tokenize(['(', 'f', 'o', 'o', ')'], '', Tokens),
    print_tokens(Tokens).
When I run it like so: gprolog --consult-file lisp_parser.pl, it just tells me no. I traced main, and it gave me the trace below. I do not understand why tokenize fails for the empty case. I see that the buffer is empty because it was cleared by the preceding ')', but even if Tokens is empty at that point in time, wouldn't Tokens accumulate a bigger result recursively? Could somebody who knows Prolog give me some hints here?
| ?- main.
no
| ?- trace.
The debugger will first creep -- showing everything (trace)
(1 ms) yes
{trace}
| ?- main.
1 1 Call: main ?
2 2 Call: tokenize(['(',f,o,o,')'],'',_353) ?
3 3 Call: tokenize([f,o,o,')'],'',_378) ?
4 4 Call: atom_concat('',f,_403) ?
4 4 Exit: atom_concat('',f,f) ?
5 4 Call: tokenize([o,o,')'],f,_429) ?
6 5 Call: atom_concat(f,o,_454) ?
6 5 Exit: atom_concat(f,o,fo) ?
7 5 Call: tokenize([o,')'],fo,_480) ?
8 6 Call: atom_concat(fo,o,_505) ?
8 6 Exit: atom_concat(fo,o,foo) ?
9 6 Call: tokenize([')'],foo,_531) ?
10 7 Call: tokenize([],'',_556) ?
10 7 Fail: tokenize([],'',_544) ?
9 6 Fail: tokenize([')'],foo,_519) ?
7 5 Fail: tokenize([o,')'],fo,_468) ?
5 4 Fail: tokenize([o,o,')'],f,_417) ?
3 3 Fail: tokenize([f,o,o,')'],'',_366) ?
2 2 Fail: tokenize(['(',f,o,o,')'],'',_341) ?
1 1 Fail: main ?
(1 ms) no
{trace}
| ?-
How about this? I think this is what you want to do, but let's use Definite Clause Grammars (which are just Horn clauses with :- replaced by --> and two elided arguments carrying the input character list and the remaining character list). An example DCG rule:
rule(X) --> [c], another_rule(X), {predicate(X)}.
This list-processing rule rule//1 says: when you find the character c in the input list, continue list processing with another_rule//1, and when that succeeds, call predicate(X) as a normal goal.
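For intuition, a DCG rule compiles into an ordinary clause with two extra list arguments threaded through it. Here is a hedged sketch of roughly what the example rule above becomes (the exact expansion varies between Prolog systems):
rule(X, S0, S) :-
    S0 = [c | S1],            % consume the character c from the input list
    another_rule(X, S1, S2),  % another_rule//1 consumes more of the list
    predicate(X),             % a goal in {...} consumes no input at all
    S = S2.                   % hand the remaining list back to the caller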
Then:
% If we encounter a separator symbol '(' or ')', we commit to the
% clause using '!' (no point trying anything else, in particular
% not the clause for "other characters"), tokenize the rest of the list,
% and when we have done that, decide whether 'MaybeToken', which is
% "part of the leftmost token after '(' or ')'", should be retained:
% it is dropped if it is empty. The caller is then given an empty
% "part of the leftmost token" and the list of tokens, with '(' or ')'
% prepended: "tokenize('', [ '(' | MoreTokens] ) -->"
tokenize('', [ '(' | MoreTokens] ) -->
    ['('],
    !,
    tokenize(MaybeToken,Tokens),
    {drop_empty(MaybeToken,Tokens,MoreTokens)}.
tokenize('',[')'|MoreTokens]) -->
    [')'],
    !,
    tokenize(MaybeToken,Tokens),
    {drop_empty(MaybeToken,Tokens,MoreTokens)}.
% No more characters in the input list (that's what '--> []' says).
% We succeed, with an empty token list and an empty buffer for the
% leftmost token.
tokenize('',[]) --> [].
% If we find a 'Ch' that is not '(' or ')', then tokenize
% more of the list via 'tokenize(MaybeToken,Tokens)'. On
% return, 'MaybeToken' is a piece of the leftmost token found
% in that list, so we have to stick 'Ch' onto its start.
tokenize(LargerMaybeToken,Tokens) -->
    [Ch],
    tokenize(MaybeToken,Tokens),
    {atom_concat(Ch,MaybeToken,LargerMaybeToken)}.
% ---
% This drops an empty "MaybeToken". If "MaybeToken" is
% *not* empty, it is actually a token and is prepended to the list "Tokens".
% ---
drop_empty('',Tokens,Tokens) :- !.
drop_empty(MaybeToken,Tokens,[MaybeToken|Tokens]).
% -----------------
% Call the DCG using phrase/2
% -----------------
tokenize(Text,Result) :-
    phrase( tokenize(MaybeToken,Tokens), Text ),
    drop_empty(MaybeToken,Tokens,Result),!.
And so:
?- tokenize([h,e,l,l,o],R).
R = [hello].
?- tokenize([h,e,l,'(',l,')',o],R).
R = [hel,(,l,),o].
?- tokenize([h,e,l,'(',l,l,')',o],R).
R = [hel,(,ll,),o].
I think that in GNU Prolog, the notation `hello` directly yields [h,e,l,l,o].
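For completeness, a hedged check against the question's original input (I have not run this under GNU Prolog, and the exact display of the result may vary):
?- tokenize(['(',f,o,o,')'],R).
R = ['(',foo,')'].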
I do not understand why tokenize
fails for the empty case.
The reason anything fails in Prolog is that there is no clause that makes it true. If the only clause you have for tokenize is of the form tokenize([Char | Chars], ...), then a call of the form tokenize([], ...) can never match this clause, and since there are no other clauses, the call fails.
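To make that concrete, here is a hedged, minimal illustration of the same failure mode with a hypothetical my_len/2 predicate (the name and example are mine, not from the question):
% Only a clause for non-empty lists: every call eventually recurses
% down to my_len([], _), which matches no clause and therefore fails.
my_len([_ | Tail], N) :- my_len(Tail, M), N is M + 1.
% ?- my_len([a, b], N).   fails for exactly that reason
% Adding the missing base clause fixes it:
% my_len([], 0).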
So you need to add such a clause. But first:
:- set_prolog_flag(double_quotes, chars).
This allows you to write ['(', f, o, o, ')'] as "(foo)".
Also, you must plan for the case where the input is entirely empty, as well as for the other cases where you may have to emit a token for the buffer, but only if it is not '' (since there should be no '' tokens in the result):
finish_buffer(Tokens, Buffer, TokensMaybeWithBuffer) :-
    (   Buffer = ''
    ->  TokensMaybeWithBuffer = Tokens
    ;   TokensMaybeWithBuffer = [Buffer | Tokens] ).
For example:
?- finish_buffer(MyTokens, '', TokensMaybeWithBuffer).
MyTokens = TokensMaybeWithBuffer.
?- finish_buffer(MyTokens, 'foo', TokensMaybeWithBuffer).
TokensMaybeWithBuffer = [foo|MyTokens].
Note that you can prepend the buffer to the list of tokens even if you don't yet know what that list is! This is the power of logic variables. The rest of the code uses this technique as well.
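A hedged aside to illustrate that technique: you can build [Buffer | Tokens] while Tokens is still an unbound variable and bind it afterwards, and the earlier term sees the binding.
?- Pair = [foo | Tokens], Tokens = [bar, baz].
Pair = [foo, bar, baz],
Tokens = [bar, baz].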
So, the case of the empty input:
tokenize([], Buffer, Tokens) :-
    finish_buffer([], Buffer, Tokens).
For example:
?- tokenize([], '', Tokens).
Tokens = [].
?- tokenize([], 'foo', Tokens).
Tokens = [foo].
The remaining cases:
tokenize([Parenthesis | Chars], Buffer, TokensWithParenthesis) :-
    (   Parenthesis = '('
    ;   Parenthesis = ')' ),
    finish_buffer([Parenthesis | Tokens], Buffer, TokensWithParenthesis),
    tokenize(Chars, '', Tokens).
tokenize([' ' | Chars], Buffer, TokensWithBuffer) :-
    finish_buffer(Tokens, Buffer, TokensWithBuffer),
    tokenize(Chars, '', Tokens).
tokenize([Char | Chars], Buffer, Tokens) :-
    Char \= '(',
    Char \= ')',
    Char \= ' ',
    atom_concat(Buffer, Char, NewBuffer),
    tokenize(Chars, NewBuffer, Tokens).
Note how I used separate clauses for the separate cases. This makes the code more readable, but compared to (... -> ... ; ...) it does have the drawback that the last clause must exclude the characters handled by the earlier clauses. Once your code is in this shape and you are happy that it works, you can transform it into a form using (... -> ... ; ...) if you really want to (a sketch of that form follows the examples below).
Examples:
?- tokenize("(foo)", '', Tokens).
Tokens = ['(', foo, ')'] ;
false.
?- tokenize(" (foo)", '', Tokens).
Tokens = ['(', foo, ')'] ;
false.
?- tokenize("(foo(bar)baz)", '', Tokens).
Tokens = ['(', foo, '(', bar, ')', baz, ')'] ;
false.
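As mentioned above, here is a hedged sketch of what the (... -> ... ; ...) form might look like. I've named it tokenize_ite/3 to avoid clashing with the clauses above (the name is mine), and it assumes finish_buffer/3 as defined earlier:
tokenize_ite([], Buffer, Tokens) :-
    finish_buffer([], Buffer, Tokens).
tokenize_ite([Char | Chars], Buffer, Tokens) :-
    (   ( Char = '(' ; Char = ')' )
    ->  finish_buffer([Char | Rest], Buffer, Tokens),
        tokenize_ite(Chars, '', Rest)
    ;   Char = ' '
    ->  finish_buffer(Rest, Buffer, Tokens),
        tokenize_ite(Chars, '', Rest)
    ;   atom_concat(Buffer, Char, NewBuffer),
        tokenize_ite(Chars, NewBuffer, Tokens)
    ).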
Finally, and very importantly: the is operator is meant to be used only for the evaluation of arithmetic expressions. It throws an exception when you apply it to anything that is not arithmetic. Unification is different from the evaluation of arithmetic expressions; unification is written as =.
?- X is 2 + 2.
X = 4.
?- X = 2 + 2.
X = 2+2.
?- X is [a, b, c].
ERROR: Arithmetic: `[a,b,c]' is not a function
ERROR: In:
ERROR: [20] throw(error(type_error(evaluable,...),_3362))
ERROR: [17] arithmetic:expand_function([a,b|...],_3400,_3402) at /usr/lib/swi-prolog/library/arithmetic.pl:175
ERROR: [16] arithmetic:math_goal_expansion(_3450 is [a|...],_3446) at /usr/lib/swi-prolog/library/arithmetic.pl:147
ERROR: [14] '$expand':call_goal_expansion([system- ...],_3512 is [a|...],_3492,_3494,_3496) at /usr/lib/swi-prolog/boot/expand.pl:863
ERROR: [13] '$expand':expand_goal(_3566 is [a|...],_3552,_3554,_3556,user,[system- ...],_3562) at /usr/lib/swi-prolog/boot/expand.pl:524
ERROR: [12] setup_call_catcher_cleanup('$expand':'$set_source_module'(user,user),'$expand':expand_goal(...,_3640,_3642,_3644,user,...,_3650),_3614,'$expand':'$set_source_module'(user)) at /usr/lib/swi-prolog/boot/init.pl:443
ERROR: [8] '$expand':expand_goal(user:(_3706 is ...),_3692,user:_3714,_3696) at /usr/lib/swi-prolog/boot/expand.pl:458
ERROR: [6] setup_call_catcher_cleanup('$toplevel':'$set_source_module'(user,user),'$toplevel':expand_goal(...,...),_3742,'$toplevel':'$set_source_module'(user)) at /usr/lib/swi-prolog/boot/init.pl:443
ERROR:
ERROR: Note: some frames are missing due to last-call optimization.
ERROR: Re-run your program in debug mode (:- debug.) to get more detail.
^ Call: (14) call('$expand':'$set_source_module'(user)) ? abort
% Execution Aborted
?- X = [a, b, c].
X = [a, b, c].