Why does my Prolog S-expression tokenizer fail on its base case?
To learn some Prolog (I'm using GNU Prolog) and explore its parsing abilities, I started by writing a Lisp (or S-expression, to be precise) tokenizer, which, given a set of tokens like ['(', 'f', 'o', 'o', ')'], should produce ['(', 'foo', ')']. It is not working as expected, which is why I'm here! I think my thought process shows through in my pseudocode:
tokenize([current | rest], buffer, tokens):
    if current is '(' or ')',
        Tokenize the rest,
        And the output will be the current token buffer,
        Plus the parenthesis and the rest.
    if current is ' ',
        Tokenize the rest with a clean buffer,
        And the output will be the buffer plus the rest.
    if the tail is empty,
        The output will be a one-element list containing the buffer.
    otherwise,
        Add the current character to the buffer,
        And the output will be the rest tokenized, with a bigger buffer.
I translated it into Prolog like this:
tokenize([Char | Chars], Buffer, Tokens) :-
    ((Char = '(' ; Char = ')') ->
        tokenize(Chars, '', Tail_Tokens),
        Tokens is [Buffer, Char | Tail_Tokens];
    Char = ' ' ->
        tokenize(Chars, '', Tail_Tokens),
        Tokens is [Buffer | Tail_Tokens];
    Chars = [] -> Tokens is [Buffer];
    atom_concat(Buffer, Char, New_Buffer),
    tokenize(Chars, New_Buffer, Tokens)).
print_tokens([]) :- write('.').
print_tokens([T | N]) :- write(T), write(', '), print_tokens(N).
main :-
    % tokenize(['(', 'f', 'o', 'o', '(', 'b', 'a', 'r', ')', 'b', 'a', 'z', ')'], '', Tokens),
    tokenize(['(', 'f', 'o', 'o', ')'], '', Tokens),
    print_tokens(Tokens).
When I run it like so: gprolog --consult-file lisp_parser.pl, it just tells me no. I traced main, and it gave me the trace below. I do not understand why tokenize fails for the empty case. I see that the buffer is empty because it was cleared by the preceding ')', but even if Tokens is empty at that point in time, wouldn't Tokens accumulate a bigger result recursively? Could somebody who knows Prolog give me some hints here?
| ?- main.
no
| ?- trace.
The debugger will first creep -- showing everything (trace)
(1 ms) yes
{trace}
| ?- main.
1 1 Call: main ?
2 2 Call: tokenize(['(',f,o,o,')'],'',_353) ?
3 3 Call: tokenize([f,o,o,')'],'',_378) ?
4 4 Call: atom_concat('',f,_403) ?
4 4 Exit: atom_concat('',f,f) ?
5 4 Call: tokenize([o,o,')'],f,_429) ?
6 5 Call: atom_concat(f,o,_454) ?
6 5 Exit: atom_concat(f,o,fo) ?
7 5 Call: tokenize([o,')'],fo,_480) ?
8 6 Call: atom_concat(fo,o,_505) ?
8 6 Exit: atom_concat(fo,o,foo) ?
9 6 Call: tokenize([')'],foo,_531) ?
10 7 Call: tokenize([],'',_556) ?
10 7 Fail: tokenize([],'',_544) ?
9 6 Fail: tokenize([')'],foo,_519) ?
7 5 Fail: tokenize([o,')'],fo,_468) ?
5 4 Fail: tokenize([o,o,')'],f,_417) ?
3 3 Fail: tokenize([f,o,o,')'],'',_366) ?
2 2 Fail: tokenize(['(',f,o,o,')'],'',_341) ?
1 1 Fail: main ?
(1 ms) no
{trace}
| ?-
How about this? I think this is what you want to do, but let's use Definite Clause Grammars (which are just Horn clauses with :- replaced by --> and two elided arguments carrying the input character list and the remaining character list). An example DCG rule:
rule(X) --> [c], another_rule(X), {predicate(X)}.
This list-processing rule rule//1 says: when you find the character c in the input list, continue list processing with another_rule//1, and when that succeeds, call predicate(X) as a normal goal.
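For intuition, a DCG rule compiles into an ordinary clause with two extra list arguments threaded through it. Here is a hedged sketch of roughly what the example rule above becomes (the exact expansion varies between Prolog systems):
rule(X, S0, S) :-
    S0 = [c | S1],            % consume the character c from the input list
    another_rule(X, S1, S2),  % another_rule//1 consumes more of the list
    predicate(X),             % a goal in {...} consumes no input at all
    S = S2.                   % hand the remaining list back to the caller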
Then:
% If we encounter a separator symbol '(' or ')', we commit to the
% clause using '!' (no point trying anything else, in particular
% not the clause for "other characters"), tokenize the rest of the list,
% and when we have done that, decide whether 'MaybeToken', which is
% "part of the leftmost token after '(' or ')'", should be retained:
% it is dropped if it is empty. The caller is then given an empty
% "part of the leftmost token" and the list of tokens, with '(' or ')'
% prepended: "tokenize('', [ '(' | MoreTokens] ) -->"
tokenize('', [ '(' | MoreTokens] ) -->
    ['('],
    !,
    tokenize(MaybeToken,Tokens),
    {drop_empty(MaybeToken,Tokens,MoreTokens)}.
tokenize('',[')'|MoreTokens]) -->
    [')'],
    !,
    tokenize(MaybeToken,Tokens),
    {drop_empty(MaybeToken,Tokens,MoreTokens)}.
% No more characters in the input list (that's what '--> []' says).
% We succeed, with an empty token list and an empty buffer for the
% leftmost token.
tokenize('',[]) --> [].
% If we find a 'Ch' that is not '(' or ')', then tokenize
% more of the list via 'tokenize(MaybeToken,Tokens)'. On
% return, 'MaybeToken' is a piece of the leftmost token found
% in that list, so we have to stick 'Ch' onto its start.
tokenize(LargerMaybeToken,Tokens) -->
    [Ch],
    tokenize(MaybeToken,Tokens),
    {atom_concat(Ch,MaybeToken,LargerMaybeToken)}.
% ---
% This drops an empty "MaybeToken". If "MaybeToken" is
% *not* empty, it is actually a token and is prepended to the list "Tokens".
% ---
drop_empty('',Tokens,Tokens) :- !.
drop_empty(MaybeToken,Tokens,[MaybeToken|Tokens]).
% -----------------
% Call the DCG using phrase/2
% -----------------
tokenize(Text,Result) :-
    phrase( tokenize(MaybeToken,Tokens), Text ),
    drop_empty(MaybeToken,Tokens,Result),!.
And so:
?- tokenize([h,e,l,l,o],R).
R = [hello].
?- tokenize([h,e,l,'(',l,')',o],R).
R = [hel,(,l,),o].
?- tokenize([h,e,l,'(',l,l,')',o],R).
R = [hel,(,ll,),o].
I think that in GNU Prolog, the notation `hello` directly yields [h,e,l,l,o].
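For completeness, a hedged check against the question's original input (I have not run this under GNU Prolog, and the exact display of the result may vary):
?- tokenize(['(',f,o,o,')'],R).
R = ['(',foo,')'].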
I do not understand why tokenize
fails for the empty case.
The reason anything fails in Prolog is that there is no clause that makes it true. If the only clause you have for tokenize is of the form tokenize([Char | Chars], ...), then a call of the form tokenize([], ...) can never match this clause, and since there are no other clauses, the call fails.
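To make that concrete, here is a hedged, minimal illustration of the same failure mode with a hypothetical my_len/2 predicate (the name and example are mine, not from the question):
% Only a clause for non-empty lists: every call eventually recurses
% down to my_len([], _), which matches no clause and therefore fails.
my_len([_ | Tail], N) :- my_len(Tail, M), N is M + 1.
% ?- my_len([a, b], N).   fails for exactly that reason
% Adding the missing base clause fixes it:
% my_len([], 0).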
So you need to add such a clause. But first:
:- set_prolog_flag(double_quotes, chars).
This allows you to write ['(', f, o, o, ')'] as "(foo)".
Also, you must plan for the case where the input is entirely empty, as well as for the other cases where you may have to emit a token for the buffer, but only if it is not '' (since there should be no '' tokens in the result):
finish_buffer(Tokens, Buffer, TokensMaybeWithBuffer) :-
    (   Buffer = ''
    ->  TokensMaybeWithBuffer = Tokens
    ;   TokensMaybeWithBuffer = [Buffer | Tokens] ).
For example:
?- finish_buffer(MyTokens, '', TokensMaybeWithBuffer).
MyTokens = TokensMaybeWithBuffer.
?- finish_buffer(MyTokens, 'foo', TokensMaybeWithBuffer).
TokensMaybeWithBuffer = [foo|MyTokens].
Note that you can prepend the buffer to the list of tokens even if you don't yet know what that list is! This is the power of logic variables. The rest of the code uses this technique as well.
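A hedged aside to illustrate that technique: you can build [Buffer | Tokens] while Tokens is still an unbound variable and bind it afterwards, and the earlier term sees the binding.
?- Pair = [foo | Tokens], Tokens = [bar, baz].
Pair = [foo, bar, baz],
Tokens = [bar, baz].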
So, the case of the empty input:
tokenize([], Buffer, Tokens) :-
    finish_buffer([], Buffer, Tokens).
For example:
?- tokenize([], '', Tokens).
Tokens = [].
?- tokenize([], 'foo', Tokens).
Tokens = [foo].
The remaining cases:
tokenize([Parenthesis | Chars], Buffer, TokensWithParenthesis) :-
    (   Parenthesis = '('
    ;   Parenthesis = ')' ),
    finish_buffer([Parenthesis | Tokens], Buffer, TokensWithParenthesis),
    tokenize(Chars, '', Tokens).
tokenize([' ' | Chars], Buffer, TokensWithBuffer) :-
    finish_buffer(Tokens, Buffer, TokensWithBuffer),
    tokenize(Chars, '', Tokens).
tokenize([Char | Chars], Buffer, Tokens) :-
    Char \= '(',
    Char \= ')',
    Char \= ' ',
    atom_concat(Buffer, Char, NewBuffer),
    tokenize(Chars, NewBuffer, Tokens).
Note how I used separate clauses for the separate cases. This makes the code more readable, but compared to (... -> ... ; ...) it does have the drawback that the last clause must exclude the characters handled by the earlier clauses. Once your code is in this shape and you are happy that it works, you can transform it into a form using (... -> ... ; ...) if you really want to (a sketch of that form follows the examples below).
Examples:
?- tokenize("(foo)", '', Tokens).
Tokens = ['(', foo, ')'] ;
false.
?- tokenize(" (foo)", '', Tokens).
Tokens = ['(', foo, ')'] ;
false.
?- tokenize("(foo(bar)baz)", '', Tokens).
Tokens = ['(', foo, '(', bar, ')', baz, ')'] ;
false.
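As mentioned above, here is a hedged sketch of what the (... -> ... ; ...) form might look like. I've named it tokenize_ite/3 to avoid clashing with the clauses above (the name is mine), and it assumes finish_buffer/3 as defined earlier:
tokenize_ite([], Buffer, Tokens) :-
    finish_buffer([], Buffer, Tokens).
tokenize_ite([Char | Chars], Buffer, Tokens) :-
    (   ( Char = '(' ; Char = ')' )
    ->  finish_buffer([Char | Rest], Buffer, Tokens),
        tokenize_ite(Chars, '', Rest)
    ;   Char = ' '
    ->  finish_buffer(Rest, Buffer, Tokens),
        tokenize_ite(Chars, '', Rest)
    ;   atom_concat(Buffer, Char, NewBuffer),
        tokenize_ite(Chars, NewBuffer, Tokens)
    ).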
Finally, and very importantly: the is operator is meant to be used only for the evaluation of arithmetic expressions. It throws an exception when you apply it to anything that is not arithmetic. Unification is different from the evaluation of arithmetic expressions; unification is written as =.
?- X is 2 + 2.
X = 4.
?- X = 2 + 2.
X = 2+2.
?- X is [a, b, c].
ERROR: Arithmetic: `[a,b,c]' is not a function
ERROR: In:
ERROR: [20] throw(error(type_error(evaluable,...),_3362))
ERROR: [17] arithmetic:expand_function([a,b|...],_3400,_3402) at /usr/lib/swi-prolog/library/arithmetic.pl:175
ERROR: [16] arithmetic:math_goal_expansion(_3450 is [a|...],_3446) at /usr/lib/swi-prolog/library/arithmetic.pl:147
ERROR: [14] '$expand':call_goal_expansion([system- ...],_3512 is [a|...],_3492,_3494,_3496) at /usr/lib/swi-prolog/boot/expand.pl:863
ERROR: [13] '$expand':expand_goal(_3566 is [a|...],_3552,_3554,_3556,user,[system- ...],_3562) at /usr/lib/swi-prolog/boot/expand.pl:524
ERROR: [12] setup_call_catcher_cleanup('$expand':'$set_source_module'(user,user),'$expand':expand_goal(...,_3640,_3642,_3644,user,...,_3650),_3614,'$expand':'$set_source_module'(user)) at /usr/lib/swi-prolog/boot/init.pl:443
ERROR: [8] '$expand':expand_goal(user:(_3706 is ...),_3692,user:_3714,_3696) at /usr/lib/swi-prolog/boot/expand.pl:458
ERROR: [6] setup_call_catcher_cleanup('$toplevel':'$set_source_module'(user,user),'$toplevel':expand_goal(...,...),_3742,'$toplevel':'$set_source_module'(user)) at /usr/lib/swi-prolog/boot/init.pl:443
ERROR:
ERROR: Note: some frames are missing due to last-call optimization.
ERROR: Re-run your program in debug mode (:- debug.) to get more detail.
^ Call: (14) call('$expand':'$set_source_module'(user)) ? abort
% Execution Aborted
?- X = [a, b, c].
X = [a, b, c].