How exactly does R parse `->`, the right-assignment operator?
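So this is kind of a trivial question, but it's bugging me that I can't answer it, and perhaps the answer will teach me some more details about how R works.
The title says it all: how does R parse `->`, the obscure right-side assignment function?
My usual tricks to get to the bottom of this failed: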
`->`
Error: object '->' not found
getAnywhere("->")
no object named '->' was found
And we can't call it directly:
`->`(3,x)
Error: could not find function "->"
But of course, it works:
(3 -> x) #assigns the value 3 to the name x
# [1] 3
It appears R knows how to simply reverse the arguments, but I thought the approaches above would surely have cracked the case:
pryr::ast(3 -> y)
# \- ()
#   \- `<-    #R interpreter clearly flipped things around
#   \- `y     #  (by the time it gets to `ast`, at least...)
#   \-  3     #  (note: this is because `substitute(3 -> y)`
#             #   already returns the reversed version)
Compare this to the regular assignment operator:
`<-`
.Primitive("<-")
`<-`(x, 3) #assigns the value 3 to the name x, as expected
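`?"->"`, `?assignOps`, and the R Language Definition all simply mention it in passing as the right assignment operator.
But there's clearly something unique about how `->` is used. It's not a function/operator (as the call to `getAnywhere` and the direct call to `->` seem to demonstrate), so what is it? Is it completely in a class of its own?
Is there anything to learn from this besides "`->` is completely unique within the R language in how it's interpreted and handled; memorize and move on"?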
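Let me preface this by saying I know absolutely nothing about how parsers work. Having said that, line 296 of gram.y defines the following tokens to represent assignment in the (YACC?) parser R uses: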
%token LEFT_ASSIGN EQ_ASSIGN RIGHT_ASSIGN LBB
Then, on lines 5140 through 5150 of gram.c, this looks like the corresponding C code:
case '-':
    if (nextchar('>')) {
        if (nextchar('>')) {
            yylval = install_and_save2("<<-", "->>");
            return RIGHT_ASSIGN;
        }
        else {
            yylval = install_and_save2("<-", "->");
            return RIGHT_ASSIGN;
        }
    }
Finally, starting on line 5044 of gram.c, the definition of `install_and_save2`:
/* Get an R symbol, and set different yytext.  Used for translation of -> to <-. ->> to <<- */
static SEXP install_and_save2(char * text, char * savetext)
{
    strcpy(yytext, savetext);
    return install(text);
}
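So again, having zero experience working with parsers, it seems that `->` and `->>` are translated directly into `<-` and `<<-`, respectively, at a very low level in the interpretation process.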
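To make the translation step concrete, here is a minimal C sketch of the idea. It is not R's actual lexer: it assumes a toy `struct token` that keeps both the symbol handed to later stages and the text that was typed, and the names `struct token`, `save_and_translate`, and `lex_operator` are invented for this illustration.

```c
#include <string.h>

/* Hypothetical token kinds, loosely mirroring the names in gram.y. */
enum token_kind { TOK_LEFT_ASSIGN, TOK_RIGHT_ASSIGN, TOK_OTHER };

/* A toy token: `symbol` is what later stages see, `text` is what was typed. */
struct token {
    enum token_kind kind;
    const char *symbol;
    char text[8];
};

/* Toy counterpart of install_and_save2(): remember the typed text,
   but hand back a (possibly different) symbol name. */
static const char *save_and_translate(struct token *tok,
                                      const char *symbol,
                                      const char *typed)
{
    strncpy(tok->text, typed, sizeof tok->text - 1);
    tok->text[sizeof tok->text - 1] = '\0';
    tok->symbol = symbol;
    return symbol;
}

/* Classify one operator string in the spirit of the lexer excerpt above. */
static struct token lex_operator(const char *src)
{
    struct token tok = { TOK_OTHER, src, "" };

    if (strcmp(src, "->>") == 0) {
        /* Typed "->>", but the symbol passed on is "<<-". */
        save_and_translate(&tok, "<<-", "->>");
        tok.kind = TOK_RIGHT_ASSIGN;
    } else if (strcmp(src, "->") == 0) {
        /* Typed "->", but the symbol passed on is "<-". */
        save_and_translate(&tok, "<-", "->");
        tok.kind = TOK_RIGHT_ASSIGN;
    } else if (strcmp(src, "<<-") == 0 || strcmp(src, "<-") == 0) {
        save_and_translate(&tok, src, src);
        tok.kind = TOK_LEFT_ASSIGN;
    } else {
        /* Anything else passes through untranslated in this toy. */
        save_and_translate(&tok, src, src);
    }
    return tok;
}
```

Under these assumptions, `lex_operator("->")` yields a token of kind `TOK_RIGHT_ASSIGN` whose `symbol` is `<-` while its `text` is still `->`, which mirrors what `install_and_save2("<-", "->")` appears to do.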
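You brought up a very good point in asking how the parser "knows" to reverse the arguments to `->`, considering that `->` appears to be installed into the R symbol table as `<-`, and is thus able to correctly interpret `x -> y` as `y <- x` and not `x <- y`. The best I can do is provide further speculation as I continue to come across "evidence" to support my claims. Hopefully some merciful YACC expert will stumble on this question and provide a little insight; I'm not going to hold my breath on that, though.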
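Back to lines 383 and 384 of gram.y, this looks like some more parsing logic related to the aforementioned `LEFT_ASSIGN` and `RIGHT_ASSIGN` symbols:
|   expr LEFT_ASSIGN expr       { $$ = xxbinary($2,$1,$3);  setId( $$, @$); }
|   expr RIGHT_ASSIGN expr      { $$ = xxbinary($2,$3,$1);  setId( $$, @$); }
Although I can't really make heads or tails of this crazy syntax, I did notice that the second and third arguments to `xxbinary` are swapped between `LEFT_ASSIGN` (`xxbinary($2,$1,$3)`) and `RIGHT_ASSIGN` (`xxbinary($2,$3,$1)`).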
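Here's what I'm picturing in my head:
`LEFT_ASSIGN` scenario: `y <- x`
`$2` is the second "argument" to the parser in the above expression, i.e. `<-`
`$1` is the first, namely `y`
`$3` is the third, `x`
Therefore, the resulting (C?) call would be `xxbinary(<-, y, x)`.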
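Applying this logic to `RIGHT_ASSIGN`, i.e. `x -> y`, combined with my earlier conjecture about `<-` and `->` getting swapped:
`$2` gets translated from `->` to `<-`
`$1` is `x`
`$3` is `y`
But since the rule calls `xxbinary($2,$3,$1)` instead of `xxbinary($2,$1,$3)`, the result is still `xxbinary(<-, y, x)`.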
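The argument swap can be sketched in a few lines of C. This is only an illustration of the reasoning above, not code from R: `struct call3`, `make_binary`, `reduce_left_assign`, `reduce_right_assign`, and `same_call` are hypothetical names, and plain strings stand in for R's `SEXP` values.

```c
#include <string.h>

/* A toy three-element call: function name plus two arguments.
   This stands in for the object that xxbinary() would build. */
struct call3 {
    const char *fn;
    const char *arg1;
    const char *arg2;
};

/* Toy counterpart of xxbinary(n1, n2, n3): it just packs its arguments. */
static struct call3 make_binary(const char *n1, const char *n2, const char *n3)
{
    struct call3 call = { n1, n2, n3 };
    return call;
}

/* Rule `expr LEFT_ASSIGN expr`: operands keep their order ($2, $1, $3). */
static struct call3 reduce_left_assign(const char *first, const char *op,
                                       const char *third)
{
    return make_binary(op, first, third);
}

/* Rule `expr RIGHT_ASSIGN expr`: operands are swapped ($2, $3, $1).
   `op` is assumed to have been translated to "<-" by the lexer already. */
static struct call3 reduce_right_assign(const char *first, const char *op,
                                        const char *third)
{
    return make_binary(op, third, first);
}

/* Compare two toy calls field by field; returns 1 when they match. */
static int same_call(struct call3 a, struct call3 b)
{
    return strcmp(a.fn, b.fn) == 0 &&
           strcmp(a.arg1, b.arg1) == 0 &&
           strcmp(a.arg2, b.arg2) == 0;
}
```

With these definitions, `reduce_left_assign("y", "<-", "x")` and `reduce_right_assign("x", "<-", "y")` produce the same toy call, which is the point of the swapped `$3` and `$1` in the grammar rule.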
Building on this a little further, we have the definition of `xxbinary` on line 3310 of gram.c:
static SEXP xxbinary(SEXP n1, SEXP n2, SEXP n3)
{
    SEXP ans;
    if (GenerateCode)
        PROTECT(ans = lang3(n1, n2, n3));
    else
        PROTECT(ans = R_NilValue);
    UNPROTECT_PTR(n2);
    UNPROTECT_PTR(n3);
    return ans;
}
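Unfortunately I could not find a proper definition of `lang3` (or its variants `lang1`, `lang2`, etc.) in the R source code. Judging by how it is used here, though, it does not evaluate anything: it appears to build a three-element call object (the function symbol followed by its two arguments), which the evaluator processes later.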
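Update
I'll try to address some of your additional questions in the comments as best I can, given my (very) limited knowledge of the parsing process.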
1) Is this really the only object in R that behaves like this? (I've got in mind the John Chambers quote via Hadley's book: "Everything that exists is an object. Everything that happens is a function call." This clearly lies outside that domain -- is there anything else like this?)
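First and foremost, I agree that this lies outside of that domain. I believe Chambers' quote concerns the R Environment, i.e. processes that all take place after this low-level parsing phase. I'll touch on this a little more below, however. Anyway, the only other example of this sort of behavior I could find is the `**` operator, which is a synonym for the more common exponentiation operator `^`. As with right assignment, `**` doesn't seem to be "recognized" as a function call, etc. by the interpreter: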
R> `->`
#Error: object '->' not found
R> `**`
#Error: object '**' not found
I found this because it's the only other case where `install_and_save2` is used by the C parser:
case '*':
    /* Replace ** by ^.  This has been here since 1998, but is
       undocumented (at least in the obvious places).  It is in
       the index of the Blue Book with a reference to p. 431, the
       help for 'Deprecated'.  S-PLUS 6.2 still allowed this, so
       presumably it was for compatibility with S. */
    if (nextchar('*')) {
        yylval = install_and_save2("^", "**");
        return '^';
    } else
        yylval = install_and_save("*");
    return c;
2) When exactly does this happen? I've got in mind that `substitute(3 -> y)` has already flipped the expression; I couldn't figure out from the source what `substitute` does that would have pinged the YACC...
Of course I'm still speculating here, but yes, I think we can safely assume that when you call `substitute(3 -> y)`, from the perspective of the `substitute` function, the expression always was `y <- 3`; i.e. the function is completely unaware that you typed `3 -> y`. `do_substitute`, like 99% of the C functions used by R, only handles `SEXP` arguments: a language object (`LANGSXP`) in the case of `3 -> y` (== `y <- 3`), I believe. This is what I was alluding to above when I made a distinction between the R Environment and the parsing process. I don't think there is anything that specifically triggers the parser to spring into action; rather, everything you input into the interpreter gets parsed.
I did a little more reading about the YACC / Bison parser generator last night, and as I understand it (a.k.a. don't bet the farm on this), Bison uses the grammar you define (in the .y file(s)) to generate a parser in C, i.e. a C function which does the actual parsing of input. In turn, everything you input in an R session is first processed by this C parsing function, which then delegates the appropriate action to be taken in the R Environment (I'm using this term very loosely, by the way). During this phase, `lhs -> rhs` will get translated to `rhs <- lhs`, `**` to `^`, etc. For example, this is an excerpt from one of the tables of primitive functions in names.c:
/* Language Related Constructs */
/* Primitives */
{"if", do_if, 0, 200, -1, {PP_IF, PREC_FN, 1}},
{"while", do_while, 0, 100, 2, {PP_WHILE, PREC_FN, 0}},
{"for", do_for, 0, 100, 3, {PP_FOR, PREC_FN, 0}},
{"repeat", do_repeat, 0, 100, 1, {PP_REPEAT, PREC_FN, 0}},
{"break", do_break, CTXT_BREAK, 0, 0, {PP_BREAK, PREC_FN, 0}},
{"next", do_break, CTXT_NEXT, 0, 0, {PP_NEXT, PREC_FN, 0}},
{"return", do_return, 0, 0, -1, {PP_RETURN, PREC_FN, 0}},
{"function", do_function, 0, 0, -1, {PP_FUNCTION,PREC_FN, 0}},
{"<-", do_set, 1, 100, -1, {PP_ASSIGN, PREC_LEFT, 1}},
{"=", do_set, 3, 100, -1, {PP_ASSIGN, PREC_EQ, 1}},
{"<<-", do_set, 2, 100, -1, {PP_ASSIGN2, PREC_LEFT, 1}},
{"{", do_begin, 0, 200, -1, {PP_CURLY, PREC_FN, 0}},
{"(", do_paren, 0, 1, 1, {PP_PAREN, PREC_FN, 0}},
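You'll notice that `->`, `->>`, and `**` are not defined here. As far as I know, R primitive expressions such as `<-` and `[` are the closest interaction the R Environment ever has with any underlying C code. What I am suggesting is that by this stage in the process (from you typing a set of characters into the interpreter and hitting 'Enter', up through the actual evaluation of a valid R expression), the parser has already worked its magic, which is why you can't get a function definition for `->` or `**` by surrounding them with backticks, as you typically can.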
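One way to picture why the backtick lookup fails is a table lookup that simply has no entry for the right-hand forms. The sketch below is a toy under loud assumptions: the table is drastically simplified to a name plus the numeric code from the third column of the excerpt, and `struct prim_entry`, `toy_table`, and `find_primitive` are hypothetical names that are not part of R.

```c
#include <stddef.h>
#include <string.h>

/* A toy function-table entry; the real table in names.c has more fields. */
struct prim_entry {
    const char *name;
    int code;
};

/* Only the assignment forms from the excerpt are registered.
   There is deliberately no "->", "->>", or "**" entry. */
static const struct prim_entry toy_table[] = {
    { "<-",  1 },
    { "=",   3 },
    { "<<-", 2 },
};

/* Return the entry registered under `name`, or NULL if there is none. */
static const struct prim_entry *find_primitive(const char *name)
{
    size_t n = sizeof toy_table / sizeof toy_table[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(toy_table[i].name, name) == 0)
            return &toy_table[i];
    }
    return NULL;
}
```

In this toy, `find_primitive("<-")` finds an entry while `find_primitive("->")` returns `NULL`: nothing is ever registered under the right-hand spelling, because by the time anything is looked up, the parser has already rewritten it.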