这个递归 CodeQL 谓词是如何求值的?
How is this recursive CodeQL predicate is evaluated?
我正在尝试学习 CodeQL,我对某些 CodeQL 代码的评估方式有些困惑。我希望有人能帮我做一个更简单的解释。
取下面的CodeQL代码:
string getANeighbor(string country){
country = "France" and result = "Belgium"
or
country = "France" and result = "Germany"
or
country = "Germany" and result = "Austria"
or
country = "Germany" and result = "Belgium"
or
country = getANeighbor(result)
}
select getANeighbor("Germany")
我得到的结果是:
France
Austria
Belgium
我明白为什么比利时和奥地利被退回了。但是,我对 CodeQL 如何确定结果返回法国感到困惑。我的命令式编程直觉告诉我,要返回法国,我需要一个额外的行,看起来像 country = "Germany" and result = "France",但是如果没有那行代码,我真的很困惑如何返回法国。
此外,这条线是如何工作的?:
country = getANeighbor(result)
通过 CodeQL 手册中给出的一些简单示例,它们使“result”关键字看起来几乎像其他语言中的 'return'语言。我觉得我对“result”的作用及其工作原理存在根本性的误解。谷歌搜索后我似乎找不到很好的解释。
提前致谢!
我在另一个网站上找到了一位用户的问题,他很友好地回答了我的问题。
来源 (https://github.com/github/securitylab/discussions/85):
答案(提供者:Semmle 的 Pavel Avgustinov):
This is a great question. I'll give the full details on recursive
predicates below, but let's first deal with result
.
Also, how does this line work exactly?:
country = getANeighbor(result)
With some of the simple examples that are given in the CodeQL handbook, they make it seem like that the 'result' keyword almost acts
like 'return' in other languages. I feel that I have a fundamental
misunderstanding of what 'result' does and how it works. I can't seem
to find a good explanation after googling.
You're right to observe that in many examples result = f()
in CodeQL
seems to act like return f()
in other languages. However, a return
statement (with its usual meaning of actually interrupting the
evaluation of a function and returning to the caller) does not make
sense in a logic language, where it's up to the CodeQL evaluator how
and in what order to evaluate the different parts of a predicate you
define. Instead, when a predicate declares a return type (string
in
your example), this implicitly declares the special variable result
in its body, whose type is the declared return type of the predicate.
You can use this variable just like any other, in particular to
constrain it as part of the predicate.
Note that this
is quite similar to how you implicitly get a special
variable called this in a member predicate (or method, in traditional
languages).
Other than its implicit declaration, there is nothing special about
the result
variable. In particular, logically you could write the
getANeighbor
predicate like this, without its use:
predicate getANeighbor(string country, string neighbor) {
country = "France" and neighbor = "Belgium"
or
country = "France" and neighbor = "Germany"
or
country = "Germany" and neighbor = "Austria"
or
country = "Germany" and neighbor = "Belgium"
or
getANeighbor(neighbor, country)
}
You do have to invoke the predicate differently if you declare it
without a result type: getANeighbor(neighbor, country)
rather than
country = getANeighbor(result)
. Other than this syntactic detail,
the two are entirely equivalent.
If you get used to the idea that result
is just an implicitly
declared variable, not some special way of signaling the returned
value of a predicate, then code like country = getANeighbor(result)
(or more complex cases, where result
is used multiple times to place
multiple conditions on it) will make more sense. Contrast it to the
result
-free version I gave above, which explicitly flips the order
of arguments in the recursive call to getANeighbor
: it's exactly the
same thing, we're invoking getANeighbor
with "flipped" arguments,
namely result
as the first parameter, and country
equated to the
"returned" value.
As a final observation, you can of course get something more similar
to the return
-like style of many examples by introducing an
intermediate variable: Instead of country = getANeighbor(result)
, we
can write exists(string neighbor | country = getANeighbor(neighbor) and result = neighbor)
, but you can see how this ends up being more
verbose.
The general intuition behind recursion in CodeQL is given here,
but at a high level you can think of each recursive call as
representing the "current" set of value tuples in the called
predicate, and all predicate definitions keep being evaluated until
there is no change. Let's see how this works in your example
predicate:
string getANeighbor(string country)
{
country = "France" and result = "Belgium"
or
country = "France" and result = "Germany"
or
country = "Germany" and result = "Austria"
or
country = "Germany" and result = "Belgium"
or
country = getANeighbor(result)
}
The first time we try to evaluate getANeighbour
, we do not know
anything about it, and so we treat a call to it as a call to the empty
set of values. This means we deduce the following values for it:
+-----------+----------+---------+
| iteration | country | result |
+-----------+----------+---------+
| 1 | France | Belgium |
| 1 | France | Germany |
| 1 | Germany | Austria |
| 1 | Germany | Belgium |
+-----------+----------+---------+
However, after deducing these values, we know that there was a
recursive call to getANeighbor
, and that the values we have for it
changed. Thus, we need to re-evaluate the definition of
getANeighbor
, now using our expanded set of values for the recursive
call. This means that we deduce the following:
+-----------+----------+---------+
| iteration | country | result |
+-----------+----------+---------+
| 2 | France | Belgium |
| 2 | France | Germany |
| 2 | Germany | Austria |
| 2 | Germany | Belgium |
| | | |
| 2 | Belgium | France |
| 2 | Germany | France |
| 2 | Austria | Germany |
| 2 | Belgium | Germany |
+-----------+----------+---------+
Note that we re-deduced the first four lines, which do not depend on a
recursive call, just like we did in iteration 1. This is fine -- we
already knew that "Belgium" = getANeighbor("France")
was true, and
thus we can simply ignore the fact that we deduced it in another way.
However, the last four lines are new; they arise from the recursive
call.
As before, we see that iteration 2 of evaluating getANeighbor
depended on a recursive predicate for which we have deduced new
values, and so we need to go around again. In the third iteration, we
deduce:
- (a) the first four lines of iteration 2 (and iteration 1), just like before, from the non-recursive parts of the predicate definition;
- (b) the last four lines of iteration 2, in the same way as for iteration 2 itself; and
- (c) all 8 lines from iteration 2, just by applying the recursive case to the results of iteration 2. However, we didn't deduce any
new values, and so we know that at this point we can stop evaluation of
getANeighbor
. We say that it has "converged" (meaning "stopped
changing"), and for any other references to it, we are going to use
the final set of values which we computed.
This is why in your query, France
is reported as a neighbor of
Germany
.
[I've omitted some details of how to make such a recursive evaluation
efficient -- in practice there are some ways to avoid re-deducing the
same values again and again on each recursive iteration --, because
they are not necessary to understand how recursion works in the first
place.]
再次感谢来自 Semmle 的 Pavel Avgustinov 的详细回复!
我正在尝试学习 CodeQL,我对某些 CodeQL 代码的评估方式有些困惑。我希望有人能帮我做一个更简单的解释。
取下面的CodeQL代码:
string getANeighbor(string country){
country = "France" and result = "Belgium"
or
country = "France" and result = "Germany"
or
country = "Germany" and result = "Austria"
or
country = "Germany" and result = "Belgium"
or
country = getANeighbor(result)
}
select getANeighbor("Germany")
我得到的结果是:
France
Austria
Belgium
我明白为什么比利时和奥地利被退回了。但是,我对 CodeQL 如何确定结果返回法国感到困惑。我的命令式编程直觉告诉我,要返回法国,我需要一个额外的行,看起来像 country = "Germany" and result = "France",但是如果没有那行代码,我真的很困惑如何返回法国。
此外,这条线是如何工作的?:
country = getANeighbor(result)
通过 CodeQL 手册中给出的一些简单示例,它们使“result”关键字看起来几乎像其他语言中的 'return'语言。我觉得我对“result”的作用及其工作原理存在根本性的误解。谷歌搜索后我似乎找不到很好的解释。
提前致谢!
我在另一个网站上找到了一位用户的问题,他很友好地回答了我的问题。 来源 (https://github.com/github/securitylab/discussions/85):
答案(提供者:Semmle 的 Pavel Avgustinov):
This is a great question. I'll give the full details on recursive predicates below, but let's first deal with
result
.
Also, how does this line work exactly?:
country = getANeighbor(result)
With some of the simple examples that are given in the CodeQL handbook, they make it seem like that the 'result' keyword almost acts like 'return' in other languages. I feel that I have a fundamental misunderstanding of what 'result' does and how it works. I can't seem to find a good explanation after googling.
You're right to observe that in many examples
result = f()
in CodeQL seems to act likereturn f()
in other languages. However, areturn
statement (with its usual meaning of actually interrupting the evaluation of a function and returning to the caller) does not make sense in a logic language, where it's up to the CodeQL evaluator how and in what order to evaluate the different parts of a predicate you define. Instead, when a predicate declares a return type (string
in your example), this implicitly declares the special variableresult
in its body, whose type is the declared return type of the predicate. You can use this variable just like any other, in particular to constrain it as part of the predicate.Note that
this
is quite similar to how you implicitly get a special variable called this in a member predicate (or method, in traditional languages).Other than its implicit declaration, there is nothing special about the
result
variable. In particular, logically you could write thegetANeighbor
predicate like this, without its use:predicate getANeighbor(string country, string neighbor) { country = "France" and neighbor = "Belgium" or country = "France" and neighbor = "Germany" or country = "Germany" and neighbor = "Austria" or country = "Germany" and neighbor = "Belgium" or getANeighbor(neighbor, country) }
You do have to invoke the predicate differently if you declare it without a result type:
getANeighbor(neighbor, country)
rather thancountry = getANeighbor(result)
. Other than this syntactic detail, the two are entirely equivalent.If you get used to the idea that
result
is just an implicitly declared variable, not some special way of signaling the returned value of a predicate, then code likecountry = getANeighbor(result)
(or more complex cases, whereresult
is used multiple times to place multiple conditions on it) will make more sense. Contrast it to theresult
-free version I gave above, which explicitly flips the order of arguments in the recursive call togetANeighbor
: it's exactly the same thing, we're invokinggetANeighbor
with "flipped" arguments, namelyresult
as the first parameter, andcountry
equated to the "returned" value.As a final observation, you can of course get something more similar to the
return
-like style of many examples by introducing an intermediate variable: Instead ofcountry = getANeighbor(result)
, we can writeexists(string neighbor | country = getANeighbor(neighbor) and result = neighbor)
, but you can see how this ends up being more verbose.
The general intuition behind recursion in CodeQL is given here, but at a high level you can think of each recursive call as representing the "current" set of value tuples in the called predicate, and all predicate definitions keep being evaluated until there is no change. Let's see how this works in your example predicate:
string getANeighbor(string country) { country = "France" and result = "Belgium" or country = "France" and result = "Germany" or country = "Germany" and result = "Austria" or country = "Germany" and result = "Belgium" or country = getANeighbor(result) }
The first time we try to evaluate
getANeighbour
, we do not know anything about it, and so we treat a call to it as a call to the empty set of values. This means we deduce the following values for it:+-----------+----------+---------+ | iteration | country | result | +-----------+----------+---------+ | 1 | France | Belgium | | 1 | France | Germany | | 1 | Germany | Austria | | 1 | Germany | Belgium | +-----------+----------+---------+
However, after deducing these values, we know that there was a recursive call to
getANeighbor
, and that the values we have for it changed. Thus, we need to re-evaluate the definition ofgetANeighbor
, now using our expanded set of values for the recursive call. This means that we deduce the following:+-----------+----------+---------+ | iteration | country | result | +-----------+----------+---------+ | 2 | France | Belgium | | 2 | France | Germany | | 2 | Germany | Austria | | 2 | Germany | Belgium | | | | | | 2 | Belgium | France | | 2 | Germany | France | | 2 | Austria | Germany | | 2 | Belgium | Germany | +-----------+----------+---------+
Note that we re-deduced the first four lines, which do not depend on a recursive call, just like we did in iteration 1. This is fine -- we already knew that
"Belgium" = getANeighbor("France")
was true, and thus we can simply ignore the fact that we deduced it in another way. However, the last four lines are new; they arise from the recursive call.As before, we see that iteration 2 of evaluating
getANeighbor
depended on a recursive predicate for which we have deduced new values, and so we need to go around again. In the third iteration, we deduce:
- (a) the first four lines of iteration 2 (and iteration 1), just like before, from the non-recursive parts of the predicate definition;
- (b) the last four lines of iteration 2, in the same way as for iteration 2 itself; and
- (c) all 8 lines from iteration 2, just by applying the recursive case to the results of iteration 2. However, we didn't deduce any new values, and so we know that at this point we can stop evaluation of
getANeighbor
. We say that it has "converged" (meaning "stopped changing"), and for any other references to it, we are going to use the final set of values which we computed.This is why in your query,
France
is reported as a neighbor ofGermany
.[I've omitted some details of how to make such a recursive evaluation efficient -- in practice there are some ways to avoid re-deducing the same values again and again on each recursive iteration --, because they are not necessary to understand how recursion works in the first place.]
再次感谢来自 Semmle 的 Pavel Avgustinov 的详细回复!