这个递归 CodeQL 谓词是如何求值的?

How is this recursive CodeQL predicate is evaluated?

我正在尝试学习 CodeQL,我对某些 CodeQL 代码的评估方式有些困惑。我希望有人能帮我做一个更简单的解释。

取下面的CodeQL代码:

string getANeighbor(string country){
  country = "France" and result = "Belgium"
  or
  country = "France" and result = "Germany"
  or
  country = "Germany" and result = "Austria"
  or
  country = "Germany" and result = "Belgium"
  or
  country = getANeighbor(result)
}

select getANeighbor("Germany")

我得到的结果是:

France
Austria
Belgium

我明白为什么比利时和奥地利被退回了。但是,我对 CodeQL 如何确定结果返回法国感到困惑。我的命令式编程直觉告诉我,要返回法国,我需要一个额外的行,看起来像 country = "Germany" and result = "France",但是如果没有那行代码,我真的很困惑如何返回法国。

此外,这条线是如何工作的?:

country = getANeighbor(result)

通过 CodeQL 手册中给出的一些简单示例,它们使“result”关键字看起来几乎像其他语言中的 'return'语言。我觉得我对“result”的作用及其工作原理存在根本性的误解。谷歌搜索后我似乎找不到很好的解释。

提前致谢!

我在另一个网站上找到了一位用户的问题,他很友好地回答了我的问题。 来源 (https://github.com/github/securitylab/discussions/85):

答案(提供者:Semmle 的 Pavel Avgustinov):

This is a great question. I'll give the full details on recursive predicates below, but let's first deal with result.


Also, how does this line work exactly?:

country = getANeighbor(result)

With some of the simple examples that are given in the CodeQL handbook, they make it seem like that the 'result' keyword almost acts like 'return' in other languages. I feel that I have a fundamental misunderstanding of what 'result' does and how it works. I can't seem to find a good explanation after googling.


You're right to observe that in many examples result = f() in CodeQL seems to act like return f() in other languages. However, a return statement (with its usual meaning of actually interrupting the evaluation of a function and returning to the caller) does not make sense in a logic language, where it's up to the CodeQL evaluator how and in what order to evaluate the different parts of a predicate you define. Instead, when a predicate declares a return type (string in your example), this implicitly declares the special variable result in its body, whose type is the declared return type of the predicate. You can use this variable just like any other, in particular to constrain it as part of the predicate.

Note that this is quite similar to how you implicitly get a special variable called this in a member predicate (or method, in traditional languages).

Other than its implicit declaration, there is nothing special about the result variable. In particular, logically you could write the getANeighbor predicate like this, without its use:

predicate getANeighbor(string country, string neighbor) {   
country = "France" and neighbor = "Belgium"
or
country = "France" and neighbor = "Germany" 
or
country = "Germany" and neighbor = "Austria" 
or 
country = "Germany" and neighbor = "Belgium" 
or  
getANeighbor(neighbor, country)
}

You do have to invoke the predicate differently if you declare it without a result type: getANeighbor(neighbor, country) rather than country = getANeighbor(result). Other than this syntactic detail, the two are entirely equivalent.

If you get used to the idea that result is just an implicitly declared variable, not some special way of signaling the returned value of a predicate, then code like country = getANeighbor(result) (or more complex cases, where result is used multiple times to place multiple conditions on it) will make more sense. Contrast it to the result-free version I gave above, which explicitly flips the order of arguments in the recursive call to getANeighbor: it's exactly the same thing, we're invoking getANeighbor with "flipped" arguments, namely result as the first parameter, and country equated to the "returned" value.

As a final observation, you can of course get something more similar to the return-like style of many examples by introducing an intermediate variable: Instead of country = getANeighbor(result), we can write exists(string neighbor | country = getANeighbor(neighbor) and result = neighbor), but you can see how this ends up being more verbose.


The general intuition behind recursion in CodeQL is given here, but at a high level you can think of each recursive call as representing the "current" set of value tuples in the called predicate, and all predicate definitions keep being evaluated until there is no change. Let's see how this works in your example predicate:

string getANeighbor(string country)
{   
country = "France" and result = "Belgium"   
or
country = "France" and result = "Germany"
or
country = "Germany" and result = "Austria"
or
country = "Germany" and result = "Belgium"
or
country = getANeighbor(result) 
}

The first time we try to evaluate getANeighbour, we do not know anything about it, and so we treat a call to it as a call to the empty set of values. This means we deduce the following values for it:

+-----------+----------+---------+
| iteration | country  | result  |
+-----------+----------+---------+
|    1      | France   | Belgium |
|    1      | France   | Germany |
|    1      | Germany  | Austria |
|    1      | Germany  | Belgium |
+-----------+----------+---------+

However, after deducing these values, we know that there was a recursive call to getANeighbor, and that the values we have for it changed. Thus, we need to re-evaluate the definition of getANeighbor, now using our expanded set of values for the recursive call. This means that we deduce the following:

+-----------+----------+---------+
| iteration | country  | result  |
+-----------+----------+---------+
|    2      | France   | Belgium |
|    2      | France   | Germany |
|    2      | Germany  | Austria |
|    2      | Germany  | Belgium |
|           |          |         |
|    2      | Belgium  | France  |
|    2      | Germany  | France  |
|    2      | Austria  | Germany |
|    2      | Belgium  | Germany |
+-----------+----------+---------+

Note that we re-deduced the first four lines, which do not depend on a recursive call, just like we did in iteration 1. This is fine -- we already knew that "Belgium" = getANeighbor("France") was true, and thus we can simply ignore the fact that we deduced it in another way. However, the last four lines are new; they arise from the recursive call.

As before, we see that iteration 2 of evaluating getANeighbor depended on a recursive predicate for which we have deduced new values, and so we need to go around again. In the third iteration, we deduce:

  • (a) the first four lines of iteration 2 (and iteration 1), just like before, from the non-recursive parts of the predicate definition;
  • (b) the last four lines of iteration 2, in the same way as for iteration 2 itself; and
  • (c) all 8 lines from iteration 2, just by applying the recursive case to the results of iteration 2. However, we didn't deduce any new values, and so we know that at this point we can stop evaluation of getANeighbor. We say that it has "converged" (meaning "stopped changing"), and for any other references to it, we are going to use the final set of values which we computed.

This is why in your query, France is reported as a neighbor of Germany.

[I've omitted some details of how to make such a recursive evaluation efficient -- in practice there are some ways to avoid re-deducing the same values again and again on each recursive iteration --, because they are not necessary to understand how recursion works in the first place.]

再次感谢来自 Semmle 的 Pavel Avgustinov 的详细回复!