Clang 的 "did you mean ...?" 变量名校正算法是如何工作的?

How does Clang's "did you mean ...?" variable name correction algorithm work?

我正在用 Clang 编译 C++ 代码。 (Apple clang version 12.0.5 (clang-1205.0.22.11)).

Clang 可以在您拼错变量时提供提示:

#include <iostream>

int main() {
    int my_int;
    std::cout << my_it << std::endl;
}
spellcheck-test.cpp:5:18: error: use of undeclared identifier 'my_it'; did you mean 'my_int'?
    std::cout << my_it << std::endl;
                 ^~~~~
                 my_int
spellcheck-test.cpp:4:9: note: 'my_int' declared here
    int my_int;
        ^
1 error generated.

我的问题是:

What is the criterion Clang uses to determine when to suggest another variable?

我的实验表明它非常复杂:

您可能不会找到它的文档,但由于 Clang 是开源的,您可以求助于源代码以尝试弄明白。

Clangd?

特定诊断(来自 DiagnosticSemaKinds.td):

def err_undeclared_var_use_suggest : Error<
  "use of undeclared identifier %0; did you mean %1?">;

仅从 clang-tools-extra/clangd/IncludeFixer.cpp:

引用
  // Try to fix unresolved name caused by missing declaration.
  // E.g.
  //   clang::SourceManager SM;
  //          ~~~~~~~~~~~~~
  //          UnresolvedName
  //   or
  //   namespace clang {  SourceManager SM; }
  //                      ~~~~~~~~~~~~~
  //                      UnresolvedName
  // We only attempt to recover a diagnostic if it has the same location as
  // the last seen unresolved name.
  if (DiagLevel >= DiagnosticsEngine::Error &&
      LastUnresolvedName->Loc == Info.getLocation())
    return fixUnresolvedName();

现在,clangd 是语言服务器,t.b.h。我不知道 Clang 编译器前端是否实际使用它来产生某些诊断,但您可以自由地继续探索,将这些细节联系在一起。上面的fixUnresolvedName最终进行了模糊搜索:

if (llvm::Optional<const SymbolSlab *> Syms = fuzzyFindCached(Req))
    return fixesForSymbols(**Syms);

如果您想深入了解细节,我建议您从 the fuzzyFindCached function:

开始
llvm::Optional<const SymbolSlab *>
IncludeFixer::fuzzyFindCached(const FuzzyFindRequest &Req) const {
  auto ReqStr = llvm::formatv("{0}", toJSON(Req)).str();
  auto I = FuzzyFindCache.find(ReqStr);
  if (I != FuzzyFindCache.end())
    return &I->second;

  if (IndexRequestCount >= IndexRequestLimit)
    return llvm::None;
  IndexRequestCount++;

  SymbolSlab::Builder Matches;
  Index.fuzzyFind(Req, [&](const Symbol &Sym) {
    if (Sym.Name != Req.Query)
      return;
    if (!Sym.IncludeHeaders.empty())
      Matches.insert(Sym);
  });
  auto Syms = std::move(Matches).build();
  auto E = FuzzyFindCache.try_emplace(ReqStr, std::move(Syms));
  return &E.first->second;
}

连同其单一函数参数的类型,FuzzyFindRequest in clang/index/Index.h:

struct FuzzyFindRequest {
  /// A query string for the fuzzy find. This is matched against symbols'
  /// un-qualified identifiers and should not contain qualifiers like "::".
  std::string Query;
  /// If this is non-empty, symbols must be in at least one of the scopes
  /// (e.g. namespaces) excluding nested scopes. For example, if a scope "xyz::"
  /// is provided, the matched symbols must be defined in namespace xyz but not
  /// namespace xyz::abc.
  ///
  /// The global scope is "", a top level scope is "foo::", etc.
  std::vector<std::string> Scopes;
  /// If set to true, allow symbols from any scope. Scopes explicitly listed
  /// above will be ranked higher.
  bool AnyScope = false;
  /// The number of top candidates to return. The index may choose to
  /// return more than this, e.g. if it doesn't know which candidates are best.
  llvm::Optional<uint32_t> Limit;
  /// If set to true, only symbols for completion support will be considered.
  bool RestrictForCodeCompletion = false;
  /// Contextually relevant files (e.g. the file we're code-completing in).
  /// Paths should be absolute.
  std::vector<std::string> ProximityPaths;
  /// Preferred types of symbols. These are raw representation of `OpaqueType`.
  std::vector<std::string> PreferredTypes;

  bool operator==(const FuzzyFindRequest &Req) const {
    return std::tie(Query, Scopes, Limit, RestrictForCodeCompletion,
                    ProximityPaths, PreferredTypes) ==
           std::tie(Req.Query, Req.Scopes, Req.Limit,
                    Req.RestrictForCodeCompletion, Req.ProximityPaths,
                    Req.PreferredTypes);
  }
  bool operator!=(const FuzzyFindRequest &Req) const { return !(*this == Req); }
};

其他兔子洞?

以下提交可能是另一条起点:

[Frontend] Allow attaching an external sema source to compiler instance and extra diags to TypoCorrections

This can be used to append alternative typo corrections to an existing diag. include-fixer can use it to suggest includes to be added.

Differential Revision: https://reviews.llvm.org/D26745

从中我们可能会得到 clang/include/clang/Sema/TypoCorrection.h,这听起来像是编译器前端比(clang 额外工具)clangd 更合理地使用的功能。例如:

  /// Gets the "edit distance" of the typo correction from the typo.
  /// If Normalized is true, scale the distance down by the CharDistanceWeight
  /// to return the edit distance in terms of single-character edits.
  unsigned getEditDistance(bool Normalized = true) const {
    if (CharDistance > MaximumDistance || QualifierDistance > MaximumDistance ||
        CallbackDistance > MaximumDistance)
      return InvalidDistance;
    unsigned ED =
        CharDistance * CharDistanceWeight +
        QualifierDistance * QualifierDistanceWeight +
        CallbackDistance * CallbackDistanceWeight;
    if (ED > MaximumDistance)
      return InvalidDistance;
    // Half the CharDistanceWeight is added to ED to simulate rounding since
    // integer division truncates the value (i.e. round-to-nearest-int instead
    // of round-to-zero).
    return Normalized ? NormalizeEditDistance(ED) : ED;
  }

用于clang/lib/Sema/SemaDecl.cpp

// Callback to only accept typo corrections that have a non-zero edit distance.
// Also only accept corrections that have the same parent decl.
class DifferentNameValidatorCCC final : public CorrectionCandidateCallback {
 public:
  DifferentNameValidatorCCC(ASTContext &Context, FunctionDecl *TypoFD,
                            CXXRecordDecl *Parent)
      : Context(Context), OriginalFD(TypoFD),
        ExpectedParent(Parent ? Parent->getCanonicalDecl() : nullptr) {}

  bool ValidateCandidate(const TypoCorrection &candidate) override {
    if (candidate.getEditDistance() == 0)
      return false;
     // ...
  }
  // ...
};

我建议查看 Chris Lattner 的这个 10 年前的 blog,了解 Clang 错误恢复机制的一般概念。

在 Clang 的 Spell Checker 上,他写道:

One of the more visible things that Clang includes is a spell checker (also on reddit). The spell checker kicks in when you use an identifier that Clang doesn't know: it checks against other close identifiers and suggests what you probably meant.

...

Clang uses the well known Levenshtein distance function to compute the best match out of the possible candidates.