Clang 的 "did you mean ...?" 变量名校正算法是如何工作的？

Question

我正在用 Clang 编译 C++ 代码。 (Apple clang version 12.0.5 (clang-1205.0.22.11)).

Clang 可以在您拼错变量时提供提示：

#include <iostream>

int main() {
    int my_int;
    std::cout << my_it << std::endl;
}

spellcheck-test.cpp:5:18: error: use of undeclared identifier 'my_it'; did you mean 'my_int'?
    std::cout << my_it << std::endl;
                 ^~~~~
                 my_int
spellcheck-test.cpp:4:9: note: 'my_int' declared here
    int my_int;
        ^
1 error generated.

我的问题是：

What is the criterion Clang uses to determine when to suggest another variable?

我的实验表明它非常复杂：

如果您可能指的是另一个类似命名的变量（例如 int my_in;），它不会给出建议
如果建议变量的操作类型错误（例如，尝试打印 my_it.size()）它不会给出建议
是否给出建议取决于对变量名的非平凡比较：它允许删除和插入字符，并且更长的变量名允许更多insertion/deletions被认为是“相似的” ".

Answer 1

您可能不会找到它的文档，但由于 Clang 是开源的，您可以求助于源代码以尝试弄明白。

Clangd?

特定诊断（来自 DiagnosticSemaKinds.td）：

def err_undeclared_var_use_suggest : Error<
  "use of undeclared identifier %0; did you mean %1?">;

仅从 clang-tools-extra/clangd/IncludeFixer.cpp:

引用

  // Try to fix unresolved name caused by missing declaration.
  // E.g.
  //   clang::SourceManager SM;
  //          ~~~~~~~~~~~~~
  //          UnresolvedName
  //   or
  //   namespace clang {  SourceManager SM; }
  //                      ~~~~~~~~~~~~~
  //                      UnresolvedName
  // We only attempt to recover a diagnostic if it has the same location as
  // the last seen unresolved name.
  if (DiagLevel >= DiagnosticsEngine::Error &&
      LastUnresolvedName->Loc == Info.getLocation())
    return fixUnresolvedName();

现在，clangd 是语言服务器，t.b.h。我不知道 Clang 编译器前端是否实际使用它来产生某些诊断，但您可以自由地继续探索，将这些细节联系在一起。上面的fixUnresolvedName最终进行了模糊搜索：

if (llvm::Optional<const SymbolSlab *> Syms = fuzzyFindCached(Req))
    return fixesForSymbols(**Syms);

如果您想深入了解细节，我建议您从 the fuzzyFindCached function:

开始

llvm::Optional<const SymbolSlab *>
IncludeFixer::fuzzyFindCached(const FuzzyFindRequest &Req) const {
  auto ReqStr = llvm::formatv("{0}", toJSON(Req)).str();
  auto I = FuzzyFindCache.find(ReqStr);
  if (I != FuzzyFindCache.end())
    return &I->second;

  if (IndexRequestCount >= IndexRequestLimit)
    return llvm::None;
  IndexRequestCount++;

  SymbolSlab::Builder Matches;
  Index.fuzzyFind(Req, [&](const Symbol &Sym) {
    if (Sym.Name != Req.Query)
      return;
    if (!Sym.IncludeHeaders.empty())
      Matches.insert(Sym);
  });
  auto Syms = std::move(Matches).build();
  auto E = FuzzyFindCache.try_emplace(ReqStr, std::move(Syms));
  return &E.first->second;
}

连同其单一函数参数的类型，FuzzyFindRequest in clang/index/Index.h:

struct FuzzyFindRequest {
  /// A query string for the fuzzy find. This is matched against symbols'
  /// un-qualified identifiers and should not contain qualifiers like "::".
  std::string Query;
  /// If this is non-empty, symbols must be in at least one of the scopes
  /// (e.g. namespaces) excluding nested scopes. For example, if a scope "xyz::"
  /// is provided, the matched symbols must be defined in namespace xyz but not
  /// namespace xyz::abc.
  ///
  /// The global scope is "", a top level scope is "foo::", etc.
  std::vector<std::string> Scopes;
  /// If set to true, allow symbols from any scope. Scopes explicitly listed
  /// above will be ranked higher.
  bool AnyScope = false;
  /// The number of top candidates to return. The index may choose to
  /// return more than this, e.g. if it doesn't know which candidates are best.
  llvm::Optional<uint32_t> Limit;
  /// If set to true, only symbols for completion support will be considered.
  bool RestrictForCodeCompletion = false;
  /// Contextually relevant files (e.g. the file we're code-completing in).
  /// Paths should be absolute.
  std::vector<std::string> ProximityPaths;
  /// Preferred types of symbols. These are raw representation of `OpaqueType`.
  std::vector<std::string> PreferredTypes;

  bool operator==(const FuzzyFindRequest &Req) const {
    return std::tie(Query, Scopes, Limit, RestrictForCodeCompletion,
                    ProximityPaths, PreferredTypes) ==
           std::tie(Req.Query, Req.Scopes, Req.Limit,
                    Req.RestrictForCodeCompletion, Req.ProximityPaths,
                    Req.PreferredTypes);
  }
  bool operator!=(const FuzzyFindRequest &Req) const { return !(*this == Req); }
};

其他兔子洞？

以下提交可能是另一条起点：

[Frontend] Allow attaching an external sema source to compiler instance and extra diags to TypoCorrections

This can be used to append alternative typo corrections to an existing diag. include-fixer can use it to suggest includes to be added.

Differential Revision: https://reviews.llvm.org/D26745

从中我们可能会得到 clang/include/clang/Sema/TypoCorrection.h，这听起来像是编译器前端比（clang 额外工具）clangd 更合理地使用的功能。例如：

  /// Gets the "edit distance" of the typo correction from the typo.
  /// If Normalized is true, scale the distance down by the CharDistanceWeight
  /// to return the edit distance in terms of single-character edits.
  unsigned getEditDistance(bool Normalized = true) const {
    if (CharDistance > MaximumDistance || QualifierDistance > MaximumDistance ||
        CallbackDistance > MaximumDistance)
      return InvalidDistance;
    unsigned ED =
        CharDistance * CharDistanceWeight +
        QualifierDistance * QualifierDistanceWeight +
        CallbackDistance * CallbackDistanceWeight;
    if (ED > MaximumDistance)
      return InvalidDistance;
    // Half the CharDistanceWeight is added to ED to simulate rounding since
    // integer division truncates the value (i.e. round-to-nearest-int instead
    // of round-to-zero).
    return Normalized ? NormalizeEditDistance(ED) : ED;
  }

用于clang/lib/Sema/SemaDecl.cpp：

// Callback to only accept typo corrections that have a non-zero edit distance.
// Also only accept corrections that have the same parent decl.
class DifferentNameValidatorCCC final : public CorrectionCandidateCallback {
 public:
  DifferentNameValidatorCCC(ASTContext &Context, FunctionDecl *TypoFD,
                            CXXRecordDecl *Parent)
      : Context(Context), OriginalFD(TypoFD),
        ExpectedParent(Parent ? Parent->getCanonicalDecl() : nullptr) {}

  bool ValidateCandidate(const TypoCorrection &candidate) override {
    if (candidate.getEditDistance() == 0)
      return false;
     // ...
  }
  // ...
};

Answer 2

我建议查看 Chris Lattner 的这个 10 年前的 blog，了解 Clang 错误恢复机制的一般概念。

在 Clang 的 Spell Checker 上，他写道：

One of the more visible things that Clang includes is a spell checker (also on reddit). The spell checker kicks in when you use an identifier that Clang doesn't know: it checks against other close identifiers and suggests what you probably meant.

...

Clang uses the well known Levenshtein distance function to compute the best match out of the possible candidates.

Clang 的 "did you mean ...?" 变量名校正算法是如何工作的？

How does Clang's "did you mean ...?" variable name correction algorithm work?

c++

compiler-errors

clang

Clangd?

其他兔子洞？