Fast capture stack trace on windows / 64-bit / mixed mode

Question

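As most of you probably know, there are many different mechanisms for walking stack traces, starting with the Windows API and continuing down into the depths of the magical world of assembly. Let me list some of the links I have already studied.

First of all, I want a memory leak analysis mechanism for a mixed-mode (managed and unmanaged) / 64-bit + AnyCPU application. Of all the Windows APIs, CaptureStackBackTrace suits my needs best, but as far as I have analyzed, it does not support walking managed code frames. Still, that API is the closest to what I need, because it also computes a backtrace hash, a unique identifier for a particular call stack.

I have already ruled out other approaches to locating memory leaks: most of the software I tried either crashed, worked unreliably, or gave poor results.

I also do not want to recompile the existing software and override malloc/new or similar mechanisms, because that is a heavy task (we have a huge code base with a large number of DLLs). Besides, I suspect this is not a one-time job: the problem comes back in cycles of 1-2 years, depending on who was coding and what, so I would prefer memory leak detection built into the application itself (memory API hooking) rather than fighting this problem over and over.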

http://www.codeproject.com/Articles/11132/Walking-the-callstack

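Uses the StackWalk64 Windows API function, but does not work with managed code. Its 64-bit support is also not fully clear to me: I have seen some workarounds for 64-bit issues, and I suspect this code does not fully work when the stack walk is done within the same thread.

Then there is Process Hacker: http://processhacker.sourceforge.net/

It also uses StackWalk64, but extends it with callback functions (the 7th and 8th parameters) to support mixed-mode stack walking. After a lot of complications with the 7th/8th callback functions, I also managed to get StackWalk64 working with mixed-mode support (capturing the stack trace as a vector in which each pointer refers to the assembly/DLL location the call passed through). But as you might guess, the performance of StackWalk64 is not sufficient for my needs: even with a simple message box on the C# side, the application simply "hangs" for a while until it starts up properly.

I have not seen such heavy delays with CaptureStackBackTrace calls, so I assume that StackWalk64 is too slow for my purposes.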

There is also a COM-based approach to determining the stack trace, for example: http://www.codeproject.com/Articles/371137/A-Mixed-Mode-Stackwalk-with-the-IDebugClient-Inter

http://blog.steveniemitz.com/building-a-mixed-mode-stack-walker-part-1/

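What worries me is that it requires COM, and the thread needs COM to be initialized. Because of the memory API hooking, I should not touch the COM state in any thread, since that can cause worse problems (for example incorrect apartment initialization and other glitches).

I have now reached the point where the Windows API is not sufficient for my needs and I need to walk the call stack manually. Such examples can be found, for instance:

http://www.codeproject.com/Articles/11221/Easy-Detection-of-Memory-Leaks See the function FillStackInfo / 32-bit only, does not support managed code.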

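There are several mentions of reversing stack traces, for example in the following links:

Links 1, 3 and 4 in particular make for interesting night reading. :-)

But interesting as these mechanisms are, there is no complete working demo for any of them.

I guess one good example is Wine's dbghelp implementation (the Windows "emulator" for Linux), which also shows how StackWalk64 works in the end, but I suspect it is heavily bound to executables in the DWARF2 format, so it does not carry over to the current Windows PE executable format.

Can someone point me to a good stack walking implementation that works on 64-bit architecture, has mixed-mode support (can track both native and managed memory allocations), and relies purely on register / call stack / code analysis? (A combined implementation of 1, 3, 4.)

Does anyone have good contacts with the Microsoft development team who could answer this question?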

Answer 1

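x64 stack walking is complex, as you have already found out. A simple alternative is not to do it at all and to leave the hard part to the OS ETW stack walker. This works, and it is much faster than anything you could ever do yourself.

You can take advantage of it by emitting your own ETW events. Before that, you need to start an ETW session for your event provider and enable stack walking for your provider. There is a catch on Windows 7: it does not work unless the managed stack frames are all NGen'ed, because the x64 ETW stack walker stops when it finds a stack frame that is not inside any loaded module, which is the case for JIT-compiled code.

Starting with Windows 8, the ETW stack walker always walks the first MB of the stack for stack frames, which fixes the JIT issue. If ETW tracing is on, the JIT compiler emits unwind info for the generated code and registers it via RtlAddGrowableFunctionTable, which is what makes fast stack walking from inside the kernel possible in the first place. When ETW tracing is not enabled, things work differently for compatibility reasons.

If you are after malloc/free or new/delete memory leaks, you can also use the built-in OS heap allocation tracing, which has existed since Windows 7. See xperf -help start and https://randomascii.wordpress.com/2015/04/27/etw-heap-tracingevery-allocation-recorded/ for more information about heap allocation tracing. You can enable it for already running processes without problems. The downside is that the generated data is huge for any real-world application. But if you are only after large allocations, it can help to trace only VirtualAlloc calls, which can also be enabled with minimal overhead.

Managed code since .NET 4.5 also has its own ETW allocation tracing provider, with full stack walking even on x64 Windows 7, because it performs the full managed stack walk itself. More details can be found in the CoreCLR sources: see ETW::SamplingLog::SendStackTrace in https://github.com/dotnet/coreclr/blob/master/src/inc/eventtracebase.h.

This is only a rough overview of what is possible. Getting all the necessary details right would, I am afraid, take a whole book. And I am still learning new things every day.

Here is a heapalloc.cmd script that you can use to trace heap allocations. By default it records into a 500MB ring buffer, in case your leak accumulates over a longer period of time. Recording all allocation stacks without compacting them at run time will not work with WPA. But you can post-process a huge ETL file and write your own viewer for it.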

@echo off 
setlocal enabledelayedexpansion
REM consider using a different drive for ETL output to prevent slowing down 
REM your application and to prevent lost buffers
set OUTDIR=C:\TEMP
set OUTFILENAME=HeapTracing.etl
REM Final output file
set OUTFILE=!OUTDIR!\!OUTFILENAME!
set CLRUNDOWNFILE=!OUTDIR!\clr_HeapDCend.etl
set KERNELFILE=!OUTDIR!\kernel.etl
set CLRSESSIONFILE=!OUTDIR!\clrHeapSession.etl
set HEAPUSERFILE=!OUTDIR!\HeapUserSession.etl
REM Default is allocation and realloc to track memory leaks
REM HeapFree is the other option to track double free calls
set HEAPTRACINGFLAGS=HeapAlloc+HeapRealloc 

if "%3" NEQ "" (
echo Overriding Heap Tracing Flags with: %3
set HEAPTRACINGFLAGS=%3
)


if "%1" EQU "-start" ( 
    call :StartTracing -PidNewProcess %2
    goto :Exit 
) 

if "%1" EQU "-attachPid" ( 
    call :StartTracing -Pids %2
    goto :Exit 
) 

if "%1" EQU "-startNext" (
    reg add "HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\%~nx2" /v TracingFlags /t REG_DWORD /d 1 /f
    if not !errorlevel! == 0 goto failure
    call :StartTracing -Pids 0
    goto :Exit
)

if "%1" EQU "-stop" ( 
    set XPERF_CreateNGenPdbs=1
    xperf -start ClrRundownSession -on e13c0d23-ccbc-4e12-931b-d9cc2eee27e4:0x118:5+a669021c-c450-4609-a035-5af59af4df18:0x118:5 -f "!CLRUNDOWNFILE!" -buffersize 256 -minbuffers 256 -maxbuffers 512 
    call :WaitUntilRundownCompleted "!CLRUNDOWNFILE!"
    xperf -stop -stop ClrSession ClrRundownSession HeapSession | findstr /V identifiable 2> NUL

    echo Merging profiles
    REM Redirect the symbol path so that the generated pdb files end up in a directory named after our etl file
    set TMPSYMBOLPATH=!_NT_SYMBOL_PATH!
    REM Each tool is using a different pdb cache folder. If you are using them side by side 
    REM you have to wait a long time to refresh the pdb cache. To spare the waiting time we use 
    REM the pdb cache folder from WPR

    mkdir C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache 2> NUL
    set _NT_SYMBOL_PATH=srv*C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache 
    mklink /D "!OUTFILE!.NGENPDB" C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache  2> NUL

    echo Managed PDBs are stored at: !OUTFILE!.NGENPDB. If you want to transfer the etl do not forget to copy this directory with the pdbs as well. 
    echo Merging ETL files and generating native pdbs

    xperf -merge  "!KERNELFILE!" "!CLRSESSIONFILE!" "!CLRUNDOWNFILE!" "!HEAPUSERFILE!" "!OUTFILE!"
    set _NT_SYMBOL_PATH=!TMPSYMBOLPATH!
    echo !OUTFILE! was created

    if "%2" NEQ "" reg delete "HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\%~nx2" /v TracingFlags /f 2> NUL
    goto :Exit 
) 

goto :Usage

:StartTracing
xperf -start ClrSession -on  Microsoft-Windows-DotNETRuntime:5 -f "!CLRSESSIONFILE!" -buffersize 128 -minbuffers 256 -maxbuffers 512 
xperf -on PROC_THREAD+LOADER+latency+virt_alloc -stackwalk VirtualAlloc  -f "%KERNELFILE%"
xperf -start HeapSession -heap %1 %2 -BufferSize 1024 -MinBuffers 128 -MaxBuffers 1024 -stackwalk %HEAPTRACINGFLAGS% -f "!HEAPUSERFILE!" -FileMode Circular -MaxFile 1024
exit /B

REM Wait until writing to ETL file has stopped by checking its file size
:WaitUntilRundownCompleted
:StillWriting
    for %%F in (%1) do set "size=%%~zF"
    timeout /T 1  > nul
    for %%F in (%1) do set "size2=%%~zF"
    if "!size!" EQU "" goto :EndWriting
    if "!size!" NEQ "!size2!" goto StillWriting
:EndWriting
timeout /T 1  > nul
exit /B


:Usage
    echo Usage: 
    echo HeapAlloc.cmd -start [executable] or -stop
    echo               -start [executable] Start a trace session 
    echo               -startNext [executable] Start heap tracing for all subsequent calls to executable.
    echo               -attachPid ddd Start a trace session for specified process
    echo               -stop  [executable] Stop a trace session 
    echo Examples
    echo     HeapAlloc.cmd -startNext devenv.exe
    echo     HeapAlloc.cmd -stop      devenv.exe
    echo To attach to a running process
    echo     HeapAlloc.cmd -attachPid dddd
    echo     HeapAlloc.cmd -stop 
    echo You must call -stop for your executable if you have used -start or -startNext because heap allocation tracing will stay enabled until you stop it.
goto :Exit 

:failure
    echo Error occurred
goto :Exit

:Exit

Answer 2

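Answering my own question:

Apparently CaptureStackBackTrace calls RtlCaptureStackBackTrace, directly or indirectly, and the source code of that function is apparently available: it can be found by searching for "windows research kernel".

The same code, which I came across by accident: https://github.com/dotnet/coreclr/blob/master/src/unwinder/amd64/unwinder_amd64.cpp

The code references the Windows kernel:

Everything below is borrowed from minkernel\ntos\rtl\amd64\exdsptch.c file from Windows

By googling I found the Windows kernel source itself.

Maybe I can upgrade that function to support managed stack frames (using the information from Process Hacker).

[4.1.2015] After deeper analysis, it appears that the main performance bottleneck is not CaptureStackBackTrace itself, since that is just simple iteration and structure lookups, but the managed-mode stack walk, which calls into C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll / OutOfProcessFunctionTableCallback. You can find its source code in the .NET distribution, and apparently it allocates memory while analyzing the JIT-compiled structures. The problem is that JIT compilation can change things at any time, so the only way to get a reliable stack trace is to re-query the same information over and over, which causes memory allocation overhead. I guess the code needs to be changed so that mscordacwks-like code does not allocate memory itself, but instead uses the run-time structures to determine the call stack and the function table / function entries.

P.S. If you downvote this answer, I would like to know why, and what the alternatives are. Even better if you have tried the alternatives yourself.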

Answer 3

9-1-2015 - I found the original function that Process Hacker was calling. It is

C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll OutOfProcessFunctionTableCallback

Its source code is here: https://github.com/dotnet/coreclr/blob/master/src/debug/daccess/fntableaccess.cpp

From there I found the owner of most of the changes in that source code, Jan Kotas (jkotas@microsoft.com), and contacted him about this problem.

From: Jan Kotas <jkotas@microsoft.com>
To: Tarmo Pikaro <tapika@yahoo.com> 
Sent: Friday, January 8, 2016 3:27 PM
Subject: RE: Fast capture stack trace on windows 64 bit / mixed mode...

...

The mscordacwks.dll is called mscordaccore.dll in CoreCLR / github repro. The VS project 
files are auto-generated for it during the build 
(\coreclr\bin\obj\Windows_NT.x64.Debug\src\dlls\mscordac\mscordaccore.vcxproj).
You should be able to build and debug CoreCLR to understand how it works.
...

From: Jan Kotas <jkotas@microsoft.com>
To: Tarmo Pikaro <tapika@yahoo.com> 
Sent: Saturday, January 9, 2016 2:02 AM
Subject: RE: Fast capture stack trace on windows 64 bit / mixed mode...

> I've tried to replace 
> C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll dll loading 
> with C:\Prototyping\dotNet\coreclr-master\bin\obj\Windows_NT.x64.Debug\src\dlls\mscordac\Debug\mscordaccore.dll
> loading (just compiled), but if previously I could get mixed mode stack trace correctly:
> ...

mscordacwks.dll is tightly coupled with the runtime. You cannot mix and match them between runtimes.
What I meant is that you can use CoreCLR to understand how this works.

But later he recommended this solution, which worked for me:

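int CaptureStackBackTrace3(int FramesToSkip, int nFrames, PVOID* BackTrace, PDWORD pBackTraceHash)
{
    // Note: FramesToSkip and pBackTraceHash are not used by this snippet yet.

    // Snapshot the current thread's registers; unwinding starts from here.
    CONTEXT ContextRecord;
    RtlCaptureContext(&ContextRecord);

    UINT iFrame;
    for (iFrame = 0; iFrame < (UINT)nFrames; iFrame++)
    {
        // Find the unwind data that covers the current instruction pointer.
        DWORD64 ImageBase;
        PRUNTIME_FUNCTION pFunctionEntry = RtlLookupFunctionEntry(ContextRecord.Rip, &ImageBase, NULL);

        // No unwind data for this address: stop walking.
        if (pFunctionEntry == NULL)
            break;

        // Virtually unwind one frame; afterwards ContextRecord describes the caller.
        PVOID HandlerData;
        DWORD64 EstablisherFrame;
        RtlVirtualUnwind(UNW_FLAG_NHANDLER,
            ImageBase,
            ContextRecord.Rip,
            pFunctionEntry,
            &ContextRecord,
            &HandlerData,
            &EstablisherFrame,
            NULL);

        // Rip now holds the return address into the caller.
        BackTrace[iFrame] = (PVOID)ContextRecord.Rip;
    }

    // Number of frames actually captured.
    return (int)iFrame;
}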

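This code snippet still lacks the backtrace hash computation, but that can be added afterwards.

It is also important to note that when debugging this code snippet you should use native debugging, not mixed mode (C# projects use mixed mode by default), because mixed mode somehow disturbs the stack trace in the debugger. (I have yet to figure out how and why this distortion happens.)

One piece of the puzzle is still missing: how to make symbol resolution fully resistant to FreeLibrary and to disposal of JIT-compiled code. That is something I still need to figure out.

Note that this approach will most likely work only on 64-bit (x64) architecture, not on ARM or 32-bit. The snippet as written reads ContextRecord.Rip, which exists only in the x64 CONTEXT structure.

Another interesting thing is that there is a function RtlCaptureStackBackTrace, which somewhat resembles the Windows API function CaptureStackBackTrace, yet they differ in some way, at least in naming. Also, if you look into RtlCaptureStackBackTrace, it eventually calls RtlVirtualUnwind. You can check this in the Windows Research Kernel source code: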

RtlCaptureStackBackTrace
>
RtlWalkFrameChain
>
RtlpWalkFrameChain
>
RtlVirtualUnwind

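But RtlCaptureStackBackTrace, as I tested it, does not work correctly, unlike the function above, which is built on RtlVirtualUnwind.

It's a kind of magic. :-)

I will continue with phase 2 of this investigation in a follow-up question, here: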

Answer 4

By the way, if anyone is missing the original implementation of StackWalk for Windows, it is located here:

https://github.com/dotnet/coreclr/blob/master/src/utilcode/stacktrace.cpp

Answer 5

25.1.2016 Posting this separately, as supplementary information.

For the unique stack ID, CaptureStackBackTrace uses a simple sum of all instruction pointers. The idea is borrowed from "Windows_Research_Kernel(sources)\WRK-v1.2\base\ntos\rtl\amd64\stkwalk.c":

    size_t hashValue = 0;

    for (int i = 0; i < nFrames; i++)
        hashValue += PtrToUlong(BackTrace[i]);

    *pBackTraceHash = (DWORD)hashValue;

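I am not sure about the final cast: some declarations specify the last parameter as DWORD, some as ULONG64, but that is not relevant here. The main problem with this computation is that it is not unique enough. A sum does not depend on the order of its terms, so stacks that contain the same frames in a different order get the same ID. This shows up, for example, with recursive function calls. If you have the call sequence: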

func1
func2
func3

then the stack trace:

func1
func3
func2

will have exactly the same hash.
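The collision can be reproduced without any Windows API. The following is a minimal, portable sketch, not the kernel's code: SumHash is a hypothetical helper that mimics the additive ID, and kFunc1..kFunc3 are made-up addresses standing in for return addresses inside func1..func3.

```cpp
#include <cstddef>
#include <cstdint>

// Additive stack ID in the spirit of the snippet above: the sum of the low
// 32 bits of every frame address (mimics the PtrToUlong truncation).
inline std::uint32_t SumHash(const std::uintptr_t* frames, std::size_t count)
{
    std::uint32_t hash = 0;
    for (std::size_t i = 0; i < count; i++)
        hash += static_cast<std::uint32_t>(frames[i]);
    return hash;
}

// Made-up addresses (assumption, for illustration only).
const std::uintptr_t kFunc1 = 0x401000, kFunc2 = 0x402000, kFunc3 = 0x403000;

// The same three frames in two different orders.
const std::uintptr_t kStackA[] = { kFunc1, kFunc2, kFunc3 };
const std::uintptr_t kStackB[] = { kFunc1, kFunc3, kFunc2 };
```

SumHash(kStackA, 3) and SumHash(kStackB, 3) return the same value, because addition ignores the order of the frames.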

In the case I debugged, memory leak detection produced 62876 false hits: the unique stack ID computation is not reliable enough.

I have changed the formula to:

static DWORD crc32_tab[] =
{
    0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
    0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
    0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,
    0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
    0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9,
    0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
    0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c,
    0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
    0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,
    0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
    0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106,
    0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
    0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,
    0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
    0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,
    0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
    0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,
    0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
    0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,
    0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
    0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,
    0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
    0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,
    0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
    0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
    0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
    0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,
    0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
    0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,
    0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
    0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,
    0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
    0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,
    0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
    0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,
    0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
    0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,
    0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
    0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,
    0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
    0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,
    0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
    0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
};

if (pBackTraceHash)
{

    size_t hashValue = 0;
    for( int idxFrame = 0; idxFrame < (int)iFrame; idxFrame++ )
    {
        unsigned char* p = (unsigned char*)&BackTrace[idxFrame];
        for( int i = 0; i < sizeof(void*); i++ )
            hashValue = crc32_tab[ ((hashValue ^ *p++) & 0xFF) ] ^ (hashValue >> 8);
    }
    *pBackTraceHash = (DWORD)hashValue;
}

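In my tests this algorithm gave no false hits, but it slows execution down slightly.

The memory leak statistics also differ:

Unreliable algorithm: total leaked memory: 48'874'764 / in 371 allocation pools
Crc32-based algorithm: total leaked memory: 48'874'764 / in 614 allocation pools

As you can see, the unreliable algorithm merges similar but distinct call stacks together: the result looks less fragmented, but the original call stacks are lost and the statistics are incorrect.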
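Why the total stays the same while the pool count changes can be seen in a small bookkeeping sketch. It is only an illustration under assumptions: LeakStats and LeakPool are hypothetical names, and the real leak detector is not shown in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Totals for all live allocations that share one stack ID.
struct LeakPool
{
    std::size_t allocations = 0;
    std::size_t bytes = 0;
};

// Hypothetical bookkeeping helper: groups live allocations by stack ID.
class LeakStats
{
public:
    // Record one allocation of `size` bytes made from the stack `stackId`.
    void OnAlloc(std::uint32_t stackId, std::size_t size)
    {
        LeakPool& pool = pools_[stackId];
        pool.allocations++;
        pool.bytes += size;
    }

    // Number of distinct stack IDs seen, i.e. the number of pools.
    std::size_t PoolCount() const { return pools_.size(); }

    // Sum of bytes over all pools; it does not depend on the grouping.
    std::size_t TotalBytes() const
    {
        std::size_t total = 0;
        for (const auto& entry : pools_)
            total += entry.second.bytes;
        return total;
    }

private:
    std::map<std::uint32_t, LeakPool> pools_;
};
```

If two different call stacks hash to the same ID, both allocations land in one pool: TotalBytes() is unchanged, but PoolCount() drops and the pool no longer maps to a single call stack.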

Can someone suggest a faster algorithm for this?
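One possible direction, offered only as an unbenchmarked sketch: mix one whole pointer per step instead of one byte per step, so each frame costs a single xor and multiply rather than eight table lookups. MixHash64 and MixHash32 are hypothetical names. The constants are the 64-bit FNV offset basis and prime, but because the mixing is per pointer this is not standard FNV-1a.

```cpp
#include <cstddef>
#include <cstdint>

// Order-sensitive stack ID: xor each frame address into the state, then
// multiply by an odd constant. Not standard FNV-1a (that works per byte).
inline std::uint64_t MixHash64(const std::uintptr_t* frames, std::size_t count)
{
    std::uint64_t hash = 14695981039346656037ull;   // FNV-64 offset basis
    for (std::size_t i = 0; i < count; i++)
    {
        hash ^= static_cast<std::uint64_t>(frames[i]);
        hash *= 1099511628211ull;                    // FNV-64 prime
    }
    return hash;
}

// Fold the 64-bit state into a DWORD-sized ID.
inline std::uint32_t MixHash32(const std::uintptr_t* frames, std::size_t count)
{
    const std::uint64_t hash = MixHash64(frames, count);
    return static_cast<std::uint32_t>(hash ^ (hash >> 32));
}
```

Unlike the plain sum, the result depends on the order of the frames. Any fixed-width ID can still collide, so if exactness matters, compare the stored frames whenever two IDs match.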

Answer 6

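27.1.2016 The next problem that comes up right away is call stack determination on 32-bit. I asked which API to use: CaptureStackBackTrace at least yields an incomplete walk (native code only), and the RtlVirtualUnwind API function does not exist on 32-bit Windows.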

From: Noah Falk <noahfalk@microsoft.com>
To: Tarmo Pikaro <tapika@yahoo.com>; Mike McLaughlin <mikem@microsoft.com> 
Cc: Jan Kotas <jkotas@microsoft.com>
Sent: Tuesday, January 26, 2016 1:34 AM
Subject: RE: Resolving managed call stack from void*

Hi Tarmo, hope the exploration of stackwalking has been interesting. 
If I followed you correctly you’ve been successful on x64 but hoping you can extend your technique to 32 bit. 
Indeed the RtlCaptureVirtualUnwind techniques don’t work here, and the fundamental reason behind it is that 
while x64 defines a specific calling convention that all code on Windows is forced to use, x86 does not. 
This means that there is no algorithm the OS could implement which guarantees correct unwinding when PDBs are 
unavailable. However you do have some options:

1)      You can use simple heuristics that work for certain kinds of code. 
Unoptimzed code on x86 often uses EBP chaining, in which ESP in the current frame points to EBP, and EBP points 
to the parent frame’s EBP, and so on down the stack. The return address is stored on the stack adjacent to EBP. 
As I recall all jitted code produced by recent versions of .Net follows these conventions, including optimized 
jitted code. However when a compiler performs inlining these conventions will be unable to detect it, and optimized 
code that does not follow this convention could easily cause the stack to become unwalkable.

2)      If you are willing to load PDBs you can use the DIA APIs to walk the stack: 
https://msdn.microsoft.com/en-us/library/dt06fh94.aspx. The PDB contains additional data about optimized code 
which allows frames that do not follow the EBP chaining convention to be correctly unwound. 
This is the stack walk API that Visual Studio is using when it debugs 32 bit native code on Windows.

3)      The ICorDebug APIs (https://msdn.microsoft.com/en-us/library/dd646502(v=vs.110).aspx) are a set of 
APIs that are designed to support managed code debuggers. Starting in .Net 4.0 the ICorDebug API supports 
dump debugging, however the API is designed in such a way that you don’t have to serialize a dump file. 
This is likely to be more complicated than you would want, but its supported to the use the Windows process 
snapshot APIs to take a snapshot of the memory space and then direct the ICorDebug API to read from this 
snapshot as if it was a dump. One advantage of the ICorDebug API is that not only will it give you managed 
stack frames, it also allows exporing all the other kinds of data debuggers would expose such as parameters, 
local values, fields of objects, types of the values, etc.

The MDbg tool (https://www.microsoft.com/en-us/download/details.aspx?id=2282) is a complete sample debugger 
with source included. It supports dump debugging and displaying callstacks, though it won’t have any specific 
example about using the process snapshot APIs in place of using a dump. The main change would be replacing 
the implementation of ICorDebugDataTarget. MDbg has an implementation that reads from a dump file and you 
would need to create a new implementation that reads from a process snapshot using the windows APIs 
(https://msdn.microsoft.com/en-us/library/dn457825(v=vs.85).aspx). I’ve never written the code myself and 
I’ve heard from other tool authors that they found using the windows snapshot APIs more difficult than expected,
 but eventually they were successful.
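The EBP chaining described in option 1 can be illustrated without touching real registers. The sketch below is a simulation under stated assumptions: `stack` is an array standing in for stack memory, "pointers" are indices into it, slot [fp] holds the caller's frame pointer and slot [fp + 1] holds the return address. WalkFrameChain and kEndOfChain are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Marks the oldest frame: its saved frame pointer slot holds this value.
const std::uintptr_t kEndOfChain = UINTPTR_MAX;

// Follows the chain of saved frame pointers and collects return addresses.
// Returns the number of frames written to backTrace.
inline std::size_t WalkFrameChain(const std::uintptr_t* stack, std::size_t stackSize,
                                  std::uintptr_t framePointer,
                                  std::uintptr_t* backTrace, std::size_t maxFrames)
{
    std::size_t captured = 0;
    while (captured < maxFrames &&
           framePointer != kEndOfChain &&
           framePointer + 1 < stackSize)
    {
        // The return address sits right next to the saved frame pointer.
        backTrace[captured++] = stack[framePointer + 1];

        // The saved frame pointer leads to the caller's frame.
        const std::uintptr_t next = stack[framePointer];

        // A valid chain only moves toward older frames; stop on corruption.
        if (next != kEndOfChain && next <= framePointer)
            break;
        framePointer = next;
    }
    return captured;
}
```

With the simulated stack { 2, 0x1111, 4, 0x2222, kEndOfChain, 0x3333 } and a starting frame pointer of 0, the walk visits three frames and collects 0x1111, 0x2222 and 0x3333. Code that does not keep this layout, such as optimized code without frame pointers, breaks the chain, which is the limitation the mail points out.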

Inspired by approach 1, and having already seen a similar approach in another project, I wrote my own 32-bit stack walking implementation:

int CaptureStackBackTracePro( int FramesToSkip, int nFrames, PVOID* BackTrace, PDWORD pBackTraceHash )
{
    //
    //  This approach was taken from StackInfoManager.cpp / FillStackInfo
    //  http://www.codeproject.com/Articles/11221/Easy-Detection-of-Memory-Leaks
    //  - slightly simplified the function itself.
    //
    int regEBP;
    __asm mov regEBP, ebp;

    long *pFrame = (long*) regEBP;              // pointer to current function frame
    void* pNextInstruction;
    int iFrame = 0;

    //
    // Using __try/__except is faster than using ReadProcessMemory or VirtualProtect.
    // We return whatever frames we have collected so far after exception was encountered.
    //
    __try {
        for( ; iFrame < nFrames; iFrame++ )
        {
            pNextInstruction = (void*)(*(pFrame + 1));

            if( !pNextInstruction )     // Last frame
                break;

            BackTrace[iFrame] = pNextInstruction;
            pFrame = (long*)(*pFrame);
        }
    }
    __except(EXCEPTION_EXECUTE_HANDLER) 
    {
    }

    // pBackTraceHash is not filled in here; see the code snippet in another answer.

    return iFrame;

} //CaptureStackBackTracePro

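A short test shows that this function is able to capture both native and managed stack frames.

I guess optimized code needs deeper analysis. Perhaps it is better to omit optimization, or to optimize only the relevant parts of the code, for the sake of better diagnostics?!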

windows

mixed-mode

memory-leaks

stack-trace