Why is this call to a dynamic library function so slow?

Question

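I am writing a shared library to be called from python. Since this is my first time using python's ctypes module, and nearly my first time writing a shared library, I have been writing both C and python code to call the library's functions.

I put some timing code in and found that while most calls from the C program to the library are very fast, the first one is slow, much slower than its python counterpart in fact. This runs counter to my expectations, and I am hoping someone can tell me why.

Here is a stripped-down version of the header file from my C library.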

typedef struct MdaDataStruct
{
    int numPts;
    int numDists;
    float* data;
    float* dists;
} MdaData;

//allocate the structure
void* makeMdaStruct(int numPts, int numDist);

//deallocate the structure
void freeMdaStruct(void* strPtr);

//assign the data array
void setData(void* strPtr, float* divData);

The C program that calls the functions is as follows:

#include <stdio.h>
#include <time.h>
/* the library header shown above must also be included */

int main(int argc, char* argv[])
{
    clock_t t1, t2;
    t1=clock();
    long long int diff;
    //test the allocate function
    t1 = clock();
    MdaData* dataPtr = makeMdaStruct(10, 3);
    t2 = clock();
    diff = (((t2-t1)*1000000)/CLOCKS_PER_SEC);
    printf("make struct, took: %lld microseconds\n", diff);

    //make some data
    float testArr[10] = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9};

    //test the set data function
    t1 = clock();
    setData(dataPtr, testArr);
    t2 = clock();
    diff = (((t2-t1)*1000000)/CLOCKS_PER_SEC);
    printf("set data, took: %lld microseconds\n", diff);

    //test the deallocate function
    t1 = clock();
    freeMdaStruct(dataPtr);
    t2 = clock();
    diff = (((t2-t1)*1000000)/CLOCKS_PER_SEC);
    printf("free struct, took: %lld microseconds\n", diff);

    //exit
    return 0;
}

Here is the python script that calls the functions:

import time
from ctypes import cdll, c_void_p

import numpy as np

# load the library
t1 = time.time()
cs_lib = cdll.LoadLibrary("./libChiSq.so")
t2 = time.time()
print("load library, took", int((t2-t1)*1000000), "microseconds")
# tell python the function will return a void pointer
cs_lib.makeMdaStruct.restype = c_void_p
# make the structure to hold the MdaData with 10 data points and 3 dists
t1 = time.time()
mdaData = cs_lib.makeMdaStruct(10,3)
t2 = time.time()
print("make struct, took", int((t2-t1)*1000000), "microseconds")
# make an array with the test data
divDat = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], np.float32)
# run the function to load the array into the struct
t1 = time.time()
cs_lib.setData(mdaData, divDat.ctypes.data)
t2 = time.time()
print("set data, took", int((t2-t1)*1000000), "microseconds")
# free the structure
t1 = time.time()
cs_lib.freeMdaStruct(mdaData)
t2 = time.time()
print("free struct, took", int((t2-t1)*1000000), "microseconds")

Finally, here is the output of running the two back to back:

[]$ ./tester
make struct, took: 60 microseconds
set data, took: 2 microseconds
free struct, took: 2 microseconds
[]$ python so_py_tester.py 
load library, took 77 microseconds
make struct, took 3 microseconds
set data, took 23 microseconds
free struct, took 10 microseconds

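As you can see, the C call to makeMdaStruct takes 60us and the python call to makeMdaStruct takes 3us, which is highly confusing.

My best guess is that somehow the C code pays the cost of loading the library at the first call? Which confuses me because I thought that the library was loaded when the program was loaded into memory.

Edit: I think there might be a kernel of truth to the guess, because when I put extra untimed calls to makeMdaStruct and freeMdaStruct before the timed call to makeMdaStruct, I got the following test output: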

[]$ ./tester
make struct, took: 1 microseconds
set data, took: 1 microseconds
free struct, took: 0 microseconds
[]$ python so_py_tester.py 
load library, took 70 microseconds
make struct, took 4 microseconds
set data, took 23 microseconds
free struct, took 12 microseconds

Answer 1

My best guess was that somehow the C code pays the cost of loading the library at the first call? Which confuses me because I thought that the library was loaded when the program was loaded into memory.

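You are correct on both counts. The library is loaded when the program is loaded. However, the dynamic loader/linker defers symbol resolution until function call time.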

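Calls into a shared library are made indirectly, through an entry in the procedure linkage table (PLT). Initially, all of the entries in the PLT point to ld.so. The first time a function is called, ld.so looks up the actual address of the symbol, updates the entry in the PLT, and then jumps to the function. This is "lazy" symbol resolution.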
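As a rough illustration of the mechanism described above, here is a toy model in Python. It is only a sketch of the idea, not how ld.so is implemented: the names REAL_SYMBOLS, plt, make_stub and call are made up for this example, and a dictionary stands in for the real table. Each entry starts out as a resolver stub; the first call pays for the lookup and patches the entry, and later calls go straight to the function.

```python
# Toy model of lazy binding: a jump table whose entries start out as
# resolver stubs and are patched on first use.
# All names here are illustrative; none of this is a real loader API.

resolve_count = 0  # how many times the slow lookup path ran


def make_mda_struct(num_pts, num_dists):
    """Stand-in for the real library function."""
    return {"numPts": num_pts, "numDists": num_dists}


# What the pretend dynamic linker knows: symbol name -> real function.
REAL_SYMBOLS = {"makeMdaStruct": make_mda_struct}

plt = {}  # symbol name -> callable currently installed in the table


def make_stub(name):
    def stub(*args):
        global resolve_count
        resolve_count += 1        # slow path: look the symbol up
        target = REAL_SYMBOLS[name]
        plt[name] = target        # patch the table entry
        return target(*args)      # then jump to the real function
    return stub


for name in REAL_SYMBOLS:
    plt[name] = make_stub(name)   # initially every entry is a stub


def call(name, *args):
    return plt[name](*args)


first = call("makeMdaStruct", 10, 3)   # goes through the stub
second = call("makeMdaStruct", 10, 3)  # goes straight to the function
print(resolve_count)
```

In this model the extra cost lands on the first call only, which matches the pattern in the question's C timings, where an untimed warm-up call made the timed call fast.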

You can set the LD_BIND_NOW environment variable to change this behavior. From ld.so(8):

LD_BIND_NOW (libc5; glibc since 2.1.1) If set to a nonempty string, causes the dynamic linker to resolve all symbols at program startup instead of deferring function call resolution to the point when they are first referenced. This is useful when using a debugger.
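One way to try this without changing the shell environment is to launch the test program from a small script that sets the variable for the child process only. The sketch below assumes a Linux system with glibc; the helper name run_with_bind_now is made up for this example, and the demonstration runs a child Python process that just echoes the variable, since the ./tester binary from the question is not available here.

```python
import os
import subprocess
import sys


def run_with_bind_now(cmd):
    """Run cmd with LD_BIND_NOW=1 added to a copy of the environment."""
    env = dict(os.environ)
    env["LD_BIND_NOW"] = "1"  # any nonempty string enables eager binding
    return subprocess.run(
        cmd, env=env, capture_output=True, text=True, check=True
    )


# Demonstration: the child only reports that it saw the variable.
# For the question's program, cmd would be something like ["./tester"].
result = run_with_bind_now(
    [sys.executable, "-c", "import os; print(os.environ['LD_BIND_NOW'])"]
)
print(result.stdout.strip())
```

With the variable set, symbol resolution should happen before main starts, so the first timed call in the C tester would no longer include that cost.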

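This behavior can also be changed at link time: lazy binding is the default, and the companion keyword now requests eager binding instead. From ld(1):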

  -z keyword
      The recognized keywords are:
      ...
      lazy
           When generating an executable or shared library, mark it to
           tell the dynamic linker to defer function call resolution to
           the point when the function is called (lazy binding), rather
           than at load time.  Lazy binding is the default.
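This may also explain why the python timings do not show the same first-call penalty. ctypes does not go through the program's lazily bound call stubs: it opens the library at run time and looks each symbol up when the function is first accessed as an attribute. In the question's script, the line that sets makeMdaStruct.restype does that lookup before the timer starts. The sketch below shows the same pattern against the C library's strlen instead of libChiSq.so. It assumes a Unix-like system where ctypes.CDLL(None) gives a handle to the symbols already loaded into the process.

```python
import ctypes

# Handle to the symbols already loaded into this process; on typical
# Linux systems this includes the C library (assumption).
libc = ctypes.CDLL(None)

# Attribute access performs the symbol lookup and caches the function
# object on the library handle.
strlen = libc.strlen
strlen.restype = ctypes.c_size_t
strlen.argtypes = [ctypes.c_char_p]

# By the time the call runs, the address is already known.
length = strlen(b"makeMdaStruct")

# A missing symbol fails at lookup time, not at call time.
# The symbol name below is deliberately nonexistent.
try:
    libc.no_such_function_hypothetical
    lookup_failed = False
except AttributeError:
    lookup_failed = True

print(length, lookup_failed)
```

By the same reasoning, the setData and freeMdaStruct timings in the python output may include the first lookup of those symbols, since neither is accessed before its timed call.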

Further reading:

http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/
