How can I stop memory from growing when calling C functions from Python code through ctypes?

#include<Python.h>

PyObject *getFeature(wchar_t *text);
// where the unigram is a Set Object with type 'PySetObject'
#include<test.h>

PyObject *getFeature(wchar_t *text)
{
    int ret = -1;
    PyObject *featureList = PyList_New(0);

    PyObject *curString = PyUnicode_FromWideChar(text, 2);
    ret = PyList_Append(featureList, curString);
    Py_DECREF(curString);
    return featureList;
}

I then compile it into a shared library named libtest.so, which I can load into Python code with ctypes like this:

import ctypes
import os  # needed for os.path.join below

dir_path = 'path/to/the'  # directory that contains libtest.so
feature_extractor = ctypes.PyDLL(
    os.path.join(dir_path, 'libtest.so'))
get_feature_c = feature_extractor.getFeature
get_feature_c.argtypes = [
    ctypes.c_wchar_p, ctypes.py_object]
get_feature_c.restype = ctypes.py_object

def get_feature(text):
    return [text[:2]]

times = 100000
for i in range(times):
    res = get_feature_c('ncd')  # the memory size will become larger and larger.

for i in range(times):
    res = get_feature('ncd')  # the memory will remain in a fixed size.

The code in your example does not leak:

#include<test.h>

PyObject *getFeature(wchar_t *text)
{
    int ret = -1;
    PyObject *featureList = PyList_New(0);

    // Create new reference to "curString" (allocates memory)
    PyObject *curString = PyUnicode_FromWideChar(text, 2);

    // Add "curString" to "featureList", incrementing reference count
    ret = PyList_Append(featureList, curString);

    // "curString" no longer used, reduce reference count.
    Py_DECREF(curString);

    // Correctly returns a single reference to the list,
    // which contains a single reference to a string
    return featureList;
}

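When res is re-assigned to the return value of get_feature_c, the reference count of the previous value of res (the list) is decremented. If that count reaches zero (it does), the reference to each item in the list is decremented as well, any item whose count reaches zero is freed, and then the list object itself is freed.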
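The release-on-rebinding behavior described above can be observed from pure Python. This is a minimal sketch, assuming CPython (where reference counting frees objects immediately); `TrackedList` and `make_features` are hypothetical names introduced only for illustration, standing in for the list returned by get_feature_c.

```python
import weakref


class TrackedList(list):
    """Hypothetical list subclass: plain lists cannot be weakly referenced."""


def make_features(text):
    # Stand-in for get_feature_c: returns a fresh list on every call.
    return TrackedList([text[:2]])


res = make_features("ncd")
probe = weakref.ref(res)           # observes the list without keeping it alive
alive_before = probe() is not None

res = make_features("abc")         # rebinding drops the only strong reference
freed_after = probe() is None      # on CPython the old list is already gone

print(alive_before, freed_after)
```

The weak reference is used only as an observer; it does not add to the reference count, so it does not change when the old list is freed.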

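The C code you linked to, however, has many leaks because Py_DECREF is never called. When a reference is leaked, the object's reference count never reaches zero, so the object is never freed, and that is a memory leak: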

// Create a new object with "PyUnicode_FromWideChar",
// Add another reference via "featureList",
// so leaked reference to the object.
ret = PyList_Append(featureList, PyUnicode_FromWideChar(charCurrentFeature, 2));

The same happens here:

PyObject *bigrams1 = PySet_New(0);
// each "PyUnicode_FromWideChar" leaks a reference.
ret = PySet_Add(unigrams1, PyUnicode_FromWideChar(L"据", 1));
ret = PySet_Add(unigrams1, PyUnicode_FromWideChar(L"nc", 2));
ret = PySet_Add(unigrams1, PyUnicode_FromWideChar(L"ckd", 3));
ret = PySet_Add(unigrams1, PyUnicode_FromWideChar(L"nc.3e", 5));
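The effect of one missing Py_DECREF can be imitated from Python. The sketch below is only an illustration, assuming a standard (non-free-threaded) CPython build: it calls the exported C API functions `Py_IncRef` and `Py_DecRef` through `ctypes.pythonapi` to add an extra reference that nothing owns, which is what each unreleased `PyUnicode_FromWideChar` result amounts to. The variable names are hypothetical.

```python
import ctypes
import sys

# Fetch fresh function objects so the shared pythonapi attributes are untouched.
inc_ref = ctypes.pythonapi["Py_IncRef"]
inc_ref.argtypes = [ctypes.py_object]
inc_ref.restype = None

dec_ref = ctypes.pythonapi["Py_DecRef"]
dec_ref.argtypes = [ctypes.py_object]
dec_ref.restype = None

feature = ["nc"]
before = sys.getrefcount(feature)

inc_ref(feature)                               # a reference with no owner: a "leak"
leaked = sys.getrefcount(feature) - before

dec_ref(feature)                               # the matching release a C author must write
restored = sys.getrefcount(feature) - before

print(leaked, restored)
```

While the extra reference is outstanding, the object could never be freed, no matter how many Python names are rebound or deleted.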

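You can check whether code leaks references by using a debug build of the test DLL together with a debug build of Python. I'll demonstrate with a Windows build:

test.c - debug build compiled with Microsoft Visual Studio:

cl /LD /MDd /W3 /Ic:\python310\include test.c -link /libpath:c:\python310\libs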

#ifdef _WIN32
#   define API __declspec(dllexport)
#else
#   define API
#endif

#include <Python.h>

API PyObject *getFeature(wchar_t *text)
{
    int ret = -1;
    PyObject *featureList = PyList_New(0);

    PyObject *curString = PyUnicode_FromWideChar(text, 2);  // allocates curString (1st reference)
    ret = PyList_Append(featureList, curString);  // Creates 2nd reference to curString in featureList
    Py_DECREF(curString); // curString no longer used
    return featureList;
}

test.py

import ctypes as ct
import sys

feature_extractor = ct.PyDLL('./test')
get_feature_c = feature_extractor.getFeature
get_feature_c.argtypes = ct.c_wchar_p, # OP example code had error here
get_feature_c.restype = ct.py_object

def get_feature(text):
    return [text[:2]]

times = 10
for i in range(times):
    print(sys.gettotalrefcount()) # Only available in debug build of Python
    res = get_feature_c('ncd')

Output when run with a debug build of Python (required for sys.gettotalrefcount()). Note that the total reference count does not grow inside the loop:

C:\>python_d test.py
70904
70910
70910
70910
70910
70910
70910
70910
70910
70910

Now with Py_DECREF commented out, one reference is leaked on every loop iteration:

70904
70911
70912
70913
70914
70915
70916
70917
70918
70919