Chain memoizer in Python

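I already have a memoizer that works well. It serializes the inputs with pickle dumps and uses their MD5 hash as the key. The function results are quite large and are stored as pickle files named after the MD5 hash. When I call two memoized functions in sequence, the memoizer loads the output of the first function and passes it to the second. The second function then serializes it, computes the MD5, and loads its own output. Here is a very simple example: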

@memoize
def f(x):
    ...
    return y

@memoize
def g(x):
    ...
    return y

y1 = f(x1)
y2 = g(y1)

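When f is evaluated, y1 is loaded from disk, and it is then serialized again when g is evaluated. Is it possible to somehow bypass this step and pass the key of y1 (i.e. its MD5 hash) to g? If g already has this key, it loads y2 from disk. If not, it "requests" the full y1 in order to evaluate g.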

Edit:

import cPickle as pickle
import inspect
import hashlib

class memoize(object):
    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        arg = inspect.getargspec(self.func).args
        file_name = self._get_key(*args, **kwargs)
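        try:
            # Pickle files are binary, so open in binary mode.
            f = open(file_name, "rb")
            out = pickle.load(f)
            f.close()
        except (IOError, EOFError, pickle.UnpicklingError):
            # Cache miss (or unreadable cache file): recompute and store.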
            out = self.func(*args, **kwargs)
            f = open(file_name, "wb")
            pickle.dump(out, f, 2)
            f.close()

        return out

    def _arg_hash(self, *args, **kwargs):
        _str = pickle.dumps(args, 2) + pickle.dumps(kwargs, 2)
        return hashlib.md5(_str).hexdigest()

    def _src_hash(self):
        _src = inspect.getsource(self.func)
        return hashlib.md5(_src).hexdigest()

    def _get_key(self, *args, **kwargs):
        arg = self._arg_hash(*args, **kwargs)
        src = self._src_hash()
        return src + '_' + arg + '.pkl'

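I think you could do this automatically, but I generally think it is better to be explicit about "lazy" evaluation. So I'll present a way that adds an extra argument to your memoized functions: lazy. However, I'll simplify the helpers a bit and use stand-ins instead of files, pickle and md5: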

# I use a dictionary as storage instead of files
storage = {}

# No md5, just hash
def calculate_md5(obj):
    print('calculating md5 of', obj)
    return hash(obj)

# create dictionary entry instead of pickling the data to a file
def create_file(md5, data):
    print('creating file for md5', md5)
    storage[md5] = data

# Load dictionary entry instead of unpickling a file
def load_file(md5):
    print('loading file with md5 of', md5)
    return storage[md5]
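
For comparison, the same three helpers could be backed by real pickle files and MD5 digests. This is only a sketch under assumptions: it targets Python 3 (plain pickle rather than cPickle), and CACHE_DIR and _path_for are hypothetical names that do not appear in the question. Note that equal objects are not guaranteed to pickle to identical bytes in every case (sets are one example), so the digest is only as stable as the pickled representation.

```python
import hashlib
import os
import pickle
import tempfile

# Hypothetical cache directory; any writable directory would do.
CACHE_DIR = tempfile.mkdtemp(prefix="memo_cache_")


def calculate_md5(obj):
    """Return the MD5 hex digest of the pickled object."""
    return hashlib.md5(pickle.dumps(obj, protocol=2)).hexdigest()


def _path_for(md5):
    """Map a digest to its pickle file inside CACHE_DIR."""
    return os.path.join(CACHE_DIR, md5 + ".pkl")


def create_file(md5, data):
    """Pickle data to a file named after the given digest."""
    with open(_path_for(md5), "wb") as fh:
        pickle.dump(data, fh, protocol=2)


def load_file(md5):
    """Unpickle the file stored under the given digest."""
    with open(_path_for(md5), "rb") as fh:
        return pickle.load(fh)
```

Because the signatures match the dictionary-based stand-ins, the rest of the answer would work with either set of helpers.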

I use a custom class as the intermediate object:

class MemoizedObject(object):
    def __init__(self, md5):
        self.md5 = md5

    def get_real_data(self):
        print('load...')
        return load_file(self.md5)

    def __repr__(self):
        return '{self.__class__.__name__}(md5={self.md5})'.format(self=self)

Finally, here is the changed Memoize, assuming your functions take only one argument:

class Memoize(object):
    def __init__(self, func):
        self.func = func
        # The md5 to md5 storage is needed to find the result file 
        # or result md5 for lazy evaluation.
        self.md5_to_md5_storage = {}

    def __call__(self, x, lazy=False):
        # If the argument is a memoized object there is no need to
        # calculate the hash, we can just look it up.
        if isinstance(x, MemoizedObject):
            key = x.md5
        else:
            key = calculate_md5(x)

        if lazy and key in self.md5_to_md5_storage:
            # Check if the key is present in the md5 to md5 storage, otherwise
            # we can't be lazy
            return MemoizedObject(self.md5_to_md5_storage[key])
        elif not lazy and key in self.md5_to_md5_storage:
            # Not lazy but we know the result
            result = load_file(self.md5_to_md5_storage[key])
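        else:
            # Unknown argument: a lazy placeholder has to be resolved
            # to the real data before the function can be evaluated.
            if isinstance(x, MemoizedObject):
                x = x.get_real_data()
            result = self.func(x)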
            result_md5 = calculate_md5(result)
            create_file(result_md5, result)
            self.md5_to_md5_storage[key] = result_md5
        return result
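
One thing to keep in mind: md5_to_md5_storage is an ordinary in-memory dictionary, so the mapping from argument key to result key is lost when the process exits, even if the result files themselves are on disk. A minimal sketch of one way to persist it is shown here, assuming the keys are strings such as MD5 hex digests (JSON object keys are always strings). PersistentIndex is a hypothetical helper, not part of the question or of any library.

```python
import json
import os


class PersistentIndex(object):
    """Hypothetical dict-like map from argument key to result key, saved as JSON."""

    def __init__(self, path):
        self.path = path
        if os.path.exists(path):
            # Reuse the mapping written by an earlier run.
            with open(path) as fh:
                self._data = json.load(fh)
        else:
            self._data = {}

    def __contains__(self, key):
        return key in self._data

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value
        # Rewrite the whole file on every update; fine for a small index.
        with open(self.path, "w") as fh:
            json.dump(self._data, fh)
```

Since it supports `in`, item lookup and item assignment, an instance could be assigned to self.md5_to_md5_storage in place of the plain dictionary, with one index file per memoized function.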

Now, if you call your functions and specify lazy in the right places, you can avoid loading (unpickling) your files:

@Memoize
def f(x):
    return x+1

@Memoize
def g(x):
    return x+2

Normal (first) run:

>>> x1 = 10
>>> y1 = f(x1)
calculating md5 of 10
calculating md5 of 11
creating file for md5 11
>>> y2 = g(y1)
calculating md5 of 11
calculating md5 of 13
creating file for md5 13

Without lazy:

>>> x1 = 10
>>> y1 = f(x1)
calculating md5 of 10
loading file with md5 of 11
>>> y2 = g(y1)
calculating md5 of 11
loading file with md5 of 13

With lazy=True:

>>> x1 = 10
>>> y1 = f(x1, lazy=True)
calculating md5 of 10
>>> y2 = g(y1)
loading file with md5 of 13

The last option only calculates the "md5" of the first argument and loads the file of the end result. That should be exactly what you want.
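
The sessions shown here do not cover the case from the question where g does not know the key yet and has to "request" the full y1. The following self-contained sketch exercises that path. It rests on assumptions: the placeholder is resolved through get_real_data() on a cache miss, and the loads list is a hypothetical addition used only to record which simulated files get loaded.

```python
storage = {}   # stands in for the pickle files on disk
loads = []     # hypothetical log of every simulated file load


def calculate_md5(obj):
    return hash(obj)  # stand-in for a real MD5 of the pickled object


def create_file(md5, data):
    storage[md5] = data


def load_file(md5):
    loads.append(md5)
    return storage[md5]


class MemoizedObject(object):
    """Placeholder that carries only the key of a cached result."""

    def __init__(self, md5):
        self.md5 = md5

    def get_real_data(self):
        return load_file(self.md5)


class Memoize(object):
    def __init__(self, func):
        self.func = func
        self.md5_to_md5_storage = {}  # argument key -> result key

    def __call__(self, x, lazy=False):
        if isinstance(x, MemoizedObject):
            key = x.md5
        else:
            key = calculate_md5(x)

        if key in self.md5_to_md5_storage:
            result_md5 = self.md5_to_md5_storage[key]
            if lazy:
                return MemoizedObject(result_md5)
            return load_file(result_md5)

        # Cache miss: a placeholder must be resolved before calling func.
        if isinstance(x, MemoizedObject):
            x = x.get_real_data()
        result = self.func(x)
        result_md5 = calculate_md5(result)
        create_file(result_md5, result)
        self.md5_to_md5_storage[key] = result_md5
        return result


@Memoize
def f(x):
    return x + 1


@Memoize
def g(x):
    return x + 2


f(10)                           # first run caches f's result
y1 = f(10, lazy=True)           # placeholder only, nothing is loaded
assert loads == []
y2 = g(y1)                      # g has no entry yet, so y1 is loaded once
assert y2 == 13 and loads == [calculate_md5(11)]
y2_again = g(f(10, lazy=True))  # both cached: only g's result is loaded
assert y2_again == 13 and loads == [calculate_md5(11), calculate_md5(13)]
```

On the first chained call the intermediate result is loaded exactly once, because g has to compute with it. On the second call only the end result is loaded.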