A memoized function that takes a tuple of strings and returns an integer?

Question

Suppose I have lists of tuples like these:

a = [('shape', 'rectangle'), ('fill', 'no'), ('size', 'huge')]
b = [('shape', 'rectangle'), ('fill', 'yes'), ('size', 'large')]

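I am trying to convert these lists into numeric vectors, with each dimension representing one feature.

So the expected output would be something like this: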

amod = [1, 0, 1]  # or [1, 1, 1]
bmod = [1, 1, 2]  # or [1, 2, 2]

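So the vector that gets created depends on what has been seen before (i.e. rectangle is still encoded as 1, but the new value 'large' is encoded as the next step, 2).

I figured I could use some combination of yield and a memoize function to help me with this. Here is what I have tried so far: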

def memoize(f):
    memo = {}
    def helper(x):
        if x not in memo:
            memo[x] = f(x)
            return memo[x]
        return helper

@memoize
def verbal_to_value(tup):
    u = 1
    if tup[0] == 'shape':
        yield u
        u += 1
    if tup[0] == 'fill':
        yield u
        u += 1
    if tup[0] == 'size':
        yield u
        u += 1

But I keep getting this error:

TypeError: 'NoneType' object is not callable

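Is there a way to create this function so that it remembers what it has seen? Bonus points if it can add keys dynamically, so that I don't have to hard-code things like 'shape' or 'fill'.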

Answer 1

Not the best approach, but it may help you work out a better solution:

class Shape:
    counter = {}
    def to_tuple(self, tuples):
        self.tuples = tuples
        self._add()
        l = []
        for i,v in self.tuples:
            l.append(self.counter[i][v])
        return l


    def _add(self):
        for i,v in self.tuples:
            if i in self.counter.keys():
                if v not in self.counter[i]:
                    self.counter[i][v] = max(self.counter[i].values()) +1
            else:
                self.counter[i] = {v: 0}

a = [('shape', 'rectangle'), ('fill', 'no'), ('size', 'huge')]

b = [('shape', 'rectangle'), ('fill', 'yes'), ('size', 'large')]   

s = Shape()
print(s.to_tuple(a))  # [0, 0, 0]
print(s.to_tuple(b))  # [0, 1, 1]

Answer 2

First off: this is my favorite implementation of the memoize decorator, mostly because of its speed...

def memoize(f):
    class memodict(dict):
        __slots__ = ()
        def __missing__(self, key):
            self[key] = ret = f(key)
            return ret
    return memodict().__getitem__

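Apart from a few edge cases, it has the same effect as your version once the two return statements are dedented one level (as posted, your memoize never returns helper, so it returns None, which is what triggers the TypeError):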

def memoize(f):
    memo = {}
    def helper(x):
        if x not in memo:
            memo[x] = f(x)
        #else:
        #    pass
        return memo[x]
    return helper

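But it is slightly faster, because the if x not in memo: check happens in native code instead of Python. To understand it, you only need to know that, in the general case, to evaluate adict[key] Python calls adict.__getitem__(key), and if adict does not contain the key, __getitem__() calls adict.__missing__(key) (when the dict subclass defines it), so we can leverage the Python magic-method protocol to our advantage...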
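To make that lookup path concrete, here is a minimal sketch. The names square, calls and CountingMemo are made up for illustration and are not part of the answer's code. It shows that __missing__ runs only the first time a key is looked up; later lookups are served straight from the dictionary.

```python
calls = []  # records every time the wrapped function really runs

def square(x):
    calls.append(x)
    return x * x

class CountingMemo(dict):
    """dict subclass: dict.__getitem__ calls __missing__ only for absent keys."""
    def __missing__(self, key):
        value = self[key] = square(key)  # compute once, then store
        return value

cache = CountingMemo()
first = cache[4]    # key absent: __missing__ runs and stores 16
second = cache[4]   # key present: plain dict lookup, square() is not called
print(first, second, calls)  # 16 16 [4]
```

Note that cache.get(4) and 4 in cache do not go through __missing__; only the cache[key] form does.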

#This is the first idea I had for how I would implement your
#verbal_to_value() using memoization:
from collections import defaultdict

work=defaultdict(set)

@memoize 
def verbal_to_value(kv):
    k, v = kv
    aset = work[k]  #work creates a new set, if not already created.
    aset.add(v)     #add value if not already added
    return len(aset)

Including the memoize decorator, that's 15 lines of code...

#test suite:

def vectorize(alist):
    return [verbal_to_value(kv) for kv in alist]

a = [('shape', 'rectangle'), ('fill', 'no'), ('size', 'huge')]
b = [('shape', 'rectangle'), ('fill', 'yes'), ('size', 'large')]

print (vectorize(a)) #shows [1,1,1]
print (vectorize(b)) #shows [1,2,2]

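defaultdict is a powerful object with almost the same logic as memoize: a standard dictionary in every respect, except that when a lookup fails it runs a callback function to create the missing value. In our case, set().

Unfortunately, this problem requires access either to the tuple being used as the key or to the dictionary state itself. As a result, we cannot just write a simple function for .default_factory.

But we can write a new object based on the memoize/defaultdict pattern: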
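A small sketch of that limitation, using hypothetical names (factory, factory_args, KeyAware) that do not appear in the answer: defaultdict calls default_factory with no arguments, so the factory never learns which key was missing, whereas a __missing__ method on a dict subclass receives the key and can also look at the dictionary itself.

```python
from collections import defaultdict

factory_args = []  # what default_factory was actually called with

def factory(*args):
    factory_args.append(args)  # always the empty tuple
    return set()

d = defaultdict(factory)
d['shape'].add('rectangle')    # missing key: factory() is called with no arguments

class KeyAware(dict):
    def __missing__(self, key):
        # both the key and the current state (len(self)) are visible here
        value = self[key] = (key, len(self) + 1)
        return value

k = KeyAware()
print(factory_args)                 # [()]
print(k['rectangle'], k['square'])  # ('rectangle', 1) ('square', 2)
```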

#This is how I would implement your verbal_to_value without
#memoization, though the worker class is so similar to @memoize,
#that it's easy to see why memoize is a good pattern to work from:
class sloter(dict):
    __slots__ = ()
    def __missing__(self,key):
        self[key] = ret = len(self) + 1
        #this + 1 bothers me, why can't these vectors be 0 based? ;)
        return ret

from collections import defaultdict
work2 = defaultdict(sloter)
def verbal_to_value2(kv):
    k, v = kv
    return work2[k][v]
#~10 lines of code?




#test suite2:

def vectorize2(alist):
    return [verbal_to_value2(kv) for kv in alist]

print (vectorize2(a)) #shows [1,1,1]
print (vectorize2(b)) #shows [1,2,2]

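You may have seen something like sloter before, because it is sometimes used for exactly this kind of situation: converting member names to numbers and back. Because of that, we have the advantage of being able to reverse things like this: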

def unvectorize2(a_vector, pattern=('shape','fill','size')):
    reverser = [{v:k2 for k2,v in work2[k].items()} for k in pattern]
    for index, vect in enumerate(a_vector):
        yield pattern[index], reverser[index][vect]

print (list(unvectorize2(vectorize2(a))))
print (list(unvectorize2(vectorize2(b))))

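But I saw those yields in your original post, and they got me thinking... what if there were a memoize/defaultdict-like object that could take a generator instead of a function, and just knew to advance the generator instead of calling it? Then I realized... yes, generators come with a callable named __next__(), which means we don't need a new defaultdict implementation, just a careful extraction of the correct member function...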

def count(start=0): #same as: from itertools import count
    while True:
        yield start
        start += 1

#so we could get the exact same behavior as above, (except faster)
#by saying:
sloter3=lambda :defaultdict(count(1).__next__)
#and then
work3 = defaultdict(sloter3)
#or just:
work3 = defaultdict(lambda :defaultdict(count(1).__next__))
#which yes, is a bit of a mindwarp if you've never needed to do that
#before.

#the outer defaultdict interprets the first item. Every time a new
#first item is received, the lambda is called, which creates a new
#count() generator (starting from 1), and passes its .__next__ method
#to a new inner defaultdict.

def verbal_to_value3(kv):
    k, v = kv
    return work3[k][v]
#you *could* call that 8 lines of code, but we managed to use
#defaultdict twice, and didn't need to define it, so I wouldn't call
#it 'less complex' or anything.



#test suite3:
def vectorize3(alist):
    return [verbal_to_value3(kv) for kv in alist]

print (vectorize3(a)) #shows [1,1,1]
print (vectorize3(b)) #shows [1,2,2]

#so yes, that can also work.

#and since the internal state in `work3` is stored in the exact same
#format, it can be accessed the same way as `work2` to reconstruct input
#from output.
def unvectorize3(a_vector, pattern=('shape','fill','size')):
    reverser = [{v:k2 for k2,v in work3[k].items()} for k in pattern]
    for index, vect in enumerate(a_vector):
        yield pattern[index], reverser[index][vect]

print (list(unvectorize3(vectorize3(a))))
print (list(unvectorize3(vectorize3(b))))

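Final comments:

Each of these implementations stores its state in a global variable. I find that anti-aesthetic, but depending on what you plan to do with the vector later, it might be a feature, as I have demonstrated.

Edit: After another day of meditating on this, and on the various situations in which I might need it, I think I would encapsulate this functionality like so: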

from collections import defaultdict
from itertools import count
class slotter4:
    def __init__(self):
        #keep track of what order we expect to see keys in
        self.pattern = defaultdict(count(1).__next__)
        #keep track of what values we've seen and what number we've assigned to mean them.
        self.work = defaultdict(lambda :defaultdict(count(1).__next__))
    def slot(self, kv, i=False):
        """used to be named verbal_to_value"""
        k, v = kv
        if i and i != self.pattern[k]:# keep track of order we saw initial keys
            raise ValueError("Input fields out of order")
            #in theory we could ignore this error, and just know
            #that we're going to default to the field order we saw
            #first. Or we could just not keep track, which might be
            #required, if our code runs too slow, but then we cannot
            #make pattern optional in .unvectorize()
        return self.work[k][v]
    def vectorize(self, alist):
        return [self.slot(kv, i) for i, kv in enumerate(alist,1)]
        #if we're not keeping track of field pattern, we could do this instead
        #return [self.work[k][v] for k, v in alist]
    def unvectorize(self, a_vector, pattern=None):
        if pattern is None:
            pattern = [k for k,v in sorted(self.pattern.items(), key=lambda a:a[1])]
        reverser = [{v:k2 for k2,v in self.work[k].items()} for k in pattern]
        return [(pattern[index], reverser[index][vect]) 
                for index, vect in enumerate(a_vector)]

#test suite4:
s = slotter4()
if __name__=='__main__':
    Av = s.vectorize(a)
    Bv = s.vectorize(b)
    print (Av) #shows [1,1,1]
    print (Bv) #shows [1,2,2]
    print (s.unvectorize(Av))#shows a
    print (s.unvectorize(Bv))#shows b
else:
    #run the test silently, and only complain if something has broken
    assert s.unvectorize(s.vectorize(a))==a
    assert s.unvectorize(s.vectorize(b))==b

Good luck!
