Shove gives KeyError when assigning a value

I am using shove to avoid loading a huge dictionary into memory.

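import sys
import string

from shove import Shove

# File-backed store: entries are written under ./storage
lemmaDict = Shove('file://storage')
with open(str(sys.argv[1])) as lemmaCPT:
    for line in lemmaCPT:
        line = line.rstrip('\n')
        # Fields are separated by ' ||| ': source phrase, target phrase, scores
        lineAr = string.split(line, ' ||| ')
        lineKey = lineAr[0] + ' ||| ' + lineAr[1]
        lineValue = lineAr[2]
        print lineValue
        lemmaDict[lineKey] = lineValue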

However, I get the following KeyError and traceback while reading lemmaCPT. What is going on?

Traceback (most recent call last):
  File "./stemmer.py", line 19, in <module>
    lemmaDict[lineKey] = lineValue
  File "/opt/Python-2.7.6/lib/python2.7/site-packages/shove/core.py", line 44, in __setitem__
    self.sync()
  File "/opt/Python-2.7.6/lib/python2.7/site-packages/shove/core.py", line 74, in sync
    self._store.update(self._buffer)
  File "/opt/Python-2.7.6/lib/python2.7/_abcoll.py", line 542, in update
    self[key] = other[key]
  File "/opt/Python-2.7.6/lib/python2.7/site-packages/shove/base.py", line 123, in __setitem__
    raise KeyError(key)
KeyError: '! ! ! \xd1\x87\xd0\xb8\xd1\x82\xd0\xb0\xd0\xb5\xd1\x82\xd1\x81\xd1\x8f \xd1\x82\xd1\x80\xd0\xbe\xd0\xb5\xd0\xba\xd1\x80\xd0\xb0\xd1\x82\xd0\xbd\xd1\x8b\xd0\xbc \xd0\xbf\xd0\xbe\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5\xd0\xbc \xd0\xbb\xd1\x8e\xd0\xb1\xd0\xbe\xd0\xb3\xd0\xbe ||| ! ! ! is pronounced by'

Sample input:

! ! ! читается троекратным повторением ||| ! ! ! is pronounced by repeating ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| ! ! ! is pronounced by ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| ! ! ! is pronounced ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| ! ! ! is ||| 0.00819374 8.53148e-39 0.00989281 0.0128612
! ! ! читается троекратным повторением ||| ! ! ! ||| 0.000119622 8.53148e-39 0.0098932 0.590703
! ! ! читается троекратным повторением ||| , ! ! ! is pronounced by ||| 0.00819374 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| , ! ! ! is pronounced ||| 0.00819374 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| , ! ! ! is ||| 0.00819374 8.53148e-39 0.00989281 0.00154241
! ! ! читается троекратным повторением ||| , ! ! ! ||| 0.0074488 8.53148e-39 0.00989281 0.070842
! ! ! читается троекратным повторением любого ||| ! ! ! is pronounced by repeating ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением любого ||| ! ! ! is pronounced by ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39

Running code.py sampleinput produces the KeyError and traceback above.

Well, if this is the actual input, then the problem seems to be the length of lemmaDict compared with the input...

aftnix@dev:~⟫ cat input | wc -l
11

My modified code...

from shove import Shove
import sys
import string

lemmaDict = Shove('file://storage')
i = 0
with open(str(sys.argv[1])) as lemmaCPT:
    for line in lemmaCPT:
        line = line.rstrip('\n')
        lineAr = string.split(line, ' ||| ')
        lineKey = lineAr[0] + ' ||| ' + lineAr[1]
        lineValue = lineAr[2]
        print lineValue
        print len(lemmaDict)
        #print len(lemmaCPT)
        i+=1
        print i
        #lemmaDict[lineKey] = lineValue

gives the following output...

0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
1
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
2
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
3
0.00819374 8.53148e-39 0.00989281 0.0128612
9
4
0.000119622 8.53148e-39 0.0098932 0.590703
9
5
0.00819374 8.53148e-39 0.00989281 8.53148e-39
9
6
0.00819374 8.53148e-39 0.00989281 8.53148e-39
9
7
0.00819374 8.53148e-39 0.00989281 0.00154241
9
8
0.0074488 8.53148e-39 0.00989281 0.070842
9
9
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
10
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9

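So you are simply overrunning the dict.

If you remove two lines from the input, it stops throwing the exception.

I don't know shove, but a quick check in the shell tells me it always returns a dictionary with 9 keys. There must be a way to grow it... maybe there is a method or something like that... you should dig through its documentation more carefully.

I just think you are using Shove the wrong way.

Edit: This is a bit strange... Looking at the Shove code, it is supposed to sync its in-memory contents to the store when the buffer limit is reached...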

def __setitem__(self, key, value):
        self._cache[key] = self._buffer[key] = value
        # when buffer reaches self._limit, write buffer to store
        if len(self._buffer) >= self._sync:
            self.sync()

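Edit 2

OK, my earlier point was completely wrong... but I did get some interesting pointers. One of the problems is that shove raises a confusing exception...

The real exception happens because...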

def __setitem__(self, key, value):
    # (per Larry Meyn)
    try:
        with open(self._key_to_file(key), 'wb') as item:
            item.write(self.dumps(value))
    except (IOError, OSError):
        # any I/O failure is re-raised as KeyError (base.py line 123 in the traceback)
        raise KeyError(key)

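So the exception actually comes from the open system call, which means shove had trouble writing the file. That gave me a new suspicion: the length of the strings...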
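To see the underlying error that the KeyError hides, the failing open can be reproduced without shove. The following is only a sketch written for Python 3 (the question uses Python 2); `try_create` is a hypothetical helper, and the 300-character name is an arbitrary value chosen to exceed the usual Linux limit.

```python
import errno
import os
import tempfile


def try_create(directory, name):
    """Create an empty file; return None on success, or the errno on failure."""
    path = os.path.join(directory, name)
    try:
        with open(path, 'wb'):
            pass
    except (IOError, OSError) as exc:
        # This is the kind of error that shove turns into KeyError(key)
        return exc.errno
    return None


if __name__ == '__main__':
    workdir = tempfile.mkdtemp()
    print(try_create(workdir, 'short-name'))
    code = try_create(workdir, 'x' * 300)
    # On a typical Linux filesystem this is expected to be ENAMETOOLONG
    print(code, errno.errorcode.get(code))
```

Catching the exception and looking at its errno is what tells a "file name too long" failure apart from, say, a permissions problem.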

What the storage folder looks like...

aftnix@dev:~⟫ ls -l storage/
total 36
-rw-rw-r-- 1 aftnix aftnix 49 ডিসে   4 01:35 %21+%21+%21+%D1%87%D0%B8%D1%82%D0%B0%D0%B5%D1%82%D1%81%D1%8F+%D1%82%D1%80%D0%BE%D0%B5%D0%BA%D1%80%D0%B0%D1%82%D0%BD%D1%8B%D0%BC+%D0%BF%D0%BE%D0%B2%D1%82%D0%BE%D1%80%D0%B5%D0%BD%D0%B8%D0%B5%D0%BC+%7C%7C%7C+%21+%21+%21
-rw-rw-r-- 1 aftnix aftnix 52 ডিসে   4 01:35 %21+%21+%21+%D1%87%D0%B8%D1%82%D0%B0%D0%B5%D1%82%D1%81%D1%8F+%D1%82%D1%80%D0%BE%D0%B5%D0%BA%D1%80%D0%B0%D1%82%D0%BD%D1%8B%D0%BC+%D0%BF%D0%BE%D0%B2%D1%82%D0%BE%D1%80%D0%B5%D0%BD%D0%B8%D0%B5%D0%BC+%7C%7C%7C+%2C+%21+%21+%21+is+pronounced

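So shove uses the (URL-encoded) key as the file name. That can get ugly, because the strings in your last two entries are very long, especially the second-to-last one. As a test, I removed some characters from the last two lines of the input, and the code ran without any exception.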

The Linux kernel has a limit on file name length...

aftnix@dev:~⟫ cat /usr/include/linux/limits.h 
#ifndef _LINUX_LIMITS_H
#define _LINUX_LIMITS_H

#define NR_OPEN         1024

#define NGROUPS_MAX    65536    /* supplemental group IDs are available */
#define ARG_MAX       131072    /* # bytes of args + environ for exec() */
#define LINK_MAX         127    /* # links a file may have */
#define MAX_CANON        255    /* size of the canonical input queue */
#define MAX_INPUT        255    /* size of the type-ahead buffer */
#define NAME_MAX         255    /* # chars in a file name */
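The limit can be checked against the keys directly. The sketch below is written for Python 3 and assumes, judging only from the directory listing above, that shove's file store names each file like `quote_plus(key)`; `encoded_name_length` is a hypothetical helper, not part of shove.

```python
from urllib.parse import quote_plus

NAME_MAX = 255  # from linux/limits.h


def encoded_name_length(key):
    """Length of the file name, assuming it is the quote_plus-encoded key."""
    return len(quote_plus(key))


# A key that was stored successfully, and the key from the traceback
ok_key = '! ! ! читается троекратным повторением ||| ! ! !'
failing_key = ('! ! ! читается троекратным повторением любого'
               ' ||| ! ! ! is pronounced by')

if __name__ == '__main__':
    for key in (ok_key, failing_key):
        length = encoded_name_length(key)
        print(length, 'fits' if length <= NAME_MAX else 'too long')
```

Each Cyrillic letter takes two bytes in UTF-8 and each byte becomes a three-character `%XX` escape, so the encoded name grows about six times faster than the visible text. That is why a key that looks short on screen can still pass 255 characters.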

So, to get around it, you have to do something else. You cannot put the plain parsed key into Shove.
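One possible workaround (a sketch, not something taken from the shove documentation) is to hash each key to a fixed-length string before storing it, and to keep the original key next to the value. The helpers `short_key`, `put` and `get` are hypothetical names, the code is written for Python 3, and a plain dict stands in for the Shove object; it assumes the store can hold a tuple as a value.

```python
import hashlib


def short_key(key):
    """Map a text key of any length to a fixed 40-character hex digest."""
    return hashlib.sha1(key.encode('utf-8')).hexdigest()


def put(store, key, value):
    # Keep the original key with the value so it can still be recovered
    store[short_key(key)] = (key, value)


def get(store, key):
    original, value = store[short_key(key)]
    if original != key:  # would only happen on a hash collision
        raise KeyError(key)
    return value


if __name__ == '__main__':
    store = {}  # stand-in for Shove('file://storage')
    long_key = ('! ! ! читается троекратным повторением любого'
                ' ||| ! ! ! is pronounced by')
    put(store, long_key, '0.00744887 8.53148e-39 0.00989281 8.53148e-39')
    print(list(store))           # one 40-character key
    print(get(store, long_key))
```

With this scheme every file name has the same length regardless of the phrase pair, at the cost of no longer being able to read the keys off the directory listing.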