Receiving pandas KeyErrors seemingly at random, without changing my code. Possible memory errors?

I am working through an sklearn tutorial book, which contains the following section:

Next, we create a TfidfVectorizer. Recall from Chapter 3, Feature Extraction and Preprocessing, that TfidfVectorizer combines CountVectorizer and TfidfTransformer. We fit it with the training messages, and transform both the training and test messages:

>>> vectorizer = TfidfVectorizer()
>>> X_train = vectorizer.fit_transform(X_train_raw)
>>> X_test = vectorizer.transform(X_test_raw)
Finally, we create an instance of LogisticRegression and train our model. Like LinearRegression, LogisticRegression implements the fit() and predict() methods. As a sanity check, we printed a few predictions for manual inspection:

>>> classifier = LogisticRegression()
>>> classifier.fit(X_train, y_train)
>>> predictions = classifier.predict(X_test)
>>> for i, prediction in enumerate(predictions[:5]):
>>>     print 'Prediction: %s. Message: %s' % (prediction, X_test_raw[i])
The following is the output of the script:

Prediction: ham. Message: If you don't respond imma assume you're still asleep and imma start calling n shit
Prediction: spam. Message: HOT LIVE FANTASIES call now 08707500020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870 is a national rate call
Prediction: ham. Message: Yup... I havent been there before... You want to go for the yoga? I can call up to book 
Prediction: ham. Message: Hi, can i please get a  <#>  dollar loan from you. I.ll pay you back by mid february. Pls.
Prediction: ham. Message: Where do you need to go to get it?

I then followed along with:

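import sys

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# In older scikit-learn releases this lives in sklearn.cross_validation.
from sklearn.model_selection import train_test_split

# Directory containing the SMSSpamCollection file, passed on the command line.
ddir = sys.argv[1]

df = pd.read_csv(ddir + '/SMSSpamCollection', delimiter='\t', header=None)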

#print df.head
#print 'Number of spam messages: ', df[df[0] == 'spam'][0].count()
#print 'Number of ham messages: ', df[df[0] == 'ham'][0].count()

X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])

vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)


classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

for i, pdn in enumerate(predictions):
    print 'Prediction: %s. Message: %s' % (pdn, X_test_raw[i])

However, for some reason this gave me an error. Assuming my modification was the cause, I rewrote my code to follow the book's lines:

for i, prediction in enumerate(predictions[:5]):
    print 'Prediction: %s. Message: %s' % (prediction, X_test_raw[i])

However, this printed only two predictions before crashing:

Number of spam messages: 747
Number of ham messages: 4825
['ham' 'ham' 'ham' ..., 'ham' 'ham' 'ham']
Prediction: ham. Message: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
Prediction: ham. Message: Ok lar... Joking wif u oni...
Traceback (most recent call last):
  File "Chapter4[B-FLGTLG][Y-SF]--[DC].py", line 38, in <module>
    print 'Prediction: %s. Message: %s' % (prediction, X_test_raw[i])
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 583, in __getitem__
    result = self.index.get_value(self, key)
  File "/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.py", line 1980, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/index.pyx", line 103, in pandas.index.IndexEngine.get_value (pandas/index.c:3332)
  File "pandas/index.pyx", line 111, in pandas.index.IndexEngine.get_value (pandas/index.c:3035)
  File "pandas/index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)
  File "pandas/hashtable.pyx", line 303, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6610)
  File "pandas/hashtable.pyx", line 309, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6554)
KeyError: 2

Now here is the puzzle: I ran the exact same script a second time, without changing a single line, and it gave me a different error:

Number of spam messages: 747
Number of ham messages: 4825
['ham' 'ham' 'ham' ..., 'ham' 'ham' 'ham']
Traceback (most recent call last):
  File "Chapter4[B-FLGTLG][Y-SF]--[DC].py", line 38, in <module>
    print 'Prediction: %s. Message: %s' % (prediction, X_test_raw[i])
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 583, in __getitem__
    result = self.index.get_value(self, key)
  File "/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.py", line 1980, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/index.pyx", line 103, in pandas.index.IndexEngine.get_value (pandas/index.c:3332)
  File "pandas/index.pyx", line 111, in pandas.index.IndexEngine.get_value (pandas/index.c:3035)
  File "pandas/index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)
  File "pandas/hashtable.pyx", line 303, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6610)
  File "pandas/hashtable.pyx", line 309, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6554)
KeyError: 0

How is this possible? Do I have some low-level RAM error interfering with Python itself? The data is here: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection in case anyone wants to follow along.

Update

I found the solution here:

I changed:

print 'Prediction: %s. Message: %s' % (pdn, X_test_raw[i])

to:

print 'Prediction: %s. Message: %s' % (pdn, X_test_raw.iloc[i])

Now it works correctly:

Number of spam messages: 747
Number of ham messages: 4825
['ham' 'spam' 'ham' ..., 'spam' 'ham' 'ham']
Prediction: ham. Message: Well done, blimey, exercise, yeah, i kinda remember wot that is, hmm. 
Prediction: spam. Message: U have won a nokia 6230 plus a free digital camera. This is what u get when u win our FREE auction. To take part send NOKIA to 83383 now. POBOX114/14TCR/W1 16
Prediction: ham. Message: I doubt you could handle 5 times per night in any case...
Prediction: ham. Message: I've told you everything will stop. Just dont let her get dehydrated.
Prediction: ham. Message: AH POOR BABY!HOPE URFEELING BETTERSN LUV! PROBTHAT OVERDOSE OF WORK HEY GO CAREFUL SPK 2 U SN LOTS OF LOVEJEN XXX.
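
For anyone hitting the same thing, my understanding of why .iloc fixes it: train_test_split shuffles the rows by default, and the resulting Series keeps the original row labels, so X_test_raw[i] looks up the label i rather than the i-th element. Which labels end up in the test set changes from run to run, which would explain the different KeyErrors without any hardware fault. Below is a minimal sketch using a small made-up Series (the name messages and its labels are hypothetical stand-ins for X_test_raw, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for X_test_raw: the index labels are leftover row
# numbers from the original DataFrame, in shuffled order with gaps.
messages = pd.Series(['msg a', 'msg b', 'msg c'], index=[4, 0, 2])

# Positional access: .iloc[i] returns the i-th element regardless of labels.
by_position = [messages.iloc[i] for i in range(len(messages))]

# Label access: .loc[i] looks for the index label i. On an integer index,
# plain messages[i] is treated as a label lookup in the same way.
missing_labels = []
for i in range(len(messages)):
    try:
        messages.loc[i]
    except KeyError:
        # Label i is not in this subset (it would be in the training split).
        missing_labels.append(i)

print('By position: %s' % by_position)
print('Labels that raise KeyError: %s' % missing_labels)
```

Iterating with zip(predictions, X_test_raw), or calling X_test_raw.reset_index(drop=True) before the loop, should also avoid the label lookup.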