交叉验证时关键错误不在索引中

Question

我已经在我的数据集上应用了 svm。我的数据集是多标签的，这意味着每个观察值都有多个标签。

而 KFold cross-validation 它会引发错误 not in index。

它显示了从 601 到 6007 的索引 not in index（我有 1...6008 个数据样本）。

这是我的代码：

   df = pd.read_csv("finalupdatedothers.csv")
categories = ['ADR','WD','EF','INF','SSI','DI','others']
X= df[['sentences']]
y = df[['ADR','WD','EF','INF','SSI','DI','others']]
kf = KFold(n_splits=10)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

SVC_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
            ])

for category in categories:
    print('... Processing {} '.format(category))
    # train the model using X_dtm & y
    SVC_pipeline.fit(X_train['sentences'], y_train[category])

    prediction = SVC_pipeline.predict(X_test['sentences'])
    print('SVM Linear Test accuracy is {} '.format(accuracy_score(X_test[category], prediction)))
    print 'SVM Linear f1 measurement is {} '.format(f1_score(X_test[category], prediction, average='weighted'))
    print([{X_test[i]: categories[prediction[i]]} for i in range(len(list(prediction)))])

其实我不知道如何应用KFold交叉验证，我可以分别得到每个标签的F1分数和准确率。查看 this and this 对我如何才能成功申请我的案子没有帮助。

为了可重现，这是数据框的一个小样本 最后七个功能是我的标签，包括 ADR、WD、...

,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,0
5,I have no idea when this will end.,0,0,0,0,0,0,1

更新

当我按照 Vivek Kumar 所说的去做时它引发了错误

ValueError: Found input variables with inconsistent numbers of samples: [1, 5408]

在分类器部分。你知道如何解决吗？

在 Whosebug 中有几个链接说明我需要重塑训练数据。我也这样做了，但没有成功谢谢:)

Answer 1

train_index、test_index是基于行数的整数索引。但是 pandas 索引不是那样工作的。 pandas 的较新版本在如何从中切片或 select 数据方面更加严格。

您需要使用 .iloc 来访问数据。更多信息available here

这就是你需要的：

for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    ...
    ...

    # TfidfVectorizer dont work with DataFrame, 
    # because iterating a DataFrame gives the column names, not the actual data
    # So specify explicitly the column name, to get the sentences

    SVC_pipeline.fit(X_train['sentences'], y_train[category])

    prediction = SVC_pipeline.predict(X_test['sentences'])

交叉验证时关键错误不在索引中

key error not in index while cross validation

python

scikit-learn

cross-validation