How to read an MLComp dataset using Python?

MLComp datasets come in a special file format that I am not familiar with. I would like to read them with Python, but I have not been able to.

The first thing to note is that sklearn (v0.17.1, as of July 24, 2016) only supports the MLComp `DocumentClassification` domain.

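Assuming you have downloaded, for example, the WebKB dataset, which has id=523, to /somewhere/on/your/computer, you can use the following sklearn snippet to load the dataset and train a classifier: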

from sklearn.datasets import load_mlcomp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Load mlcomp data using sklearn
train_data = load_mlcomp(name_or_id=523, set_='train', mlcomp_root='/somewhere/on/your/computer')
test_data = load_mlcomp(name_or_id=523, set_='test', mlcomp_root='/somewhere/on/your/computer')
# if you had the environment variable `MLCOMP_DATASETS_HOME` set, you wouldn't need to explicitly pass anything to `mlcomp_root`

# `train_data` and `test_data` are standard `Bunch` objects, so you can now straightforwardly go on and vectorize the dataset,...
vec = CountVectorizer(decode_error='replace')
X_train = vec.fit_transform(train_data.data)
X_test = vec.transform(test_data.data)

# ...train a classifier... 
mnb = MultinomialNB()
mnb.fit(X_train, train_data.target)

# ...and evaluate it.
print('Accuracy: {}'.format(accuracy_score(test_data.target, mnb.predict(X_test))))
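
If `load_mlcomp` is not available in your sklearn version (it was deprecated in 0.19 and removed in 0.21), the same data can probably be read with the standard library alone. The following is only a minimal sketch, not sklearn's implementation. It assumes that the dataset lives in a folder named after its id (e.g. `/somewhere/on/your/computer/523`), that this folder contains a `metadata` file with `key: value` lines, and that each split (`train`, `test`) has one subfolder per category holding one text file per document. The helper names `read_metadata` and `load_documents` are hypothetical, so check the layout of your download before relying on them.

```python
import os


def read_metadata(dataset_path):
    """Parse the `key: value` lines of the dataset's `metadata` file (assumed layout)."""
    metadata = {}
    with open(os.path.join(dataset_path, 'metadata'), encoding='utf-8') as f:
        for line in f:
            if ':' in line:
                key, value = line.split(':', 1)
                metadata[key.strip()] = value.strip()
    return metadata


def load_documents(dataset_path, set_='train'):
    """Return (texts, labels, label_names) for one split.

    Assumes `<dataset_path>/<set_>/<category>/<document file>`.
    Labels are indices into the alphabetically sorted category names.
    """
    split_path = os.path.join(dataset_path, set_)
    label_names = sorted(
        name for name in os.listdir(split_path)
        if os.path.isdir(os.path.join(split_path, name))
    )
    texts, labels = [], []
    for index, name in enumerate(label_names):
        folder = os.path.join(split_path, name)
        for filename in sorted(os.listdir(folder)):
            path = os.path.join(folder, filename)
            if not os.path.isfile(path):
                continue
            # Read bytes and replace undecodable characters,
            # similar in spirit to decode_error='replace' above.
            with open(path, 'rb') as f:
                texts.append(f.read().decode('utf-8', errors='replace'))
            labels.append(index)
    return texts, labels, label_names


# Example usage (paths are the placeholders from the answer above):
# dataset_path = os.path.join('/somewhere/on/your/computer', '523')
# print(read_metadata(dataset_path).get('format'))
# train_texts, train_labels, label_names = load_documents(dataset_path, 'train')
```

The returned `texts` and `labels` lists can be passed to `CountVectorizer` and `MultinomialNB` in place of `train_data.data` and `train_data.target`.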