Logistic regression classifier for prediction in Python

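I am trying to write a script that takes a JSON file (pizza-train.json) from this Kaggle competition. The file is a list of dictionaries; from each dictionary I want to extract the request_text field and build a bag-of-words representation of the string (string to count-list).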

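The next step is to train a logistic regression classifier to predict the variable "requester_received_pizza". I want to train on 90% of the data and predict on the remaining 10%. The problem is that I don't know how to predict on that 10%. Any advice would be very helpful!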

import json
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.feature_extraction.text import CountVectorizer


with open('pizza-train.json') as f:
    f_json = json.load(f)
request_text = []
y = []

for item in f_json[:100]:
    request_text.append(item['request_text'])
    y.append(item['requester_received_pizza'])

vectorizer = CountVectorizer(min_df=1, lowercase=True, stop_words='english')

train_data_features = vectorizer.fit_transform(request_text)
train_data_features = train_data_features.toarray()


print('Shape =')
print(train_data_features.shape)
vocab = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
print('\n')
print('Vocab =')
print(vocab)


x_train, x_test, y_train, y_test = train_test_split(train_data_features, y, test_size=0.10)

You can do it like this:

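from sklearn.linear_model import LogisticRegression
alg = LogisticRegression()
alg.fit(x_train, y_train)               # train on the 90% split
predictions = alg.predict(x_test)       # predicted labels for the held-out 10%
test_score = alg.score(x_test, y_test)  # mean accuracy on the held-out 10%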
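
If you want the predictions themselves and not only the score, here is a rough end-to-end sketch. It assumes a recent scikit-learn and uses made-up toy strings in place of pizza-train.json; the names texts, labels and model are just for illustration. It also splits the raw text first and puts the vectorizer in a pipeline, so the vocabulary is learned from the training 90% only, which differs slightly from the code in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for request_text and requester_received_pizza (made-up data).
texts = ["lost my job, kids are hungry", "just bored and want pizza"] * 10
labels = [True, False] * 10

# Hold out 10% of the raw requests before any vectorizing takes place.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.10, random_state=0, stratify=labels
)

# Bag-of-words counts followed by logistic regression, fitted on the 90% only.
model = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(),
)
model.fit(text_train, y_train)

predictions = model.predict(text_test)                # predicted True/False per request
probabilities = model.predict_proba(text_test)[:, 1]  # estimated P(received pizza)
accuracy = model.score(text_test, y_test)             # fraction of correct predictions

print(predictions)
print(probabilities)
print(accuracy)
```

With the real data you would pass the request_text list and y from the question instead of the toy lists.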

You should read the sklearn documentation on logistic regression and cross validation, which is very good and provides more sophisticated methods for validating your models. This Kaggle Titanic competition tutorial may also be useful.
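
As a starting point for the cross validation mentioned above, something like the following might work. It is only a sketch on made-up toy data, and the choice of 5 folds is an assumption, not something taken from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-ins for request_text and requester_received_pizza (made-up data).
texts = ["lost my job, kids are hungry", "just bored and want pizza"] * 10
labels = [True, False] * 10

# The vectorizer sits inside the pipeline, so it is refitted on each training fold.
model = make_pipeline(CountVectorizer(stop_words="english"), LogisticRegression())

# 5-fold cross validation: every request is held out exactly once.
scores = cross_val_score(model, texts, labels, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```

Compared with a single 90/10 split, this gives an accuracy estimate that depends less on which rows happened to land in the test set.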