How to get Top 3 or Top N predictions using sklearn's SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import linear_model

arr = ['dogs cats lions', 'apple pineapple orange', 'water fire earth air', 'sodium potassium calcium']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(arr)
# on scikit-learn older than 1.0 this is get_feature_names()
feature_names = vectorizer.get_feature_names_out()
Y = ['animals', 'fruits', 'elements', 'chemicals']
T = ["eating apple roasted in fire and enjoying fresh air"]
test = vectorizer.transform(T)
# on scikit-learn older than 1.1 this is loss='log'
clf = linear_model.SGDClassifier(loss='log_loss')
clf.fit(X, Y)
x = clf.predict(test)
print(x[0])
# prints: elements
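In the code above, clf.predict() returns only the single best prediction for a given sample.
I am interested in the top 3 predictions for a particular sample. I know that predict_proba / predict_log_proba return the probabilities of every label in list Y, but the result has to be sorted and then matched back to the labels in Y before I can read off the top 3.
Is there a direct and efficient way to do this?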
There is no built-in function, but what is wrong with
probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[:, -n:]
?
As suggested in one of the comments, the slice was changed from [-n:] to [:, -n:], so that it takes the last n columns of every row rather than the last n rows.
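The question also asks how to tie the sorted indices back to the labels. A minimal sketch of that step follows, assuming the columns of predict_proba follow clf.classes_ (which scikit-learn keeps in sorted order, not in the order of Y). The classes and probs arrays here are made-up stand-ins for clf.classes_ and clf.predict_proba(test), not output from the model above.

```python
import numpy as np

# Hypothetical stand-ins for clf.classes_ and clf.predict_proba(test).
classes = np.array(['animals', 'chemicals', 'elements', 'fruits'])
probs = np.array([[0.10, 0.05, 0.50, 0.35],
                  [0.60, 0.20, 0.15, 0.05]])

n = 3
# argsort is ascending, so keep the last n columns and reverse them
# to put the most probable class first.
best_n = np.argsort(probs, axis=1)[:, -n:][:, ::-1]

top_labels = classes[best_n]                           # shape (n_samples, n)
top_probs = np.take_along_axis(probs, best_n, axis=1)  # matching probabilities

for labels, ps in zip(top_labels, top_probs):
    print(list(zip(labels.tolist(), ps.tolist())))
```

Indexing classes with the whole index array avoids a Python loop over samples; np.take_along_axis needs NumPy 1.15 or newer.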
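Hopefully Andreas can help with this. predict_proba is not available when loss='hinge'. To get the top n classes when loss='hinge', do:
from sklearn.calibration import CalibratedClassifierCV
# clfSDG is the SGDClassifier fitted with loss='hinge'; train and test_data are your own data
calibrated_clf = CalibratedClassifierCV(clfSDG, cv=3, method='sigmoid')
model = calibrated_clf.fit(train.data, train.label)
probs = model.predict_proba(test_data)
# (class, probability) pairs for the first test sample, least probable first
sorted(zip(calibrated_clf.classes_, probs[0]), key=lambda x: x[1])[-n:]
I am not sure whether clfSDG.predict and calibrated_clf.predict always predict the same class.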
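If only the ranking of classes is needed and not calibrated probabilities, one possible alternative is to sort the raw decision_function scores of the hinge-loss model. This is a sketch of a different technique from the calibration approach above, reusing the toy data from the question; random_state=0 is set only to make the run repeatable.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

docs = ['dogs cats lions', 'apple pineapple orange',
        'water fire earth air', 'sodium potassium calcium']
labels = ['animals', 'fruits', 'elements', 'chemicals']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
test = vectorizer.transform(['eating apple roasted in fire and enjoying fresh air'])

# Hinge loss: predict_proba is unavailable, decision_function still works.
clf = SGDClassifier(loss='hinge', random_state=0).fit(X, labels)

n = 3
scores = clf.decision_function(test)            # shape (n_samples, n_classes)
top_n_idx = np.argsort(-scores, axis=1)[:, :n]  # highest score first
top_n_labels = clf.classes_[top_n_idx]
print(top_n_labels)
```

Because predict picks the class with the highest decision score, the first column should agree with clf.predict(test) barring exact ties. The scores are signed distances to each one-vs-rest hyperplane, so they are useful for ranking but should not be read as probabilities.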
I know this has already been answered, but I can add a bit more.
# preds and truths both have shape (m, n_classes): m predictions, one column per class;
# truths is expected to be one-hot, since argmax is used to recover the true class
def top_n_accuracy(preds, truths, n):
    best_n = np.argsort(preds, axis=1)[:, -n:]
    ts = np.argmax(truths, axis=1)
    successes = 0
    for i in range(ts.shape[0]):
        if ts[i] in best_n[i, :]:
            successes += 1
    return float(successes) / ts.shape[0]
It is quick and dirty, but I find it useful. You can add your own error checking and so on.
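The loop can also be written with broadcasting. The sketch below is a hypothetical vectorized variant (top_n_accuracy_vectorized is my own name, and the arrays are made up), keeping the same one-hot convention for truths. Recent scikit-learn releases (0.24 and later) also provide sklearn.metrics.top_k_accuracy_score, which may be preferable if you already have integer or string labels.

```python
import numpy as np

def top_n_accuracy_vectorized(preds, truths, n):
    """Fraction of rows whose true class is among the n highest scores.

    preds:  (m, n_classes) array of scores or probabilities
    truths: (m, n_classes) one-hot array of true classes
    """
    best_n = np.argsort(preds, axis=1)[:, -n:]     # indices of the n largest per row
    true_idx = np.argmax(truths, axis=1)           # index of the true class per row
    hits = (best_n == true_idx[:, None]).any(axis=1)
    return float(hits.mean())

preds = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.2, 0.3],
                  [0.2, 0.3, 0.5]])
truths = np.array([[0, 0, 1],
                   [0, 1, 0],
                   [0, 0, 1]])

print(top_n_accuracy_vectorized(preds, truths, 1))  # 1 of 3 rows correct
print(top_n_accuracy_vectorized(preds, truths, 2))  # 2 of 3 rows correct
```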
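argsort returns indices in ascending order. If you want to avoid awkward loops or confusion, you can use a simple trick:
probs = clf.predict_proba(test)
best_n = np.argsort(-probs, axis=1)[:, :n]
Negating the probabilities turns the largest value into the smallest, so the first n columns of the ascending sort are the top n results in descending order of probability.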
As @FredFoo describes in "How do I get indices of N maximum values in a NumPy array?", a faster method is to use argpartition:
Newer NumPy versions (1.8 and up) have a function called argpartition for this. To get the indices of the four largest elements, do
>>> a = np.array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> a
array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> ind = np.argpartition(a, -4)[-4:]
>>> ind
array([1, 5, 8, 0])
>>> a[ind]
array([4, 9, 6, 9])
Unlike argsort, this function runs in linear time in the worst case, but the returned indices are not sorted, as can be seen from the result of evaluating a[ind]. If you need that too, sort them afterwards:
>>> ind[np.argsort(a[ind])]
array([1, 8, 5, 0])
To get the top-k elements in sorted order in this way takes O(n + k log k) time.
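The quoted example works on a 1-D array, while predict_proba returns one row per sample. Here is a rough sketch of the same idea applied row by row with axis=1; top_n_indices is a hypothetical helper name and probs is made-up data, and it assumes n is no larger than the number of classes.

```python
import numpy as np

def top_n_indices(probs, n):
    """Indices of the n largest entries in each row, largest first."""
    # Partition each row so its n largest values sit in the last n columns
    # (in no particular order).
    part = np.argpartition(probs, -n, axis=1)[:, -n:]
    # Sort only those n candidates per row, in descending order of value.
    part_vals = np.take_along_axis(probs, part, axis=1)
    order = np.argsort(-part_vals, axis=1)
    return np.take_along_axis(part, order, axis=1)

probs = np.array([[0.10, 0.05, 0.50, 0.35],
                  [0.60, 0.20, 0.15, 0.05]])
print(top_n_indices(probs, 2))
```

For a handful of classes a plain argsort is just as good; the partition step is only likely to matter when the number of classes is large compared with n.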
I wrote a function that outputs a data frame containing the top n predictions and their probabilities, tied to the class names. Hope this helps!
import numpy as np
import pandas as pd

def return_top_n_pred_prob_df(n, model, X_test, column_name):
    predictions = model.predict_proba(X_test)
    preds_idx = np.argsort(-predictions)  # class indices per row, most probable first
    n_docs = predictions.shape[0]  # len(X_test) fails when X_test is a scipy sparse matrix
    classes = pd.DataFrame(model.classes_, columns=['class_name'])
    classes.reset_index(inplace=True)
    top_n_preds = pd.DataFrame()
    for i in range(n):
        num_col = column_name + '_prediction_{}_num'.format(i)
        top_n_preds[num_col] = [preds_idx[doc][i] for doc in range(n_docs)]
        top_n_preds[column_name + '_prediction_{}_probability'.format(i)] = [predictions[doc][preds_idx[doc][i]] for doc in range(n_docs)]
        # look up the class name for each index, then drop the helper columns
        top_n_preds = top_n_preds.merge(classes, how='left', left_on=num_col, right_on='index')
        top_n_preds = top_n_preds.rename(columns={'class_name': column_name + '_prediction_{}'.format(i)})
        try:
            top_n_preds.drop(columns=['index', num_col], inplace=True)
        except KeyError:
            pass
    return top_n_preds
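For comparison, a more compact way to build a similar frame is to index the class array directly instead of merging. This is only a sketch: top_n_frame, its prefix argument and the column names are hypothetical, and probs and classes are made-up stand-ins for model.predict_proba(X_test) and model.classes_.

```python
import numpy as np
import pandas as pd

def top_n_frame(probs, classes, n, prefix='pred'):
    """One row per sample: the n most probable class names and their probabilities."""
    idx = np.argsort(-probs, axis=1)[:, :n]          # most probable class first
    top_p = np.take_along_axis(probs, idx, axis=1)   # probabilities in the same order
    columns = {}
    for i in range(n):
        columns['{}_{}'.format(prefix, i)] = classes[idx[:, i]]
        columns['{}_{}_probability'.format(prefix, i)] = top_p[:, i]
    return pd.DataFrame(columns)

# Hypothetical stand-ins for model.classes_ and model.predict_proba(X_test).
classes = np.array(['animals', 'chemicals', 'elements', 'fruits'])
probs = np.array([[0.10, 0.05, 0.50, 0.35],
                  [0.60, 0.20, 0.15, 0.05]])

df = top_n_frame(probs, classes, n=2)
print(df)
```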