Get accuracy of guess
I'm currently trying to use this SO question to work out how pronounceable the words in a list are.
My code is as follows:
import random

def scramble(s):
    # Shuffle the letters of s to produce a (probably) unpronounceable string.
    return "".join(random.sample(s, len(s)))

# Keep only all-lowercase dictionary entries (skips proper nouns).
words = [w.strip() for w in open('/usr/share/dict/words') if w == w.lower()]
scrambled = [scramble(w) for w in words]

X = words+scrambled
y = ['word']*len(words) + ['unpronounceable']*len(scrambled)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Character 1- to 3-grams fed into a multinomial naive Bayes classifier.
text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', MultinomialNB())
])
text_clf = text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)

from sklearn import metrics
print(metrics.classification_report(y_test, predicted))
For a made-up word, it outputs this:
>>> text_clf.predict("scaroly".split())
['word']
I've been going through the scikit documentation, but I still can't seem to find out how to make it print the score for an input word.
Try sklearn.pipeline.Pipeline.predict_proba:
>>> text_clf.predict_proba(["scaroly"])
array([[ 5.87363027e-04, 9.99412637e-01]])
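It returns the probability that the given input (in this case "scaroly") belongs to each of the classes you trained the model on. The columns follow the order of text_clf.classes_, which scikit-learn sorts alphabetically: 'unpronounceable' first, then 'word'. So there is a 99.94% chance that "scaroly" is pronounceable.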
Conversely, the Welsh word for "new" is probably unpronounceable:
>>> text_clf.predict_proba(["newydd"])
array([[ 0.99666533, 0.00333467]])
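The raw arrays above are easier to read once each column is paired with its class label. The following is a minimal sketch of that pairing, not part of the original answer: label_scores is a hypothetical helper name, and the numbers are copied from the "scaroly" output above rather than computed, so the snippet runs without training the model.

```python
def label_scores(classes, proba_row):
    """Pair each class label with its probability, highest probability first.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    if len(classes) != len(proba_row):
        raise ValueError("expected one probability per class")
    pairs = ((str(c), float(p)) for c, p in zip(classes, proba_row))
    return sorted(pairs, key=lambda cp: cp[1], reverse=True)


# Assumption: this is the label order text_clf.classes_ reports, because
# scikit-learn sorts class labels alphabetically.
classes = ["unpronounceable", "word"]
# Probabilities copied from the "scaroly" output above.
scaroly_row = [5.87363027e-04, 9.99412637e-01]

scores = label_scores(classes, scaroly_row)
best_label, best_score = scores[0]
print(f"{best_label}: {best_score:.2%}")  # prints "word: 99.94%"
```

With the fitted pipeline from the question, the same helper could presumably be called as label_scores(text_clf.classes_, text_clf.predict_proba(["scaroly"])[0]), which avoids hard-coding the column order.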