How to predict data outside of the training data set
I use this module to predict a country name from an address:
import re
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
def normalize_text(s):
    # Lowercase, drop punctuation adjacent to whitespace, collapse repeated spaces
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    s = re.sub(r'\s+', ' ', s)
    return s
df['TEXT'] = [normalize_text(s) for s in df['Full_Address']]
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(df['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(df['CountryName'])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
accuracy_score(y_test, y_predicted)
I want to use the model I built to predict the country for a single address string. How do I do that?
I tried:
nb.predict('1100 112th Ave NE #400, Bellevue, WA 98004, United States')
ValueError: Expected 2D array, got scalar array instead:
array=1100 112th Ave NE #400, Bellevue, WA 98004, United States.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Update:
As suggested in an answer:
nb.predict([['1100 112th Ave NE #400, Bellevue, WA 98004, United States']])
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 82043 is different from 1)
Use:
nb.predict([['1100 112th Ave NE #400, Bellevue, WA 98004, United States']])
The predict method expects an array.
To predict, you need to pass the data through all the preprocessing steps you applied when training the model:
single_address = '1100 112th Ave NE #400, Bellevue, WA 98004, United States'
normalized_address = normalize_text(single_address)
vectorized_address = vectorizer.transform([normalized_address])
#expected output
nb.predict(vectorized_address)
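Note that nb.predict returns the integer code produced by LabelEncoder, not the country name. The sketch below is a self-contained illustration, not the asker's actual setup: the six addresses are hypothetical stand-ins for df['Full_Address'] and df['CountryName']. It shows that the vectorized input has the same number of columns the model was trained on (the mismatch behind the matmul error above) and decodes the prediction with encoder.inverse_transform.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for df['Full_Address'] and df['CountryName'].
addresses = [
    '1100 112th Ave NE #400, Bellevue, WA 98004, United States',
    '350 Fifth Avenue, New York, NY 10118, United States',
    '5 Avenue Anatole France, 75007 Paris, France',
    '1 Place de la Comedie, 69001 Lyon, France',
    'Pariser Platz 1, 10117 Berlin, Germany',
    'Marienplatz 8, 80331 Muenchen, Germany',
]
countries = ['United States', 'United States', 'France', 'France',
             'Germany', 'Germany']

# Fit the same three objects as in the question.
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(addresses)
encoder = LabelEncoder()
y = encoder.fit_transform(countries)
nb = MultinomialNB().fit(x, y)

# Reuse the *fitted* vectorizer: transform, never fit_transform, at predict time.
single_address = '400 Broadway, Seattle, WA 98122, United States'
vectorized = vectorizer.transform([single_address])

# One row, and as many columns as the model saw during training.
print(vectorized.shape, nb.feature_log_prob_.shape)

# predict returns the encoded label; inverse_transform maps it back to a name.
predicted_code = nb.predict(vectorized)
predicted_country = encoder.inverse_transform(predicted_code)[0]
print(predicted_country)
```

Passing the raw string (or a nested list containing it) skips the vectorizer, so the model receives one column instead of the full vocabulary width, which is what the size mismatch in the traceback reports.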
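Note 2 ways to improve your code:
1. normalize_text: this step is not really necessary, because everything it does is already handled by CountVectorizer's tokenizer regex token_pattern='(?u)\b\w\w+\b' and by lowercase=True.
2. Keep all preprocessing inside an sklearn Pipeline. Your code will be cleaner and less error-prone (you would certainly avoid a mistake like the one you made).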
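On the first point, a quick way to check the claim is to compare the tokens CountVectorizer produces for the raw address with those it produces for the normalized one. This is a minimal sketch that assumes the default CountVectorizer settings and the normalize_text function from the question. It checks this one example address and is not a proof for every possible input.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def normalize_text(s):
    # Same normalization as in the question.
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    s = re.sub(r'\s+', ' ', s)
    return s


address = '1100 112th Ave NE #400, Bellevue, WA 98004, United States'

# build_analyzer returns the callable that applies lowercasing and
# token_pattern; it can be used without fitting the vectorizer.
analyzer = CountVectorizer().build_analyzer()

tokens_raw = analyzer(address)
tokens_normalized = analyzer(normalize_text(address))

print(tokens_raw)
print(tokens_raw == tokens_normalized)
```

If the two token lists match on your real data as well, normalize_text can be dropped without changing the features the model sees.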
A working template for how to implement this:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
X = 30*['1100 112th Ave NE #400, Bellevue, WA 98004, United States']
y = 10*['US','France','Germany']
le = LabelEncoder()
y = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
vectorizer = CountVectorizer()
mnb = MultinomialNB()
ppl = Pipeline(steps=[('vectorizer',vectorizer),('mnb',mnb)])
ppl.fit(X_train, y_train)
single_address = '1100 112th Ave NE #400, Bellevue, WA 98004, United States'
ppl.predict([single_address])
An added benefit of having a Pipeline is that you can pass it to GridSearchCV to select the best parameters through cross-validation.
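As a rough sketch of that idea, the example below wraps the same two-step pipeline in GridSearchCV. The twelve addresses, the parameter grid and cv=2 are illustrative assumptions chosen so the example runs on tiny data. A real search would use your own data and a grid suited to it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four addresses per country.
X = [
    '1100 112th Ave NE #400, Bellevue, WA 98004, United States',
    '350 Fifth Avenue, New York, NY 10118, United States',
    '1600 Amphitheatre Parkway, Mountain View, CA 94043, United States',
    '233 S Wacker Dr, Chicago, IL 60606, United States',
    '5 Avenue Anatole France, 75007 Paris, France',
    '1 Place de la Comedie, 69001 Lyon, France',
    '12 Rue de la Republique, 13001 Marseille, France',
    '3 Place du Capitole, 31000 Toulouse, France',
    'Pariser Platz 1, 10117 Berlin, Germany',
    'Marienplatz 8, 80331 Muenchen, Germany',
    'Roemerberg 27, 60311 Frankfurt, Germany',
    'Rathausmarkt 1, 20095 Hamburg, Germany',
]
y = 4 * ['US'] + 4 * ['France'] + 4 * ['Germany']

ppl = Pipeline(steps=[('vectorizer', CountVectorizer()),
                      ('mnb', MultinomialNB())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'mnb__alpha': [0.1, 1.0],
}

# cv=2 only because the toy data set is tiny.
search = GridSearchCV(ppl, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)

# With the default refit=True, the search object predicts with the best
# pipeline, so raw strings can be passed in directly.
prediction = search.predict(['400 Broadway, Seattle, WA 98122, United States'])
print(prediction)
```

Because the vectorizer is tuned inside the pipeline, each cross-validation fold fits its vocabulary on its own training split only, which keeps the validation scores honest.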