How do I add a predicted-data column to my dataframe?
I am using naive Bayes to predict country names from a set of addresses. Here is what I tried:
import re
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
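def normalize_text(s):
    # lowercase, then strip punctuation that touches whitespace
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # collapse runs of whitespace into a single space
    s = re.sub(r'\s+', ' ', s)
    return s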
df['TEXT'] = [normalize_text(s) for s in df['Full_Address']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(df['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(df['CountryName'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
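What I want is to add another column to my dataframe containing the predicted country names. How can I do that?
Update: I tried the following, but the Predicted column contains integer codes rather than country names: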
df['Predicted'] = nb.predict(x)
         CountryName                                       Full_Address  \
8913       Indonesia  EJIP Industrial Park Plot 1E-2, Sukaresmi, Cik...
7870   United States    360 Thelma Street, Sandusky, Michigan 48471 USA
32037          China  1027, 26/F, Zhao Feng Mansion, Chang Ning Road...
38769    New Zealand  NZ - 164 ST. ASAPH STREET, \tCHRISTCHURCH 8011...
46639          India  301-306, Sahajanand Trade Center, Opp. Kothawa...

                                                    TEXT  Predicted
8913   ejip industrial park plot 1e-2 sukaresmi cikar...         66
7870       360 thelma street sandusky michigan 48471 usa        169
32037  1027 26/f zhao feng mansion chang ning road sh...         30
38769  nz 164 st asaph street christchurch 8011 new z...        112
46639  301-306 sahajanand trade center opp kothawala ...         65
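The model predicts the integer codes that encoder.fit_transform produced for y, so you should apply the inverse of that transform to the model's output. Something like: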
df['Predicted'] = encoder.inverse_transform(nb.predict(x))
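This assumes that the output of nb.predict(x) is a flat list of integers (rather than a list of lists). If it is not, you may need to do some reshaping first. Since I can't run your code without access to df, I can't really say.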
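Since I don't have your df, here is a minimal sketch of the round trip on a made-up four-row dataframe. The rows are hypothetical stand-ins for your data, and the model is fitted on all of them just to keep the example short. The variable names (vectorizer, encoder, nb, x, y) follow your code; codes is a name I added for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the real df (not from the question).
df = pd.DataFrame({
    "TEXT": [
        "12 main street springfield usa",
        "99 elm avenue boston usa",
        "5 nanjing road shanghai china",
        "88 chang ning road shanghai china",
    ],
    "CountryName": ["United States", "United States", "China", "China"],
})

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(df["TEXT"])

encoder = LabelEncoder()
y = encoder.fit_transform(df["CountryName"])  # country names -> integer codes

nb = MultinomialNB()
nb.fit(x, y)

codes = nb.predict(x)  # 1-D NumPy array of integer codes, one per row of x
df["Predicted"] = encoder.inverse_transform(codes)  # integer codes -> names

print(df[["CountryName", "Predicted"]])
```

Because x was built from every row of df, nb.predict(x) returns one code per row in the same order, which is why the result can be assigned directly as a column.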