Taking columns of different types as a training dataset
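Until now I have used only one column (string data) as my training set. I would like to include another corresponding column, Amount (float type), alongside the Details column as training data.
In the Amount column, negative values denote debits and positive values denote credits.
How should I go about this? I tried appending the two columns together, but that meant converting the float amounts to strings, which makes no sense for my dataset.
I want to include the Amount column to check whether the model can learn from its variation, which is very important in this case.
Thanks in advance.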
Details |Amount |Category
-------------------------------------------------------------
Tanishq Jwellery Bangalore |-990 |jwellery
ODESK***BAL-28APR13 |240 |Others
AEGON RELIGARE LIFE IN |456 |Others
INTERNET PAYMENT #999999 |-250 |Transfer in for Card Payment
WWW.VISTAPRINT.IN |245 |Print
Khazana Jwellery |-9000 |jwellery
INTERNET PAYMENT #999999 |785 |Transfer in for Card Payment
Indian Oil |344 |Fuel
Touch foot wear |-782 |Clothing
Part of my script:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
import time
import matplotlib.pyplot as plt
# TRAIN DATA
data = pd.read_csv('ds1.csv', delimiter=',', usecols=['Details', 'Amount', 'Category'], encoding='utf-8')
data = data[data.Category != "Others"]  # drop the catch-all "Others" class
target_one = data['Category']
target_list = data['Category'].unique()
# TEST DATASET
test_data = pd.read_csv('ds2.csv', delimiter='\t', usecols=['Details', 'Amount', 'Category'], encoding='utf-8')
x_train, y_train = data.Details, data.Category
x_test, y_test = test_data.Details, test_data.Category
# Unigram and bigram counts from the Details text
vect = CountVectorizer(ngram_range=(1, 2))
X_train = vect.fit_transform(x_train)
X_test = vect.transform(x_test)
start = time.perf_counter()  # time.clock() was removed in Python 3.8
mnb = MultinomialNB(alpha=0.13)
mnb.fit(X_train, y_train)
result = mnb.predict(X_test)
print(time.perf_counter() - start)
accuracy_score(y_test, result)  # signature is (y_true, y_pred)
If you just want to stack the "Amount" column onto the text feature matrix obtained with CountVectorizer, do this before fitting MultinomialNB:
import numpy as np
# Series.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
X_amount = data["Amount"].to_numpy().reshape(-1, 1)
X_train = X_train.toarray()  # convert the sparse count matrix to a dense array
X_train = np.hstack((X_train, X_amount))
X_test_amount = test_data["Amount"].to_numpy().reshape(-1, 1)
X_test = X_test.toarray()
X_test = np.hstack((X_test, X_test_amount))
Or, if you want to keep working with X_train as a sparse matrix:
from scipy import sparse  # import the submodule explicitly
X_amount = data["Amount"].to_numpy().reshape(-1, 1)
X_train = sparse.hstack((X_train, X_amount), format="csr")
X_test_amount = test_data["Amount"].to_numpy().reshape(-1, 1)
X_test = sparse.hstack((X_test, X_test_amount), format="csr")
However, I think you will end up with ValueError: Input X must be non-negative, because MultinomialNB is designed to work with non-negative feature values...
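One possible way around that error, offered only as a sketch: re-encode the signed amount as non-negative columns before stacking. The helper name amount_features and the particular encoding (a debit flag, a credit flag, and log1p of the absolute amount) are my own assumptions, not something from the question, and the toy rows are copied from the sample table above. Since MultinomialNB models count-like data, treating a log magnitude as a feature is a heuristic, and it may or may not help on the real dataset.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def amount_features(amount):
    """Hypothetical encoding: signed amounts -> three non-negative columns."""
    amount = np.asarray(amount, dtype=float)
    is_debit = (amount < 0).astype(float)    # 1.0 where the amount is negative
    is_credit = (amount >= 0).astype(float)  # 1.0 where the amount is zero or positive
    magnitude = np.log1p(np.abs(amount))     # compressed size, always >= 0
    return sparse.csr_matrix(np.column_stack([is_debit, is_credit, magnitude]))


# Toy training rows taken from the question's sample table
train = pd.DataFrame({
    "Details": ["Tanishq Jwellery Bangalore", "Khazana Jwellery",
                "Indian Oil", "Touch foot wear"],
    "Amount": [-990.0, -9000.0, 344.0, -782.0],
    "Category": ["jwellery", "jwellery", "Fuel", "Clothing"],
})

vect = CountVectorizer(ngram_range=(1, 2))
X_text = vect.fit_transform(train["Details"])
# Stack text counts and amount features, keeping everything sparse
X_train = sparse.hstack([X_text, amount_features(train["Amount"])], format="csr")

mnb = MultinomialNB(alpha=0.13)
mnb.fit(X_train, train["Category"])

# An unseen transaction (made-up amount) encoded the same way
new = pd.DataFrame({"Details": ["Khazana Jwellery"], "Amount": [-1200.0]})
X_new = sparse.hstack(
    [vect.transform(new["Details"]), amount_features(new["Amount"])],
    format="csr",
)
prediction = mnb.predict(X_new)
print(prediction)
```

If the raw signed value matters more than its sign and rough size, another option is to keep the stacked matrix as it is and switch to a classifier that accepts negative inputs, such as LogisticRegression, instead of MultinomialNB.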