Classification with one file used entirely for training and another file used entirely for testing
I am trying to do classification where one file is entirely training data and another file is entirely test data. Is this possible? I tried:
import pandas
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import cross_validation
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score
#csv file from train
df = pd.read_csv('data_train.csv', sep = ',')
#csv file from test
df_test = pd.read_csv('data_test.csv', sep = ',')
#Randomising the rows in the file
df = df.reindex(np.random.permutation(df.index))
df_test = df_test.reindex(np.random.permutation(df_test.index))
vect = CountVectorizer()
X = vect.fit_transform(df['data_train'])
y = df['label']
X_T = vect.fit_transform(df_test['data_test'])
y_t = df_test['label']
X_train, y_train = train_test_split(X, y, test_size = 0, random_state = 100)
X_test, y_test = train_test_split(X_T, y_t, test_size = 0, random_state = 100)
tf_transformer = TfidfTransformer(use_idf=False).fit(X)
X_train_tf = tf_transformer.transform(X)
X_train_tf.shape
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X)
X_train_tfidf.shape
tf_transformer = TfidfTransformer(use_idf=False).fit(X_T)
X_train_tf_teste = tf_transformer.transform(X_T)
X_train_tf_teste.shape
tfidf_transformer = TfidfTransformer()
X_train_tfidf_teste = tfidf_transformer.fit_transform(X_T)
X_train_tfidf_teste.shape
#RegLog
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("confusion matrix")
print(confusion_matrix(y_test, y_pred, labels = y))
print("F-score")
print(f1_score(y_test, y_pred, average=None))
print(precision_score(y_test, y_pred, average=None))
print(recall_score(y_test, y_pred, average=None))
print("cross validation")
scores = cross_validation.cross_val_score(clf, X, y, cv = 10)
print(scores)
print("Accuracy: {} +/- {}".format(scores.mean(), scores.std() * 2))
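I set test_size to zero because I don't want either of these files to be partitioned. I also applied Count and TF-IDF to both the training and the test file.
My error output: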
Traceback (most recent call last):
File "classif.py", line 34, in <module>
X_train, y_train = train_test_split(X, y, test_size = 0, random_state = 100)
ValueError: too many values to unpack (expected 2)
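First, for the error you are getting, write the calls as follows and it should work.
X_train, _, y_train, _ = train_test_split(X, y, test_size = 0, random_state = 100)
X_test, _, y_test, _ = train_test_split(X_T, y_t, test_size = 0, random_state = 100)
When given two arrays, train_test_split returns four values, in the order X_train, X_test, y_train, y_test, so you need four variables to receive them. Using _ just signals that you don't care about those outputs.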
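To make the unpacking concrete, here is a minimal sketch of what train_test_split hands back for two input arrays. The toy arrays and the 0.25 test fraction are assumptions chosen for illustration; they are not from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples with 2 features each.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Two input arrays produce four outputs, in this order:
# X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=100
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Unpacking into only two names is what raises "too many values to unpack (expected 2)".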
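Second, I really don't see why you are doing this operation at all. If you want to shuffle the data, this is not the best way to do it, and you have already shuffled it anyway.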
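The error you hit in train_test_split has already been clearly pointed out and solved by @Alexis. I would also advise against using train_test_split here, because it would do nothing beyond shuffling, which you have already done.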
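If shuffling is all that is needed, one possible sketch uses pandas directly, with no split step. The small DataFrame below is a hypothetical stand-in for data_train.csv; only the column names are taken from the question.

```python
import pandas as pd

# Hypothetical stand-in for the training file.
df = pd.DataFrame({
    "data_train": ["first text", "second text", "third text", "fourth text"],
    "label": [0, 1, 0, 1],
})

# sample(frac=1) returns all rows in random order; each text stays
# paired with its label because whole rows are moved together.
shuffled = df.sample(frac=1, random_state=100).reset_index(drop=True)
print(shuffled)
```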
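But I want to highlight another important point: if you keep the training and test files separate, do not fit the vectorizer on each of them separately. Doing so creates different columns for the training and test data. Example: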
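from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
train=['Hi this is stack overflow']
cv.fit(train)
# scikit-learn 1.2 and later replace this method with get_feature_names_out()
cv.get_feature_names()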
Output:
['hi', 'is', 'overflow', 'stack', 'this']
test=['Hi that is not stack overflow']
cv.fit(test)
cv.get_feature_names()
Output:
['hi', 'is', 'not', 'overflow', 'stack', 'that']
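So fitting them separately leads to mismatched columns. You should either merge the training and test files first and call fit_transform on the vectorizer with the combined data, or, if you don't have the test data in advance, transform the test data using only the vectorizer fitted on the training data, which will ignore words that do not appear in the training data.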
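The second option (fit on the training data, then only transform the test data) can be sketched as follows. The in-memory texts and labels are hypothetical placeholders for the two CSV files, and the accuracy call is just one possible metric.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins for the contents of data_train.csv and data_test.csv.
train_texts = [
    "good movie great plot",
    "great acting good fun",
    "bad movie boring plot",
    "boring acting bad script",
]
train_labels = [1, 1, 0, 0]
test_texts = ["good fun plot", "bad boring unknownword"]
test_labels = [1, 0]

vect = CountVectorizer()
# Learn the vocabulary from the training texts only.
X_train = vect.fit_transform(train_texts)
# Reuse that vocabulary for the test texts; unseen words are dropped.
X_test = vect.transform(test_texts)

# Both matrices now have the same number of columns.
print(X_train.shape[1] == X_test.shape[1])

clf = LogisticRegression().fit(X_train, train_labels)
y_pred = clf.predict(X_test)
print(accuracy_score(test_labels, y_pred))
```

The same rule applies to TfidfTransformer or TfidfVectorizer: call fit (or fit_transform) on the training data and only transform on the test data.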