Machine Learning: predict on a second dataset with a classifier trained on the first dataset
I am new to machine learning and trying to implement this, but it is not clear to me. I have been attempting it for 2 months, so please help me fix my mistake.
What I am actually trying to do:
- Train an SVM classifier on TRAIN_features and TRAIN_labels extracted from TRAIN_dataset, with shape (98962,) and size 98962.
- Test the SVM classifier on TEST_features extracted from another dataset, TEST_dataset, which has the same shape (98962,) and size 98962 as TRAIN_dataset.
After preprocessing TRAIN_features and TEST_features, I vectorized both with TfidfVectorizer. Then I checked the shape and size of both again:
vectorizer = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TRAIN_features = vectorizer.fit_transform(processed_TRAIN_features)
The size of processed_TRAIN_features becomes 1032665 and its shape becomes (98962, 9434).
vectorizer1 = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TEST_features = vectorizer1.fit_transform(processed_TEST_features)
The size of processed_TEST_features becomes 1457961 and its shape becomes (98962, 10782).
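For example, I can reproduce the same mismatch with two small toy corpora (a sketch, not my real data; the min_df/max_df settings are left at their defaults here because the corpora are tiny): fitting a separate TfidfVectorizer on each corpus learns a different vocabulary, so the column counts differ, while calling transform with the vectorizer fitted on the training corpus keeps the columns aligned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat", "the dog ran", "a cat ran"]
test_docs = ["the bird flew away", "a dog sat"]

# Fitting a separate vectorizer on each corpus learns a different vocabulary,
# so the matrices end up with different numbers of columns.
v_train = TfidfVectorizer()
X_train = v_train.fit_transform(train_docs)      # shape (3, 5)
v_test = TfidfVectorizer()
X_test_wrong = v_test.fit_transform(test_docs)   # shape (2, 6)

# Reusing the vectorizer fitted on the training corpus keeps columns aligned.
X_test_right = v_train.transform(test_docs)      # shape (2, 5)
```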
I know that when I train the SVM classifier on processed_TRAIN_features and then predict on processed_TEST_features with the same classifier, it will raise an error, because the shape and size of the two feature matrices have become different.
I think the only way to solve this is to reshape one of the sparse matrices (numpy.float64), either processed_TEST_features or processed_TRAIN_features. Reshaping processed_TRAIN_features seems possible because its size is smaller than that of processed_TEST_features. Or is there another way to achieve points (1, 2) above? I have not been able to implement this for my problem, and I am still looking for how to make it equal to processed_TEST_features with respect to shape and size.
If any of you could help me with this, thanks in advance.
The complete code is below:
import re
import pandas as pd
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

DataPath2 = ".../train.csv"
TRAIN_dataset = pd.read_csv(DataPath2)
DataPath1 = "..../completeDATAset.csv"
TEST_dataset = pd.read_csv(DataPath1)
TRAIN_features = TRAIN_dataset.iloc[:, 1].values
TRAIN_labels = TRAIN_dataset.iloc[:, 0].values
TEST_features = TEST_dataset.iloc[:, 1].values
TEST_labeels = TEST_dataset.iloc[:, 0].values
lab_enc = preprocessing.LabelEncoder()
TEST_labels = lab_enc.fit_transform(TEST_labeels)
processed_TRAIN_features = []
for sentence in range(0, len(TRAIN_features)):
    # Remove all the special characters
    processed_feature = re.sub(r'\W', ' ', str(TRAIN_features[sentence]))
    # Remove all single characters
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
    # Remove special symbols
    processed_feature = re.sub(r'\s+[xe2 x80 xa6]\s+', ' ', processed_feature)
    # Remove special symbols
    processed_feature = re.sub(r'\s+[xe2 x80 x98]\s+', ' ', processed_feature)
    # Remove special symbols
    processed_feature = re.sub(r'\s+[xe2 x80 x99]\s+', ' ', processed_feature)
    # Remove single characters from the start
    processed_feature = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature)
    # Substitute multiple spaces with a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
    # Remove links
    processed_feature = re.sub(r"http\S+", "", processed_feature)
    # Remove prefixed 'b'
    processed_feature = re.sub(r'^b\s+', '', processed_feature)
    # Remove rt
    processed_feature = re.sub(r'^rt\s+', '', processed_feature)
    # Convert to lowercase
    processed_feature = processed_feature.lower()
    processed_TRAIN_features.append(processed_feature)
vectorizer = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf=True, use_idf=True)
processed_TRAIN_features = vectorizer.fit_transform(processed_TRAIN_features)
processed_TEST_features = []
for sentence in range(0, len(TEST_features)):
    # Remove all the special characters
    processed_feature1 = re.sub(r'\W', ' ', str(TEST_features[sentence]))
    # Remove all single characters
    processed_feature1 = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature1)
    # Remove special symbols
    processed_feature1 = re.sub(r'\s+[xe2 x80 xa6]\s+', ' ', processed_feature1)
    # Remove special symbols
    processed_feature1 = re.sub(r'\s+[xe2 x80 x98]\s+', ' ', processed_feature1)
    # Remove special symbols
    processed_feature1 = re.sub(r'\s+[xe2 x80 x99]\s+', ' ', processed_feature1)
    # Remove single characters from the start
    processed_feature1 = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature1)
    # Substitute multiple spaces with a single space
    processed_feature1 = re.sub(r'\s+', ' ', processed_feature1, flags=re.I)
    # Remove links
    processed_feature1 = re.sub(r"http\S+", "", processed_feature1)
    # Remove prefixed 'b'
    processed_feature1 = re.sub(r'^b\s+', '', processed_feature1)
    # Remove rt
    processed_feature1 = re.sub(r'^rt\s+', '', processed_feature1)
    # Convert to lowercase
    processed_feature1 = processed_feature1.lower()
    processed_TEST_features.append(processed_feature1)
vectorizer1 = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf=True, use_idf=True)
processed_TEST_features = vectorizer1.fit_transform(processed_TEST_features)
X_train_data, X_test_data, y_train_data, y_test_data = train_test_split(
    processed_TRAIN_features, TRAIN_labels, test_size=0.3, random_state=0)
text_classifier = svm.SVC(kernel='linear', class_weight="balanced", probability=True, C=1, random_state=0)
text_classifier.fit(X_train_data, y_train_data)
text_classifier.predict(processed_TEST_features)
Edit (title changed from "Predict dataset classification" to "Predict dataset"): this is the reshape I have been trying:

processed_TRAIN_features = csr_matrix((processed_TRAIN_features), shape=(new row length, new column length))
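For the csr_matrix line above: a sparse matrix can at least be widened by stacking zero columns onto it (a toy sketch with made-up shapes, using scipy.sparse.hstack), but even then column j of the train matrix and column j of the test matrix would correspond to different vocabulary terms, so I am not sure the padded features would mean the same thing to the classifier.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

X_small = csr_matrix(np.ones((4, 3)))          # stand-in for the narrower TF-IDF matrix
padding = csr_matrix((4, 2))                   # all-zero columns to make the widths match
X_padded = hstack([X_small, padding]).tocsr()  # shape becomes (4, 5)
```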