Movie Ratings prediction using TF-IDF

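I have a dataset in the following format:

Movie_Name, TomatoCritics, Target_Variable

Here, the TomatoCritics attribute contains free-text reviews from different users for different movies. Target_Variable is a binary value (0 or 1) indicating whether the movie should be watched.

I am using TF-IDF for this, and my code is below: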

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer


# Read textual training data-
text_training = pd.read_csv("Textual-Training_Data.csv")

# Read textual testing data-
text_testing = pd.read_csv("Textual-Testing_Data.csv")

# Get dimensions of training data-
text_training.shape
# (95, 3)

# Get dimensions of testing data-
text_testing.shape
# (224, 3)


# Check for missing values in training data-
text_training.isnull().values.any()
# True

# Check for missing values in testing data-
text_testing.isnull().values.any()
# True

# Remove any row having missing value from training data-
text_training_nona = text_training.dropna(axis = 0, how='any')

# Remove any row having missing value from testing data-
text_testing_nona = text_testing.dropna(axis = 0, how = 'any')

# Get dimensions of training data AFTER removing empty rows-
text_training_nona.shape
# (73, 3)

# Get dimensions of testing data AFTER removing empty rows-
text_testing_nona.shape
# (158, 3)


# Attributes to use for training and testing sets for ML-
cols_train = ['tomatoConsensus', 'goodforairplanes']
cols_test = ['tomatoConsensus', 'goodforairplanes']



# Split training dataset into features (X) and label (y) for training-
X_train = text_training_nona['tomatoConsensus']
y_train = text_training_nona['goodforairplanes']


# Split training dataset into features (X) and label (y) for testing-
X_test = text_testing_nona["tomatoConsensus"]
y_test = text_testing_nona['goodforairplanes']




# Initialize Count Vectorizer using TF-IDF ->
cv = TfidfVectorizer(min_df = 1, stop_words='english')

# Convert text to TF-IDF ->
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.fit_transform(X_test)

# Multinomial Naive Bayes classifier-
mnb = MultinomialNB()

# Train model on training data-
mnb.fit(X_train_cv, y_train)

print(X_test_cv[0])
'''
(0, 1168)   0.20066499253877468
  (0, 31)   0.2419027475877309
  (0, 1090) 0.22790133982975397
  (0, 5)    0.2616366234663056
  (0, 877)  0.2616366234663056
  (0, 1279) 0.2419027475877309
  (0, 850)  0.1786670002268731
  (0, 1341) 0.2616366234663056
  (0, 2)    0.2616366234663056
  (0, 695)  0.2616366234663056
  (0, 1221) 0.2419027475877309
  (0, 884)  0.1786670002268731
  (0, 1070) 0.2616366234663056
  (0, 782)  0.2616366234663056
  (0, 252)  0.20066499253877468
  (0, 1259) 0.2419027475877309
  (0, 1093) 0.20816746395117927
  (0, 122)  0.2170410042381541
'''

y_pred = mnb.predict(X_test_cv[0])

The last line, the call to mnb.predict(), raises this error:

ValueError: dimension mismatch

What is going wrong?

Thanks!

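You should call fit_transform only once, on the training data, and then use the already fitted cv object to transform the test data. Change

X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.fit_transform(X_test)

to

X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

That should solve your problem.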

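If you call fit_transform again on additional data, that data may contain a different set of unique words, so the vectorizer builds a vocabulary of a different size. The mnb model was trained on features from one vocabulary, while the re-fitted test matrix has a different number of columns; that is what causes ValueError: dimension mismatch.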
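The difference can be reproduced on a small scale. The sketch below uses a toy corpus invented purely for illustration (it is not the asker's data) and compares a test matrix built with transform against one built by fitting a second vectorizer on the test texts alone.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, made up for this example only.
train_texts = [
    "a funny and charming comedy",
    "dull plot and weak acting",
    "charming cast, clever script",
    "weak script, dull pacing",
]
train_labels = [1, 0, 1, 0]
test_texts = ["clever and funny", "dull and forgettable sequel"]

cv = TfidfVectorizer(stop_words="english")
X_train_cv = cv.fit_transform(train_texts)  # learns the vocabulary from training text
X_test_cv = cv.transform(test_texts)        # reuses that vocabulary, same column count

# A vectorizer fitted on the test texts alone learns its own, different vocabulary.
refit = TfidfVectorizer(stop_words="english")
X_test_refit = refit.fit_transform(test_texts)

# Compare the column counts of the three matrices.
print(X_train_cv.shape, X_test_cv.shape, X_test_refit.shape)

mnb = MultinomialNB()
mnb.fit(X_train_cv, train_labels)
y_pred = mnb.predict(X_test_cv)  # works: columns match what the model was trained on
print(y_pred)
```

Passing X_test_refit to mnb.predict instead would be expected to fail, because its columns correspond to a different vocabulary than the one the model saw during training.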

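Edit
Just check the shapes of X_test_cv and X_train_cv in both cases: if you call fit_transform on both X_train and X_test, the two matrices will have different numbers of columns, but if you replace the second fit_transform with transform, the column counts will match.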
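One way to make this mistake harder to repeat is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that fitting happens only on the training data and prediction only ever transforms. This is a minimal sketch of that alternative, not what the original code does, and the texts and labels are again invented placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder data standing in for the review text and the 0/1 target.
train_texts = [
    "a funny and charming comedy",
    "dull plot and weak acting",
    "charming cast, clever script",
    "weak script, dull pacing",
]
train_labels = [1, 0, 1, 0]
test_texts = ["clever and funny", "dull and forgettable sequel"]

# fit() calls fit_transform on the vectorizer, then fits the classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

# predict() only calls transform on the vectorizer, so the vocabulary stays fixed.
y_pred = model.predict(test_texts)
print(y_pred)
```

With the asker's variables, the equivalent calls would be model.fit(X_train, y_train) followed by model.predict(X_test), passing the raw text columns rather than pre-vectorized matrices.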