Issue with using `transform` vs. `fit_transform` in CountVectorizer

I have successfully trained and tested a logistic regression model using CountVectorizer():

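from sklearn import linear_model, metrics, model_selection
from sklearn.feature_extraction.text import CountVectorizer

def train_model(classifier, feature_vector_train, label):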
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    return classifier

def getPredictions(classifier, feature_vector_valid):
    # predict the labels on validation dataset
    predict = classifier.predict(feature_vector_valid)

    # accuracy_score expects (y_true, y_pred); valid_y is the global set in createTrainingAndValidation
    return metrics.accuracy_score(valid_y, predict)

def createTrainingAndValidation(column):
    global train_x, valid_x, train_y, valid_y
    train_x, valid_x, train_y, valid_y = model_selection.train_test_split(finalDF[column], finalDF['DeedType1'])

def createCountVectorizer(column):
    global xtrain_count, xvalid_count
    # create a count vectorizer object 
    count_vect = CountVectorizer()
    count_vect.fit(finalDF[column])

    # transform the training and validation data using count vectorizer object
    xtrain_count =  count_vect.transform(train_x)
    xvalid_count =  count_vect.transform(valid_x)

createTrainingAndValidation('Test')
createCountVectorizer('Test')
classifier = train_model(linear_model.LogisticRegression(), xtrain_count, train_y)
predictions = getPredictions(classifier, xvalid_count)

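I am using a DataFrame called finalDF that contains all of the labelled text. Since this model reaches an accuracy of 0.68, I want to test it on a subset of a DataFrame with unknown labels, which was not part of the training and testing phase. I saved the trained model as bestClassifier.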

Now I take the subset of unknown text and try the following:

count_vect = CountVectorizer()
count_vect.fit(unknownDf['Text'])
text = unknownDf['Text']
xvalid_count =  count_vect.transform(text)

bestClassifier.predict(xvalid_count)

finalDF has 800 rows, while unknownDf has only 32 rows after my operations above. How can I correct this?

I think I understand what is happening. In this code:

def createCountVectorizer(column):
    global xtrain_count, xvalid_count
    # create a count vectorizer object 
    count_vect = CountVectorizer()
    count_vect.fit(finalDF[column])

    # transform the training and validation data using count vectorizer object
    xtrain_count =  count_vect.transform(train_x)
    xvalid_count =  count_vect.transform(valid_x)

you declare a CountVectorizer(), call fit, and then call transform. What you need to do is use that same CountVectorizer() to transform unknownDf['Text'].
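To illustrate the difference between fitting and transforming, here is a minimal sketch. The toy strings stand in for the real data and are made up for this example; only CountVectorizer and its fit_transform and transform methods come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the labelled and the unlabelled text.
train_text = ["warranty deed of trust", "quit claim deed", "deed of trust"]
unknown_text = ["special warranty deed", "release of lien"]

count_vect = CountVectorizer()

# fit_transform learns the vocabulary AND encodes the training text.
xtrain_count = count_vect.fit_transform(train_text)

# transform only encodes: it reuses the vocabulary learned above.
# Words never seen during fit ("special", "release", "lien") are ignored.
xunknown_count = count_vect.transform(unknown_text)

# Both matrices have one column per word in the training vocabulary.
print(xtrain_count.shape, xunknown_count.shape)
```

Because both matrices share the same columns, a classifier trained on the first can score the second.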

When you do this:

count_vect = CountVectorizer()
count_vect.fit(unknownDf['Text'])
text = unknownDf['Text']
xvalid_count =  count_vect.transform(text)

you are creating a brand-new CountVectorizer() and building a new bag of words for unknownDf['Text']. What you should do instead is remove these two lines:

count_vect = CountVectorizer()
count_vect.fit(unknownDf['Text'])

and have the existing CountVectorizer(), the one that was fitted on finalDF[column], transform unknownDf['Text'].
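A small sketch of why the second vectorizer breaks prediction, again on made-up toy data (the strings, the labels and the names train_vect, new_vect and clf are all assumptions for illustration): the new vocabulary has a different size, so the classifier receives a matrix with the wrong number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, not the asker's DataFrames.
train_text = ["warranty deed of trust", "quit claim deed", "deed of trust", "quit claim"]
train_y = ["trust", "quitclaim", "trust", "quitclaim"]
unknown_text = ["special warranty deed", "release"]

train_vect = CountVectorizer()
clf = LogisticRegression().fit(train_vect.fit_transform(train_text), train_y)

# A second vectorizer fitted on the unknown text learns its own vocabulary.
new_vect = CountVectorizer()
x_wrong = new_vect.fit_transform(unknown_text)
print(len(train_vect.vocabulary_), len(new_vect.vocabulary_))

# The classifier was trained on the first vocabulary, so predict rejects x_wrong.
try:
    clf.predict(x_wrong)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(mismatch_error)
```

Even if the two vocabularies happened to have the same size, the columns would stand for different words, so the predictions would be meaningless.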

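Find a way to use the CountVectorizer() declared as count_vect in createCountVectorizer(column) to transform unknownDf['Text']. Since count_vect is local to that function, this means returning it or otherwise keeping a reference to it.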
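One possible way to do this, sketched under assumptions: the helper create_count_vectorizer, the toy lists and the name best_classifier are hypothetical stand-ins for createCountVectorizer, finalDF, unknownDf and bestClassifier. The helper returns the fitted vectorizer so the caller can reuse it on the unknown text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def create_count_vectorizer(texts):
    """Fit a CountVectorizer on texts and return it (hypothetical helper)."""
    count_vect = CountVectorizer()
    count_vect.fit(texts)
    return count_vect


# Hypothetical toy data standing in for finalDF and unknownDf.
labelled_text = ["warranty deed of trust", "quit claim deed", "deed of trust", "quit claim"]
labels = ["trust", "quitclaim", "trust", "quitclaim"]
unknown_text = ["special warranty deed", "release of claim"]

# Fit once on the labelled text and keep the fitted object around.
count_vect = create_count_vectorizer(labelled_text)
best_classifier = LogisticRegression().fit(count_vect.transform(labelled_text), labels)

# Reuse the SAME fitted vectorizer: transform only, no second fit.
xunknown_count = count_vect.transform(unknown_text)
predictions = best_classifier.predict(xunknown_count)
print(predictions)
```

If the model is saved and loaded in a later session, the fitted vectorizer has to be saved with it. Bundling both steps with sklearn.pipeline.make_pipeline is another option, since the pipeline then applies the fitted vectorizer automatically inside predict.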