How to apply KFold with TfidfVectorizer?
I am having trouble applying K-fold cross-validation with Tfidf. It gives me this error:
ValueError: setting an array element with a sequence.
I have seen other questions about the same error, but they use train_test_split(), which is a bit different from K-fold.
for train_fold, valid_fold in kf.split(reviews_p1):
    vec = TfidfVectorizer(ngram_range=(1,1))
    reviews_p1 = vec.fit_transform(reviews_p1)
    train_x = [reviews_p1[i] for i in train_fold] # Extract train data with train indices
    train_y = [labels_p1[i] for i in train_fold] # Extract train data with train indices
    valid_x = [reviews_p1[i] for i in valid_fold] # Extract valid data with cv indices
    valid_y = [labels_p1[i] for i in valid_fold] # Extract valid data with cv indices
    svc = LinearSVC()
    model = svc.fit(X = train_x, y = train_y) # We fit the model with the fold train data
    y_pred = model.predict(valid_x)
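I have actually found where the problem is, but I cannot find a way to fix it. Basically, when we extract the train data with the cv/train indices, we get a list of sparse matrices: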
[<1x21185 sparse matrix of type '<class 'numpy.float64'>'
with 54 stored elements in Compressed Sparse Row format>,
<1x21185 sparse matrix of type '<class 'numpy.float64'>'
with 47 stored elements in Compressed Sparse Row format>,
<1x21185 sparse matrix of type '<class 'numpy.float64'>'
with 18 stored elements in Compressed Sparse Row format>, ....]
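I tried applying Tfidf to the data after splitting it, but that did not work because the number of features differs between the splits.
So is there a way to split the data into K folds without creating a list of sparse matrices?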
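One possible way to avoid the list of 1-row matrices, given here only as a minimal sketch: index the sparse matrix with the whole index array at once instead of row by row, and vectorize once outside the loop into a separate variable. The toy `reviews` and `labels` below are made up for illustration and stand in for `reviews_p1` and `labels_p1`. Note that vectorizing before the split lets the validation documents influence the vocabulary and IDF weights.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

# Hypothetical toy data (assumption), standing in for reviews_p1 / labels_p1.
reviews = ["good movie", "bad movie", "great film", "terrible film",
           "good film", "bad plot", "great plot", "terrible movie"]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Vectorize once, outside the loop, without overwriting the raw texts.
X = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(reviews)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
fold_shapes = []
for train_fold, valid_fold in kf.split(X):
    # Indexing a CSR matrix with an index array returns a single sparse
    # matrix, not a Python list of 1-row matrices.
    train_x, valid_x = X[train_fold], X[valid_fold]
    train_y, valid_y = labels[train_fold], labels[valid_fold]

    model = LinearSVC().fit(train_x, train_y)
    y_pred = model.predict(valid_x)
    fold_shapes.append((train_x.shape, valid_x.shape, y_pred.shape))
```

Because every fold is sliced from the same matrix, the number of features is identical across folds.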
In an answer to a similar question, "Do I use the same Tfidf vocabulary in k-fold cross_validation", they suggest:
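from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# data_x (raw texts) and data_y (labels) are NumPy arrays, so they can be indexed with index arrays
for train_index, test_index in kf.split(data_x, data_y):
    x_train, x_test = data_x[train_index], data_x[test_index]
    y_train, y_test = data_y[train_index], data_y[test_index]
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)  # learn vocabulary and IDF on the train fold only
    x_test = tfidf.transform(x_test)  # reuse the train vocabulary, so the feature count matches
    clf = SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    print(score)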
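The per-fold fitting in the loop above could also be written with a scikit-learn Pipeline, which refits the vectorizer on each training fold automatically. This is only a sketch: the toy `texts` and `labels` are invented for illustration, and `LinearSVC` is used here in place of `SVC` to match the classifier from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data (assumption): raw texts and binary labels.
texts = ["good movie", "bad movie", "great film", "terrible film",
         "good film", "bad plot", "great plot", "terrible movie"]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# The pipeline takes raw text; TF-IDF is fitted inside each training fold.
pipe = make_pipeline(TfidfVectorizer(), LinearSVC())
cv = KFold(n_splits=4, shuffle=True, random_state=0)

# One accuracy value per fold.
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
print(scores)
```

Passing the raw texts rather than a precomputed matrix is what keeps the validation fold out of the vocabulary and IDF statistics.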