Cross-validation for paragraph-vector model
I just ran into an error while trying to apply cross-validation to a paragraph-vector model:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# gensim.sklearn_api is only available in gensim < 4.0
from gensim.sklearn_api import D2VTransformer
from gensim.utils import simple_preprocess
data = pd.read_csv('https://pastebin.com/raw/bSGWiBfs')
np.random.seed(0)
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
y_train = data.label
model = D2VTransformer(size=10, min_count=1, iter=5, seed=1)
clf = LogisticRegression(random_state=0)
pipeline = Pipeline([
    ('vec', model),
    ('clf', clf)
])
pipeline.fit(X_train, y_train)
score = pipeline.score(X_train, y_train)
print("Score:", score) # This works
cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=3)
print("Cross-Validation:", cval) # This doesn't work
KeyError: 0
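I tried replacing X_train in cross_val_score with model.transform(X_train) or model.fit_transform(X_train). I also tried the same with the raw input data (data.text) instead of the preprocessed text. I suspect something is wrong with the format of X_train for cross-validation, as opposed to the Pipeline's .score function, which works fine. I also noticed that cross_val_score works with CountVectorizer().
Does anyone spot the mistake?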
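No, this has nothing to do with the transformation done by model. It has to do with cross_val_score.
cross_val_score splits the supplied data according to the cv parameter. To do so, it does something like this: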
for train, test in splitter.split(X_train, y_train):
    new_X_train, new_y_train = X_train[train], y_train[train]
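But your X_train is a pandas.Series object, on which integer selection with [] goes by index label rather than by position, so it does not behave like list or array indexing. See: https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position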
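To make the label-versus-position point concrete, here is a minimal sketch with made-up token lists. The fold is built by hand with iloc, which is an assumption for illustration and not a trace of what scikit-learn does internally. It shows how a subset of a Series can lose the label 0 and then raise KeyError on a lookup that a list would handle without trouble.

```python
import pandas as pd

# Made-up token lists, similar in shape to X_train in the question.
docs = pd.Series([["good", "movie"], ["bad", "film"],
                  ["great", "plot"], ["poor", "cast"]])

# Simulate a training fold that leaves out the first row.
fold = docs.iloc[[1, 2, 3]]

# The fold keeps the original index labels 1, 2, 3, so label 0 is gone.
try:
    fold[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(lookup_failed)      # True: [] looked for the label 0, not position 0
print(fold.iloc[0])       # ['bad', 'film']: positional access still works
print(fold.tolist()[0])   # ['bad', 'film']: a list is always positional
```

Any code that asks such a subset for element 0 with plain [] will hit the same KeyError: 0 as in the question.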
Change this line:
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
to:
# Access the internal numpy array
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1).values
OR
# Convert series to list
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1).tolist()
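As a rough check of why the conversion helps, the sketch below slices a .values array with fold indices from KFold. The token lists and labels are invented, and KFold with n_splits=3 is only a stand-in for whatever splitter cross_val_score picks for cv=3. After the conversion, position 0 exists in every training fold regardless of which rows were held out.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Made-up token lists and labels standing in for the question's data.
tokens = pd.Series([["good", "movie"], ["bad", "film"], ["great", "plot"],
                    ["poor", "cast"], ["fine", "score"], ["dull", "story"]])
labels = np.array([1, 0, 1, 0, 1, 0])

# 1-D object array whose elements are the token lists.
X = tokens.values

first_docs = []
for train_idx, test_idx in KFold(n_splits=3).split(X, labels):
    X_fold, y_fold = X[train_idx], labels[train_idx]
    # Positional access: element 0 is simply the first row of the fold.
    first_docs.append(X_fold[0])

print(first_docs)
```

A plain list from .tolist() cannot be indexed with an index array directly, but scikit-learn's own splitting accepts lists, so either form can be passed to cross_val_score.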