Workflow of NLP
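When should I preprocess text data and build the feature matrix in NLP: before or after train_test_split? Below is my sample code, where preprocessing and matrix creation (tfidf) are done before train_test_split. I would like to know whether this causes data leakage.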
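import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

stemmer = PorterStemmer()  # assumed; the original does not show which stemmer is used
stop_words = set(stopwords.words('english'))  # build the set once, not per token

# Clean each document: keep letters only, lowercase, drop stopwords, stem
corpus = []
for i in range(len(data1)):
    review = re.sub('[^a-zA-Z]', ' ', data1['features'][i])
    review = review.lower().split()
    review = [stemmer.stem(j) for j in review if j not in stop_words]
    corpus.append(' '.join(review))

# TF-IDF is fitted on the whole corpus here, i.e. before the split
cv = TfidfVectorizer(max_features=6000)
x = cv.fit_transform(corpus).toarray()

le = LabelEncoder()
y = le.fit_transform(data1['label'])

train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.2, random_state=69, stratify=y)

spam_model = MultinomialNB().fit(train_x, train_y)
pred = spam_model.predict(test_x)
c_matrix = confusion_matrix(test_y, pred)
acc_score = accuracy_score(test_y, pred)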
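As stated in the official documentation for the TfidfVectorizer class and its max_features parameter, only the k-best features are kept:
max_features : int, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
If you fit the class on data that includes the test set, the test documents help it select these features more effectively, and that is data leakage (this scenario is based on your question, but the same thing can be seen in most cases!). The IDF weights are likewise computed from whatever corpus the vectorizer is fitted on.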
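To make the vocabulary effect concrete, here is a minimal sketch using only the standard library. It is a simplified stand-in for what max_features does, not scikit-learn's actual implementation: the helper top_k_vocabulary, the toy documents, and the alphabetical tie-breaking are all assumptions made for illustration.

```python
from collections import Counter


def top_k_vocabulary(docs, k):
    """Return the k most frequent tokens across docs (ties broken alphabetically)."""
    counts = Counter(token for doc in docs for token in doc.split())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:k]]


# Hypothetical toy documents, already lowercased and cleaned
train_docs = ["free prize now", "meeting at noon", "free lunch at noon"]
test_docs = ["claim prize now", "prize prize winner"]

# Vocabulary chosen from the training documents only
vocab_train_only = top_k_vocabulary(train_docs, k=3)

# Vocabulary chosen after the test documents have been mixed in
vocab_with_test = top_k_vocabulary(train_docs + test_docs, k=3)

print(vocab_train_only)
print(vocab_with_test)
```

In this toy case the word "prize" only makes it into the vocabulary because the test documents were counted, so the features the model sees already reflect the test set.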
The safest approach in machine learning is to ignore the test set until prediction/evaluation, as if it did not exist!
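One way to follow that advice with the code from the question is to split the raw text first and let a scikit-learn Pipeline fit the vectorizer on the training portion only. This is a sketch under assumptions rather than a definitive fix: the texts and labels lists are hypothetical toy data standing in for the cleaned corpus and data1['label'], and test_size is changed to 0.25 only so the tiny example splits evenly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned corpus and its labels
texts = [
    "free prize claim now", "win cash prize today",
    "urgent free offer inside", "claim your free reward",
    "meeting moved to noon", "lunch with the team today",
    "see the attached report", "call me after the meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text BEFORE any vectorizer has seen it
train_texts, test_texts, train_y, test_y = train_test_split(
    texts, labels, test_size=0.25, random_state=69, stratify=labels)

# fit() learns the vocabulary and IDF weights from the training text only;
# predict() merely transforms the test text with what was already learned
model = make_pipeline(TfidfVectorizer(max_features=6000), MultinomialNB())
model.fit(train_texts, train_y)

pred = model.predict(test_texts)
acc_score = accuracy_score(test_y, pred)
```

Because the vectorizer lives inside the pipeline, the same object can be passed to cross_val_score and the vectorizer will be refitted on each training fold, which keeps the validation folds out of the vocabulary as well.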
[Update]
You can see an example from Kaggle here that uses the vectorizer on a pre-split dataset!
More on this concept is mentioned here and here!