Building n-grams for token-level text classification
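I am trying to classify multi-class data at the token level using scikit-learn. I already have a train and test split. The tokens occur in batches of the same class, e.g. the first 10 tokens belong to class0, the next 20 tokens belong to class4, and so on.
The data is in the following tab-separated (\t) format: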
-----------------
token tag
-----------------
way 6
to 6
reduce 6
the 6
amount 6
of 6
traffic 6
....
public 2
transport 5
is 5
a 5
key 5
factor 5
to 5
minimize 5
....
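To make the "batches of the same class" structure concrete, here is a minimal sketch that parses a few rows in this format and groups consecutive tokens that share a tag. The sample rows are taken from the excerpt above with the elided rows skipped, and the helper names `read_rows` and `group_runs` are hypothetical, not part of the original code.

```python
import csv
import io
from itertools import groupby

# A few rows in the same token<TAB>tag layout as the excerpt above.
SAMPLE = "way\t6\nto\t6\nreduce\t6\npublic\t2\ntransport\t5\nis\t5\n"


def read_rows(text):
    """Parse tab-separated 'token<TAB>tag' lines into (token, tag) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(token, int(tag)) for token, tag in reader]


def group_runs(rows):
    """Group consecutive rows with the same tag into (tag, [tokens]) runs."""
    return [
        (tag, [token for token, _ in run])
        for tag, run in groupby(rows, key=lambda row: row[1])
    ]


rows = read_rows(SAMPLE)
runs = group_runs(rows)
for tag, tokens in runs:
    print(tag, tokens)
```

Each run keeps the original token order, which matters later if neighbouring tokens are used as context.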
The data distribution is as follows:
             Training Data   Test Data
# Total:            119490       29699
# Class 0:           52631       13490
# Class 1:           35116        8625
# Class 2:           17968        4161
# Class 3:            8658        2088
# Class 4:            3002         800
# Class 5:            1201         302
# Class 6:             592         153
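As a rough illustration of how skewed this distribution is, the sketch below computes each class's share of the training tokens and its size relative to the largest class. The counts are copied from the table above; the variable names are only illustrative.

```python
# Training counts per class, copied from the distribution table.
train_counts = {0: 52631, 1: 35116, 2: 17968, 3: 8658, 4: 3002, 5: 1201, 6: 592}

# Sum of the per-class counts. This comes to 119168, while the table's own
# Total row reads 119490; the post does not explain the difference.
total = sum(train_counts.values())
largest = max(train_counts.values())

for label, count in train_counts.items():
    share = count / total
    ratio = largest / count
    print(f"class {label}: {count:>6}  share={share:6.2%}  largest/this={ratio:5.1f}")

# Ratio between the most and least frequent class.
imbalance_ratio = largest / min(train_counts.values())
```

The largest class has roughly 89 times as many tokens as the smallest one, which is why a macro-averaged F1 and some form of resampling or class weighting are reasonable choices here.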
The code I am trying is:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE

if __name__ == '__main__':
    # Read the tab-separated files (the two path constants are defined elsewhere)
    train_df = pd.read_csv(TRAINING_DATA_PATH, names=['token', 'tag'], sep='\t').dropna().reset_index(drop=True)
    test_df = pd.read_csv(TEST_DATA_PATH, names=['token', 'tag'], sep='\t')

    # Split into tokens (features) and tags (labels)
    train_X = train_df['token']
    test_X = test_df['token'].astype('U')
    train_y = train_df['tag']
    test_y = test_df['tag'].astype('U')

    # Feature extraction for Naive Bayes
    nb_pipeline = Pipeline([('vect', CountVectorizer()),    # Count occurrences of each word
                            ('tfidf', TfidfTransformer()),  # Reweight counts by inverse document frequency and L2-normalize
                            ])

    f1_list = []
    cv = KFold(n_splits=5)
    for train_index, test_index in cv.split(train_X):
        train_text = train_X[train_index]
        train_label = train_y[train_index]
        val_text = train_X[test_index]
        val_y = train_y[test_index]

        vectorized_text = nb_pipeline.fit_transform(train_text)

        # Oversample the minority classes within the training fold only
        sm = SMOTE(random_state=42)
        train_text_res, train_y_res = sm.fit_resample(vectorized_text, train_label)

        print("\nTraining Data Class Distribution:")
        print(train_label.value_counts())
        print("\nRe-sampled Training Data Class Distribution:")
        print(train_y_res.value_counts())

        # clf = SVC(kernel='rbf', max_iter=1000, class_weight='balanced', verbose=1)
        clf = MultinomialNB()
        # clf = SGDClassifier(loss='log', penalty='l2', alpha=1e-3, max_iter=100, tol=None,
        #                     n_jobs=-1, verbose=1)
        clf.fit(train_text_res, train_y_res)

        predictions = clf.predict(nb_pipeline.transform(val_text))
        f1 = f1_score(val_y, predictions, average='macro')
        f1_list.append(f1)

    print(f1_list)

    # Evaluate the model from the last fold on the held-out test set
    pred = clf.predict(nb_pipeline.transform(test_X))
    print('F1-macro: %s' % f1_score(test_y, pred, average='macro'))
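I want to build n-grams and add them as a feature to the model so that it can understand the context better, but I am not sure how this would work, since testing will again be done at the token level. How can I build n-grams, feed them to the model, and then predict on the test data at the token level again?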
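One possible way to reconcile n-gram features with token-level prediction is to keep one sample per token but describe each token by a short window of its neighbours, so that a vectorizer has adjacent words to work with. The sketch below is only an illustration of that idea under stated assumptions: the function name `context_windows`, the window size, and the `__pad__` placeholder are all hypothetical and do not come from the original post.

```python
def context_windows(tokens, size=1, pad="__pad__"):
    """Return one string per token: the token plus `size` neighbours on each side.

    The output has the same length and order as `tokens`, so each window can
    still be paired with the tag of the token at its centre.
    """
    padded = [pad] * size + list(tokens) + [pad] * size
    return [" ".join(padded[i:i + 2 * size + 1]) for i in range(len(tokens))]


tokens = ["way", "to", "reduce", "the", "amount"]
windows = context_windows(tokens, size=1)
for token, window in zip(tokens, windows):
    print(f"{token!r:10} -> {window!r}")
```

Because there is still exactly one row per token, the labels and the token-level evaluation stay unchanged. Two caveats: the windows have to be built while the tokens are still in their original order, and windows near a class boundary will contain words from the neighbouring class.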
Instead of:

nb_pipeline = Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer())])

do the counting and tf-idf weighting in one step, for both unigrams and bigrams:

from sklearn.feature_extraction.text import TfidfVectorizer
nb_pipeline = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 2)))])

See the docs for more information.
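A small check of what `ngram_range=(1, 2)` extracts may help here. The sketch below fits the vectorizer once on documents that each hold a single token, as in the question's data, and once on a document holding several tokens. The example strings are made up from the excerpt in the question, and the default tokenizer settings are assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

single_tokens = ["way", "to", "reduce"]   # one token per document
phrase = ["way to reduce"]                # several tokens in one document

vec_tokens = TfidfVectorizer(ngram_range=(1, 2)).fit(single_tokens)
vec_phrase = TfidfVectorizer(ngram_range=(1, 2)).fit(phrase)

# Vocabulary learned from single-token documents: unigrams only.
print(sorted(vec_tokens.vocabulary_))
# Vocabulary learned from the multi-token document: unigrams and bigrams.
print(sorted(vec_phrase.vocabulary_))
```

With one token per row there is no neighbouring word to pair with, so word bigrams only show up once each sample carries some surrounding text. Note also that the default `token_pattern` ignores single-character tokens such as "a".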