How to use RandomizedSearchCV on a nested list consisting of one list?
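I built a classifier for sentence boundary detection. For sequence labeling I used conditional random fields, and for hyperparameter optimization I would like to use RandomizedSearchCV. My training data consists of 6 annotated texts, which I merged into a single list of tokens. For the implementation I followed the example in the documentation. Here is my simplified code: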
import scipy.stats
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

# my token list has length n
X_train = [feature_dict_token_1, ... , feature_dict_token_n]
# 3 types of tags: B-SEN for beginning of sentence, E-SEN for end of sentence, O for others
y_train = [tag_token_1, ..., tag_token_n]

# define fixed parameters and parameters to search
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}
labels = ['B-SEN', 'E-SEN', 'O']

# use F1-score for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

# search
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit([X_train], [y_train])
I call rs.fit([X_train], [y_train]) instead of rs.fit(X_train, y_train) because the documentation for crf.train says it needs a list of lists:
fit(X, y, X_dev=None, y_dev=None)
Parameters:
-X (list of lists of dicts) – Feature dicts for several documents (in a python-crfsuite format).
-y (list of lists of strings) – Labels for several documents.
-X_dev ((optional) list of lists of dicts) – Feature dicts used for testing.
-y_dev ((optional) list of lists of strings) – Labels corresponding to X_dev.
But with a list of lists I get this error:
ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=1
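I understand that, because I pass [X_train] and [y_train], CV cannot be applied to a list consisting of a single list, but crf.fit cannot handle the flat X_train and y_train.
How can I solve this?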
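According to the official tutorial, your train/test sets (i.e. X_train and X_test) should each be a list of lists of dicts, with one inner list per sequence. For example: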
[[{'bias': 1.0,
'word.lower()': 'melbourne',
'word[-3:]': 'rne',
'word[-2:]': 'ne',
'word.isupper()': False,
'word.istitle()': True,
'word.isdigit()': False,
'postag': 'NP'},
{'bias': 1.0,
'word.lower()': '(',
'word[-3:]': '(',
'word[-2:]': '(',
'word.isupper()': False,
'word.istitle()': False,
'word.isdigit()': False,
'postag': 'Fpa'},
...],
[{'bias': 1.0,
'word.lower()': '-',
'word[-3:]': '-',
'word[-2:]': '-',
'word.isupper()': False,
'word.istitle()': False,
'word.isdigit()': False,
'postag': 'Fg',
'postag[:2]': 'Fg'},
{'bias': 1.0,
'word.lower()': '25',
'word[-3:]': '25',
'word[-2:]': '25',
'word.isupper()': False,
'word.istitle()': False,
'word.isdigit()': True,
'postag': 'Z'
}]]
The label sets (i.e. y_train and y_test) should each be a list of lists of strings. For example:
[['B-LOC', 'I-LOC'], ['B-ORG', 'O']]
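A quick way to confirm that the data has this nested shape is a small check like the one below. This is only a sketch: check_sequence_shape is a hypothetical helper written for illustration, not part of sklearn-crfsuite, and the sample data is made up.

```python
def check_sequence_shape(X, y):
    """Return True if X is a list of lists of dicts and y is a
    matching list of lists of strings (hypothetical helper)."""
    if len(X) != len(y):
        return False
    for x_seq, y_seq in zip(X, y):
        # every sample must itself be a sequence (a list), not a single token
        if not isinstance(x_seq, list) or not isinstance(y_seq, list):
            return False
        # one tag per token
        if len(x_seq) != len(y_seq):
            return False
        if not all(isinstance(tok, dict) for tok in x_seq):
            return False
        if not all(isinstance(tag, str) for tag in y_seq):
            return False
    return True


# nested: two sequences of two tokens each
X_ok = [[{'bias': 1.0}, {'bias': 1.0}], [{'bias': 1.0}, {'bias': 1.0}]]
y_ok = [['B-LOC', 'I-LOC'], ['B-ORG', 'O']]

# flat: one dict / one tag per top-level element, as in the question
X_flat = [{'bias': 1.0}, {'bias': 1.0}]
y_flat = ['B-SEN', 'O']

print(check_sequence_shape(X_ok, y_ok))
print(check_sequence_shape(X_flat, y_flat))
```

The flat token list from the question fails this check, while wrapping it as [X_train] passes but leaves only one sample for cross-validation to split.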
Then you fit the model as usual:
rs.fit(X_train, y_train)
Please read the tutorial mentioned above to see how it works.
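If the tokens have already been merged into one flat list, as in the question, one possible way to get back to a list of lists is to cut the flat list into shorter sequences. The sketch below uses fixed-size chunks; chunk_sequences and chunk_size are hypothetical names, and fixed-size chunking is an assumption made for illustration, not something the tutorial prescribes. Keeping the 6 original texts as 6 separate sequences would be another option.

```python
def chunk_sequences(X_flat, y_flat, chunk_size):
    """Split a flat token list and its tags into consecutive
    sequences of at most chunk_size tokens (hypothetical helper)."""
    if len(X_flat) != len(y_flat):
        raise ValueError("X_flat and y_flat must have the same length")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    starts = range(0, len(X_flat), chunk_size)
    X_seqs = [X_flat[i:i + chunk_size] for i in starts]
    y_seqs = [y_flat[i:i + chunk_size] for i in starts]
    return X_seqs, y_seqs


# toy data: 10 tokens with made-up features and tags
X_flat = [{'bias': 1.0, 'idx': i} for i in range(10)]
y_flat = ['B-SEN', 'O', 'O', 'E-SEN', 'B-SEN', 'O', 'E-SEN', 'B-SEN', 'O', 'E-SEN']

X_seqs, y_seqs = chunk_sequences(X_flat, y_flat, chunk_size=4)
print([len(seq) for seq in X_seqs])
```

The resulting X_seqs and y_seqs could then be passed to rs.fit(X_seqs, y_seqs), provided there are at least as many sequences as CV folds. Note that fixed-size chunks can cut through the middle of a sentence, which may matter for a sentence boundary task.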