Multi-label out-of-core learning for text data: ValueError on partial fit
I am trying to build a multi-label out-of-core text classifier. As described here, the idea is to read (large-scale) text data sets in batches and partially fit them to the classifiers. Additionally, when you have multi-label instances, as described here, the idea is to build as many binary classifiers as there are classes in the data set, in a one-vs-rest manner.
When combining sklearn's MultiLabelBinarizer and OneVsRestClassifier classes with partial fitting, I get the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
The code is as follows:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
categories = ['a', 'b', 'c']
X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'],['b'],['a','b']]
mlb = MultiLabelBinarizer(classes=categories)
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18, non_negative=True)
clf = OneVsRestClassifier(MultinomialNB(alpha=0.01))
X_train = vectorizer.fit_transform(X)
Y_train = mlb.fit_transform(Y)
clf.partial_fit(X_train, Y_train, classes=categories)
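You can imagine that the last three lines are applied to each mini-batch; I removed that code for the sake of simplicity.
If you remove OneVsRestClassifier and use only MultinomialNB, the code runs fine.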
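This may not be the answer you expected, but I would suggest not using OneVsRestClassifier and instead using the scikit-multilearn library, which is built on top of scikit-learn and provides multi-label classifiers that are more advanced than a simple OneVsRest.
You can find an example of how to use scikit-multilearn in the tutorial. A review of approaches to multi-label classification can be found in Tsoumakas's introduction to MLC.
But if you happen to have labels that co-occur with each other, I recommend a different classifier, for example Label Powerset with label space division using fast greedy community detection on the output space. I explain why this works in my paper about label space division.
Converting your code to use scikit-multilearn would make it look as follows: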
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import IGraphLabelCooccurenceClusterer
from skmultilearn.problem_transform import LabelPowerset
categories = ['a', 'b', 'c']
X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'],['b'],['a','b']]
mlb = MultiLabelBinarizer(classes=categories)
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18, non_negative=True)
X_train = vectorizer.fit_transform(X)
Y_train = mlb.fit_transform(Y)
# base single-label classifier
base_classifier = MultinomialNB(alpha=0.01)
# problem transformation from multi-label to single-label
transformation_classifier = LabelPowerset(base_classifier)
# clusterer dividing the label space using fast greedy modularity maximizing scheme
clusterer = IGraphLabelCooccurenceClusterer('fastgreedy', weighted=True, include_self_edges=True)
# ensemble
clf = LabelSpacePartitioningClassifier(transformation_classifier, clusterer)
clf.fit(X_train, Y_train)
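You are passing y_train as transformed by MultiLabelBinarizer, which has the form [[1, 1, 0], [0, 1, 0], [1, 1, 0]], but you are passing the classes as ['a','b','c'], which then go through this line of the code:
if np.setdiff1d(y, self.classes_):
    raise ValueError(("Mini-batch contains {0} while classes " +
                      "must be subset of {1}").format(np.unique(y),
                                                      self.classes_))
np.setdiff1d returns an array of the values in y that are not found in self.classes_, and here that array has more than one element.
The if statement cannot treat such an array as a single truth value, hence the error.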
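To make the failure mode concrete, here is a minimal sketch that reproduces the same ValueError outside scikit-learn. The integer class ids in `classes` are hypothetical and chosen only to keep dtype handling out of the picture; the point is that np.setdiff1d hands back an array, and a multi-element array has no single truth value.

```python
import numpy as np

# Binarized targets in the form produced by MultiLabelBinarizer in the question.
y = np.array([[1, 1, 0], [0, 1, 0], [1, 1, 0]])
# Hypothetical integer class ids that do not cover the values found in y.
classes = np.array([2, 3, 4])

# Sorted unique values of y that are missing from classes.
leftover = np.setdiff1d(y, classes)

try:
    if leftover:  # same pattern as the quoted check
        pass
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)

# An explicit size test avoids the ambiguity.
has_unknown = leftover.size > 0
```

Testing `leftover.size > 0` (or `len(leftover)`) states the intent explicitly, which is why it does not trip over the ambiguity.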
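First, you should pass the classes in the same numeric format as Y_train.
Now, even if you do that, the internal label_binarizer_ of OneVsRestClassifier will decide that the target is of type "multiclass" rather than "multilabel", and will then refuse to transform the classes correctly. In my opinion, this is a bug in OneVsRestClassifier and/or LabelBinarizer.
Please submit an issue about partial_fit to the scikit-learn GitHub and see what happens.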
Update
Apparently, deciding between "multilabel" and "multiclass" from the target vector (y) is a currently open issue in scikit-learn because of all the complications surrounding it:
- https://github.com/scikit-learn/scikit-learn/issues/7665
- https://github.com/scikit-learn/scikit-learn/issues/5959
- https://github.com/scikit-learn/scikit-learn/issues/7931
- https://github.com/scikit-learn/scikit-learn/issues/8098
- https://github.com/scikit-learn/scikit-learn/issues/7628
- https://github.com/scikit-learn/scikit-learn/pull/2626
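Since the question notes that plain MultinomialNB works with partial fitting, one possible workaround is to do the one-vs-rest split by hand: keep one binary MultinomialNB per label and feed each one its own column of the binarized targets. This is only a sketch under assumptions, not something proposed in the thread. The helper names `partial_fit_batch` and `predict` and the two-batch split of the sample data are hypothetical, and `alternate_sign=False` is used in place of `non_negative=True`, which newer scikit-learn releases no longer accept.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

categories = ['a', 'b', 'c']
mlb = MultiLabelBinarizer(classes=categories)
# alternate_sign=False keeps hashed features non-negative, as MultinomialNB requires.
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18,
                               alternate_sign=False)
# One independent binary classifier per label (manual one-vs-rest).
clfs = [MultinomialNB(alpha=0.01) for _ in categories]


def partial_fit_batch(texts, labels):
    """Update every per-label classifier with one mini-batch."""
    X = vectorizer.transform(texts)   # HashingVectorizer is stateless
    Y = mlb.fit_transform(labels)     # fixed column order given by `categories`
    for j, clf in enumerate(clfs):
        # Each classifier sees a plain binary target, so classes is [0, 1].
        clf.partial_fit(X, Y[:, j], classes=[0, 1])


def predict(texts):
    """Return an indicator matrix with one column per category."""
    X = vectorizer.transform(texts)
    return np.column_stack([clf.predict(X) for clf in clfs])


# The question's sample data, split into two hypothetical mini-batches.
batches = [
    (["This is a test", "This is another attempt"], [['a', 'b'], ['b']]),
    (["And this is a test too!"], [['a', 'b']]),
]
for texts, labels in batches:
    partial_fit_batch(texts, labels)

pred = predict(["This is a test"])
```

The indicator rows returned by `predict` can be mapped back to label names with `mlb.inverse_transform(pred)`. The cost of this approach is that the per-label models are trained independently, so label co-occurrence is ignored.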