Errors when fitting multi-label text classification models
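I am trying to fit a classification model for a multi-label text classification problem.
I have a training set X_train that contains a list of cleaned texts, for example: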
["I am constructing Markov chains with to states and inferring
transition probabilities empirically by simply counting how many
times I saw each transition in my raw data",
"I know the chips only of the players of my table and mine obviously I
also know the total number of chips the max and min amount chips the
players have and the average stackIs it possible to make an
approximation of my probability of winningI have,
...]
and a set of multiple labels y for each text in X_train, for example:
[['hypothesis-testing', 'statistical-significance', 'markov-process'],
['probability', 'normal-distribution', 'games'],
...]
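Now I want to fit a model that can predict the labels for a set of texts X_test, which has the same format as X_train.
I have used MultiLabelBinarizer to transform the labels and TfidfVectorizer to transform the cleaned texts in the training set.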
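from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Binarize the label lists into an indicator matrix (one column per tag)
multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(y)
Y = multilabel_binarizer.transform(y)

# Turn the cleaned texts into TF-IDF features (stopWordList is my own stop-word list)
vectorizer = TfidfVectorizer(stop_words=stopWordList)
vectorizer.fit(X_train)
x_train = vectorizer.transform(X_train)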
But when I try to fit a model, I always get errors. I have tried OneVsRestClassifier and LogisticRegression.
When I fit a OneVsRestClassifier model, I get an error like this:
Traceback (most recent call last):
  File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 348, in process_request
    self.finish_request(request, client_address)
  File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 361, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 696, in __init__
    self.handle()
  File "/usr/local/spark/python/pyspark/accumulators.py", line 268, in handle
    poll(accum_updates)
  File "/usr/local/spark/python/pyspark/accumulators.py", line 241, in poll
    if func():
  File "/usr/local/spark/python/pyspark/accumulators.py", line 245, in accum_updates
    num_updates = read_int(self.rfile)
  File "/usr/local/spark/python/pyspark/serializers.py", line 714, in read_int
    raise EOFError
EOFError
When I fit a LogisticRegression model, I get a message like this:
/opt/conda/envs/data3/lib/python3.6/site-packages/sklearn/linear_model/sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
"the coef_ did not converge", ConvergenceWarning)
Does anyone know where the problem is and how to solve it? Thanks a lot.
OneVsRestClassifier fits one classifier per class. You need to tell it which type of classifier you want (for example, logistic regression).
The following code works for me:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Wrap a base estimator; one LogisticRegression is trained per label column of Y
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(x_train, Y)

# Reuse the vectorizer fitted on the training texts for new texts
X_test = ["I play with Markov chains"]
x_test = vectorizer.transform(X_test)
classifier.predict(x_test)
Output: array([[0, 1, 1, 0, 0, 1]])
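For reference, here is a minimal end-to-end sketch of the same approach that runs on its own. The four training texts and their tags are made-up toy data standing in for the real X_train and y, and stop_words="english" replaces stopWordList, which is not shown in the question. Passing max_iter=1000 is one common way to respond to a ConvergenceWarning (it is a warning, not an error); the value is an assumption rather than a tuned setting. The last step uses inverse_transform to turn the 0/1 prediction matrix back into tag names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data (hypothetical), standing in for the real X_train and y.
X_train = [
    "constructing markov chains and counting transitions between states",
    "probability of winning a poker game given the chips of each player",
    "testing whether a markov chain transition is statistically significant",
    "normal distribution of chip stacks in a card game",
]
y = [
    ["markov-process"],
    ["probability", "games"],
    ["hypothesis-testing", "markov-process"],
    ["normal-distribution", "games"],
]

# Label lists -> binary indicator matrix, one column per tag.
multilabel_binarizer = MultiLabelBinarizer()
Y = multilabel_binarizer.fit_transform(y)

# Texts -> TF-IDF features (built-in English stop words as a stand-in).
vectorizer = TfidfVectorizer(stop_words="english")
x_train = vectorizer.fit_transform(X_train)

# One logistic regression per tag; max_iter raised above the default
# to give the solver more iterations to converge (assumed value).
classifier = OneVsRestClassifier(LogisticRegression(max_iter=1000))
classifier.fit(x_train, Y)

# Predict on new text and map the 0/1 row back to tag names.
x_test = vectorizer.transform(["I play with Markov chains"])
pred = classifier.predict(x_test)
predicted_tags = multilabel_binarizer.inverse_transform(pred)
print(predicted_tags)
```

With so little training data the predicted tag tuple may well be empty; on a real dataset, multilabel_binarizer.classes_ shows which column of the prediction corresponds to which tag.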