Text classification + Naive Bayes + Python : Input contains NaN, infinity or a value too large for dtype('float64')
I am trying to do text classification with Naive Bayes. Here is my code:
#splitting Pandas dataframe into train set and test set
x_train, x_test, y_train, y_test = cross_validation.train_test_split(data['description'], data['category_id'], test_size=0.2, random_state=42)
#production of bag of words from x_train
count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(x_train)
train_vocab = count_vect.get_feature_names()
#training the Naive Bayes classifier
clf = MultinomialNB().fit(x_train_counts, y_train)
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-46-0cb3dc7193bf> in <module>()
1 #training the Naive Bayes classifier
2
----> 3 clf = MultinomialNB().fit(x_train_counts, y_train)
~/anaconda3/envs/tensorflow/lib/python3.5/site-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
577 Returns self.
578 """
--> 579 X, y = check_X_y(X, y, 'csr')
580 _, n_features = X.shape
581
~/anaconda3/envs/tensorflow/lib/python3.5/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
577 else:
578 y = column_or_1d(y, warn=True)
--> 579 _assert_all_finite(y)
580 if y_numeric and y.dtype.kind == 'O':
581 y = y.astype(np.float64)
~/anaconda3/envs/tensorflow/lib/python3.5/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
42 and not np.isfinite(X).all()):
43 raise ValueError("Input contains NaN, infinity"
---> 44 " or a value too large for %r." % X.dtype)
45
46
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The type of x_train_counts is scipy.sparse.csr.csr_matrix.
print(type(x_train_counts))
<class 'scipy.sparse.csr.csr_matrix'>
The type of y_train is pandas.core.series.Series.
print(type(y_train))
<class 'pandas.core.series.Series'>
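I suspect the problem lies in your data['description'] and data['category_id']. Is the first an array-like of n elements containing text, and the second an array-like of n elements containing the labels for the first, e.g. ['0', '1', '3', ...]?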
As a test, simply replacing your data with one of the sklearn datasets gives a correct run:
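from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
dataset = fetch_20newsgroups(subset='train',
                             categories=categories, shuffle=True, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=42)
# build the bag of words from x_train
count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(x_train)
# use get_feature_names() instead on scikit-learn versions older than 1.0
train_vocab = count_vect.get_feature_names_out()
# train the Naive Bayes classifier
clf = MultinomialNB().fit(x_train_counts, y_train)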
Try it out and let me know if it helps.
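The traceback in the question fails inside `_assert_all_finite(y)`, which suggests that the labels, not the count matrix, hold the non-finite value. Below is a minimal sketch of how one might confirm that. The Series `y` is a made-up stand-in for data['category_id'], since the real column is not shown in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data['category_id']; the values are invented.
y = pd.Series([0.0, 1.0, np.nan, 3.0, np.inf])

# Count missing labels (NaN / None). Infinity is not counted as missing.
n_missing = int(y.isnull().sum())

# Flag labels that are not finite numbers (NaN or +/-inf).
as_float = pd.to_numeric(y, errors="coerce")
non_finite = ~np.isfinite(as_float)
bad_rows = y[non_finite].index.tolist()

print(n_missing)
print(bad_rows)
```

If `bad_rows` is non-empty for the real column, those rows would need to be dropped or relabelled before fitting.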
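Before calling train_test_split, or before building the training and test sets from your features to fit the model, it is good practice to run the following commands:
dataframe_name.isnull().any()
This lists each column name with True if at least one NaN value is present.
dataframe_name.isnull().sum()
This lists each column name with the number of NaN values it contains.
That way you can find and handle NaN values before they cause this error.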
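As a rough illustration of these checks, the sketch below runs them on a small DataFrame and then drops the incomplete rows before any split. The column names come from the question; the values are invented, and dropping rows is only one possible way to handle the NaNs.

```python
import numpy as np
import pandas as pd

# Hypothetical data using the column names from the question.
data = pd.DataFrame({
    "description": ["red shoes", "blue shirt", None, "green hat"],
    "category_id": [1.0, np.nan, 2.0, 3.0],
})

has_nan = data.isnull().any()     # True per column if any value is missing
nan_counts = data.isnull().sum()  # number of missing values per column

# Keep only rows where both the text and the label are present.
clean = data.dropna(subset=["description", "category_id"])

print(has_nan)
print(nan_counts)
print(len(clean))
```

The cleaned frame's `description` and `category_id` columns can then be passed to train_test_split in place of the original ones.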