Why does TfidfVectorizer.fit_transform() change the number of samples and labels for my text data?
I have a dataset of 310 rows and 3 columns. All columns are text. One column is text a user typed into a query form, and a second column is a label (one of six) saying which query category the input belongs to.
>>> data.shape
(310, 3)
Before running it through the KMeans algorithm from sklearn.cluster, I am doing the following preprocessing on my data:
v = TfidfVectorizer()
vectorized = v.fit_transform(data)
Now,
>>> vectorized.shape
(3, 4)
From where I stand, it looks like I have lost data. I no longer have my 310 samples. I believe the shape of vectorized refers to [n_samples, n_features].
Why do the values for samples and features change? I would expect 310 samples and 6 features (the number of unique groups I have labelled my data with).
The problem is that TfidfVectorizer() cannot be applied to three columns at once.
According to the documentation:
fit_transform(self, raw_documents, y=None)
Learn vocabulary and idf, return term-document matrix.
This is equivalent to fit followed by transform, but more efficiently
implemented.
Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects
Returns: X : sparse matrix, [n_samples, n_features]
Tf-idf-weighted document-term matrix.
Therefore, it only works on a single column of text data. In your code, it simply iterated over the column names and fitted a transformation on those.
An example to understand what is happening:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
data = pd.DataFrame({'col1': ['this is first sentence', 'this one is the second sentence'],
                     'col2': ['this is first sentence', 'this one is the second sentence'],
                     'col3': ['this is first sentence', 'this one is the second sentence']})
vec = TfidfVectorizer()
vec.fit_transform(data).todense()
# matrix([[1., 0., 0.],
#         [0., 1., 0.],
#         [0., 0., 1.]])
vec.get_feature_names()
# ['col1', 'col2', 'col3']
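The reason is that iterating over a pandas DataFrame yields its column labels, not its rows, so those labels become the "documents". A quick way to see it:

```python
import pandas as pd

data = pd.DataFrame({'col1': ['this is first sentence'],
                     'col2': ['this one is the second sentence'],
                     'col3': ['this is first sentence']})

# Iterating a DataFrame yields column labels, not rows -- so these three
# strings are the "documents" TfidfVectorizer ends up fitting on.
print([doc for doc in data])
# ['col1', 'col2', 'col3']
```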
Now, the solution is that you either have to combine all three columns into one, or apply the vectorizer to each column separately and then stack the results at the end.
Method 1
data.loc[:,'full_text'] = data.apply(lambda x: ' '.join(x), axis=1)
vec = TfidfVectorizer()
X = vec.fit_transform(data['full_text']).todense()
print(X.shape)
# (2, 7)
print(vec.get_feature_names())
# ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
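Since the original goal was clustering, here is a minimal sketch of feeding Method 1's combined matrix into sklearn.cluster.KMeans (on this toy data n_clusters=2; for the actual dataset that would be 6, the number of query categories):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

data = pd.DataFrame({'col1': ['this is first sentence', 'this one is the second sentence'],
                     'col2': ['this is first sentence', 'this one is the second sentence'],
                     'col3': ['this is first sentence', 'this one is the second sentence']})
data['full_text'] = data.apply(' '.join, axis=1)

vec = TfidfVectorizer()
X = vec.fit_transform(data['full_text'])  # sparse (n_samples, n_features)

# KMeans accepts the sparse tf-idf matrix directly; no .todense() needed
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_.shape)  # one cluster label per row
# (2,)
```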
Method 2
from scipy.sparse import hstack

vec = {}
X = []
for col in ['col1', 'col2', 'col3']:
    vec[col] = TfidfVectorizer()                 # one vectorizer per column
    X.append(vec[col].fit_transform(data[col]))  # each block: (2, 7) sparse
stacked_X = hstack(X).todense()
stacked_X.shape
# (2, 21)
for col, v in vec.items():
    print(col)
    print(v.get_feature_names())
# col1
# ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
# col2
# ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
# col3
# ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
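For completeness, Method 2 can also be expressed with sklearn's ColumnTransformer, which fits one TfidfVectorizer per column and stacks the outputs in a single step. Note the column is selected with a bare string ('col1', not ['col1']) so each vectorizer receives the 1-D series of strings it expects. A sketch on the same toy data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

data = pd.DataFrame({'col1': ['this is first sentence', 'this one is the second sentence'],
                     'col2': ['this is first sentence', 'this one is the second sentence'],
                     'col3': ['this is first sentence', 'this one is the second sentence']})

# One (name, transformer, column) triple per text column; a string column
# selector hands each TfidfVectorizer a 1-D column of raw documents.
ct = ColumnTransformer([(col, TfidfVectorizer(), col)
                        for col in ['col1', 'col2', 'col3']])

stacked = ct.fit_transform(data)
print(stacked.shape)  # 7 vocabulary terms per column, 3 columns
# (2, 21)
```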