How fit_transform, transform and TfidfVectorizer work
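I'm working on a fuzzy-matching project and came across a very interesting method: awesome_cossim_top
I understand the definition overall, but I don't understand what happens when we call fit_transform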
import pandas as pd
import sqlite3 as sql
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix
import sparse_dot_topn.sparse_dot_topn as ct
import re
def ngrams(string, n=3):
    string = re.sub(r'[,-./]|\sBD',r'', re.sub(' +', ' ',str(string)))
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]

def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape

    idx_dtype = np.int32

    nnz_max = M*ntop

    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)
    print('ct.sparse_dot_topn: ', ct.sparse_dot_topn)
    return csr_matrix((data,indices,indptr),shape=(M,N))

def get_matches_df(sparse_matrix, A, B, top=100):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = A[sparserows[index]]
        right_side[index] = B[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]

    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similairity': similairity})
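This is the script that confuses me:
Why should we first call fit_transform and then only call transform with the SAME vectorizer?
I tried printing some output from the vectorizer and the matrices, e.g. print(vectorizer.get_feature_names()), but I don't understand the logic.
Could anyone explain this to me?
Thanks a lot!!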
Col_clean = 'fruits_normalized'
Col_dirty = 'fruits'
#read table
data_dirty={f'{Col_dirty}':['I am an apple', 'You are an apple', 'Aple', 'Appls', 'Apples']}
data_clean= {f'{Col_clean}':['apple', 'pear', 'banana', 'apricot', 'pineapple']}
df_clean = pd.DataFrame(data_clean)
df_dirty = pd.DataFrame(data_dirty)
Name_clean = df_clean[f'{Col_clean}'].unique()
Name_dirty= df_dirty[f'{Col_dirty}'].unique()
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
clean_idf_matrix = vectorizer.fit_transform(Name_clean)
dirty_idf_matrix = vectorizer.transform(Name_dirty)
matches = awesome_cossim_top(dirty_idf_matrix, clean_idf_matrix.transpose(),1,0)
matches_df = get_matches_df(matches, Name_dirty, Name_clean, top = 0)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    matches_df.to_excel("output_apple.xlsx")
print('done')
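TfidfVectorizer.fit_transform builds the vocabulary (and learns the IDF weights) from the training dataset. TfidfVectorizer.transform maps that same vocabulary onto the test dataset, so that the test data ends up with the same number of features as the training data. The following example may help: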
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
Create some dummy training data:
train = pd.DataFrame({'Text' :['I am a data scientist','Cricket is my favorite sport', 'I work on Python regularly', 'Python is very fast for data mining', 'I love playing cricket'],
'Category' :['Data_Science','Cricket','Data_Science','Data_Science','Cricket']})
And a small test dataset:
test = pd.DataFrame({'Text' :['I am new to data science field', 'I play cricket on weekends', 'I like writing Python codes'],
'Category' :['Data_Science','Cricket','Data_Science']})
Create a TfidfVectorizer() object named vectorizer:
vectorizer = TfidfVectorizer()
Fit it on the training data:
X_train = vectorizer.fit_transform(train['Text'])
# On scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2
print(vectorizer.get_feature_names())
#['am', 'cricket', 'data', 'fast', 'favorite', 'for', 'is', 'love', 'mining', 'my', 'on', 'playing', 'python', 'regularly', 'scientist', 'sport', 'very', 'work']
feature_names = vectorizer.get_feature_names()
df = pd.DataFrame(X_train.toarray(), columns=feature_names)
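To see what fit_transform actually stores on the vectorizer, one can inspect its fitted attributes. This is a minimal sketch, assuming the same five training sentences and the default TfidfVectorizer settings (in particular smooth_idf=True). vocabulary_ and idf_ are scikit-learn's fitted attributes; the other variable names are only illustrative.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = [
    'I am a data scientist',
    'Cricket is my favorite sport',
    'I work on Python regularly',
    'Python is very fast for data mining',
    'I love playing cricket',
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# vocabulary_ maps every learned token to its column index in the matrix
vocab = vectorizer.vocabulary_
print(len(vocab))        # number of columns in X_train
print(vocab['python'])   # column that holds the weight for "python"

# idf_ holds one inverse-document-frequency weight per column.
# With the default smoothing: idf = ln((1 + n_docs) / (1 + doc_freq)) + 1
n_docs = len(train_texts)
cricket_idf = vectorizer.idf_[vocab['cricket']]
expected = math.log((1 + n_docs) / (1 + 2)) + 1   # "cricket" occurs in 2 documents
print(cricket_idf, expected)
```

A later call to transform reuses exactly this token-to-column mapping and these IDF weights; it does not learn anything new.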
Now see what happens if you do the same thing on the test dataset:
vectorizer_test = TfidfVectorizer()
X_test = vectorizer_test.fit_transform(test['Text'])
print(vectorizer_test.get_feature_names())
#['am', 'codes', 'cricket', 'data', 'field', 'like', 'new', 'on', 'play', 'python', 'science', 'to', 'weekends', 'writing']
feature_names_test = vectorizer_test.get_feature_names()
df_test= pd.DataFrame(X_test.toarray(),columns = feature_names_test)
This built a separate vocabulary from the test dataset: it has 14 unique words (columns), whereas the training data has 18 words (columns).
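Now, if you train a machine learning algorithm for text classification on the training matrix and then try to predict on this separately fitted test matrix, it will fail with an error saying that the features differ between the training and test data.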
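The failure can be reproduced with any scikit-learn classifier. The sketch below assumes MultinomialNB purely for illustration (the answer does not name a model) and refits a second vectorizer on the test data to provoke the mismatch.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train = pd.DataFrame({
    'Text': ['I am a data scientist', 'Cricket is my favorite sport',
             'I work on Python regularly', 'Python is very fast for data mining',
             'I love playing cricket'],
    'Category': ['Data_Science', 'Cricket', 'Data_Science', 'Data_Science', 'Cricket'],
})
test = pd.DataFrame({
    'Text': ['I am new to data science field', 'I play cricket on weekends',
             'I like writing Python codes'],
})

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train['Text'])        # 18 columns
clf = MultinomialNB().fit(X_train, train['Category'])    # model choice is an assumption

# Wrong: a second vectorizer learns its own 14-column vocabulary
X_test_wrong = TfidfVectorizer().fit_transform(test['Text'])

mismatch_error = None
try:
    clf.predict(X_test_wrong)
except ValueError as err:
    # scikit-learn rejects input whose feature count differs from training
    mismatch_error = err
    print('prediction failed:', err)
```

Even if the two matrices happened to have the same number of columns, the columns would stand for different words, so the predictions would be meaningless.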
To overcome this error, in text classification we do something like this:
X_test_from_train = vectorizer.transform(test['Text'])
feature_names_test_from_train = vectorizer.get_feature_names()
df_test_from_train = pd.DataFrame(X_test_from_train.toarray(),columns = feature_names_test_from_train)
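A quick way to check what transform did with the test sentences is to compare shapes and list which test tokens already have a column. This is a sketch using the same sentences as above; build_analyzer() returns the vectorizer's own tokenizer, and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = [
    'I am a data scientist',
    'Cricket is my favorite sport',
    'I work on Python regularly',
    'Python is very fast for data mining',
    'I love playing cricket',
]
test_texts = [
    'I am new to data science field',
    'I play cricket on weekends',
    'I like writing Python codes',
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)   # learns the 18-word vocabulary
X_test = vectorizer.transform(test_texts)         # reuses it, learns nothing new
print(X_train.shape, X_test.shape)                # same number of columns

# Tokens of each test sentence that exist in the training vocabulary;
# every other token is silently dropped by transform.
analyzer = vectorizer.build_analyzer()
kept = [[tok for tok in analyzer(text) if tok in vectorizer.vocabulary_]
        for text in test_texts]
for text, tokens in zip(test_texts, kept):
    print(text, '->', tokens)
```

Words such as "science" or "weekends" never appeared in the training data, so they contribute nothing to the test matrix.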
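Notice that on the test data we did not call fit_transform; we called transform instead. The reason is the same as above: when predicting on test data we want exactly the features learned from the training data, so words that occur only in the test data are ignored and we never hit a feature-mismatch error.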
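The same reasoning applies to the fuzzy-matching script in the question: the clean and dirty names must share one set of n-gram columns, otherwise their dot product does not compare like with like. The sketch below is a simplified stand-in, not the original method: char_ngrams is a hypothetical helper without the regex cleanup of ngrams, and a plain sparse matrix product replaces awesome_cossim_top.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams (no regex cleanup)."""
    text = str(text)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

clean_names = ['apple', 'pear', 'banana', 'apricot', 'pineapple']
dirty_names = ['I am an apple', 'You are an apple', 'Aple', 'Appls', 'Apples']

vectorizer = TfidfVectorizer(min_df=1, analyzer=char_ngrams)
clean_matrix = vectorizer.fit_transform(clean_names)   # learns the trigram columns
dirty_matrix = vectorizer.transform(dirty_names)       # reuses those same columns

# Rows are L2-normalised by default, so the dot product is the cosine similarity.
similarity = (dirty_matrix @ clean_matrix.T).toarray()

# For each dirty name, pick the clean name with the highest similarity
best = similarity.argmax(axis=1)
for dirty, idx, row in zip(dirty_names, best, similarity):
    print(f'{dirty!r} -> {clean_names[idx]!r} (similarity {row[idx]:.2f})')
```

Had the dirty names been passed through fit_transform on a second vectorizer, the two matrices would have had different column counts and the product with the transposed clean matrix would not even be defined.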
Hope this helps!!