KeyError on FeatureUnion between TF-IDF and custom features
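I am trying to create a model in which I use TfidfVectorizer on a text column and add extra data from a couple of other columns. The code below reproduces what I am trying to do and the error I get.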
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB


class ParStats(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print(X[0])
        return [{'feat_1': x['feat_1'],
                 'feat_2': x['feat_2']}
                for x in X]


class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


def feature_union_test():
    # create test data frame
    test_data = {
        'text': ['And the silken, sad, uncertain rustling of each purple curtain',
                 'Thrilled me filled me with fantastic terrors never felt before',
                 'So that now, to still the beating of my heart, I stood repeating',
                 'Tis some visitor entreating entrance at my chamber door',
                 'Some late visitor entreating entrance at my chamber door',
                 'This it is and nothing more'],
        'feat_1': [4, 7, 10, 7, 4, 6],
        'feat_2': [1, 5, 5, 1, 1, 10],
        'ignore': [1, 1, 1, 0, 0, 0]
    }
    test_df = pd.DataFrame(data=test_data)
    y_train = test_df['ignore'].values.astype('int')

    # Feature Union Pipeline
    pipeline = FeatureUnion([
        ('text', Pipeline([
            ('selector', ItemSelector(key='text')),
            ('tfidf', TfidfVectorizer(max_df=0.5)),
        ])),
        ('parstats', Pipeline([
            ('stats', ParStats()),
            ('vect', DictVectorizer()),
        ]))
    ])

    tfidf = pipeline.fit_transform(test_df)

    # fits Naive Bayes
    clf = BernoulliNB().fit(tfidf, y_train)


feature_union_test()
When I run this, I get the following error message:
Traceback (most recent call last):
  File "C:\Users\Rogerio\Python VENV\lib\site-packages\pandas\core\indexes\base.py", line 3064, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
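I have tried several different variations of the pipeline, but I always end up with some kind of error, so I am clearly missing something. What am I doing wrong?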
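The error occurs in the transform method of your ParStats class.

First, a pandas DataFrame does not support positional indexing with plain brackets: X[0] looks for a column labelled 0, not the first row, so your print(X[0]) raises the KeyError: 0 you are seeing.

Second, you cannot iterate over a pandas DataFrame the way you do. Iterating over the DataFrame itself yields its column labels, not its rows.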
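To make both points concrete, here is a small sketch on a toy DataFrame. The frame and the variable names are made up for illustration, and itertuples() is only one of several ways to walk over rows.

```python
import pandas as pd

# Toy frame with the same column names as in the question (values are illustrative).
df = pd.DataFrame({'feat_1': [4, 7, 10], 'feat_2': [1, 5, 5]})

# Plain brackets look up a column label, so 0 is treated as a column name.
try:
    df[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Positional row access goes through .iloc instead.
first_row = df.iloc[0]

# Iterating over the DataFrame itself yields column labels, not rows.
labels = list(df)

# Rows can be visited with itertuples(), for example.
rows = [(r.feat_1, r.feat_2) for r in df.itertuples(index=False)]
```

Here lookup_failed ends up True, labels holds the two column names, and rows holds one tuple per row.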
Here is one possible working version of the function:
    def transform(self, X):
        return [{'feat_1': x[0], 'feat_2': x[1]}
                for x in X[['feat_1', 'feat_2']].values]
Of course, there are many other possible solutions, but you get the idea.
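As one such alternative, the list of dicts could probably be built with DataFrame.to_dict(orient='records') rather than by hand. This is only a sketch: the small DataFrame at the bottom is made up for illustration, and the class is a variation on the ParStats from the question, not a drop-in from any library.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer


class ParStats(BaseEstimator, TransformerMixin):
    """Turn the two numeric columns into a list of per-row dicts."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One dict per row, keyed by column name, which is what DictVectorizer expects.
        return X[['feat_1', 'feat_2']].to_dict(orient='records')


# Illustrative data only.
df = pd.DataFrame({'feat_1': [4, 7, 10], 'feat_2': [1, 5, 5]})
records = ParStats().fit_transform(df)
matrix = DictVectorizer().fit_transform(records)
```

The resulting matrix has one row per DataFrame row and one column per feature name.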
OK. So after the discussion in the comments, this is your problem statement:

You want to pass the columns feat_1 and feat_2, along with the tfidf of the text column, to your ML model.

So the only thing you need to do is:
# Feature Union Pipeline
pipeline = FeatureUnion([
    ('text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('tfidf', TfidfVectorizer(max_df=0.5)),
    ])),
    ('non_text', ItemSelector(key=['feat_1', 'feat_2']))
])

tfidf = pipeline.fit_transform(test_df)
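The ItemSelector, as already defined, can select several features at once when given a list of column names. Those columns are appended after the tfidf data returned by the text part of the feature union.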
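Put together, the whole thing might look like the sketch below. It reuses the ItemSelector and the test data from the question and drops ParStats and DictVectorizer entirely. The names features and predictions are illustrative, and the sketch assumes the union can stack the DataFrame returned by ItemSelector next to the sparse tfidf matrix.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one column (string key) or several columns (list key)."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


test_df = pd.DataFrame({
    'text': ['And the silken, sad, uncertain rustling of each purple curtain',
             'Thrilled me filled me with fantastic terrors never felt before',
             'So that now, to still the beating of my heart, I stood repeating',
             'Tis some visitor entreating entrance at my chamber door',
             'Some late visitor entreating entrance at my chamber door',
             'This it is and nothing more'],
    'feat_1': [4, 7, 10, 7, 4, 6],
    'feat_2': [1, 5, 5, 1, 1, 10],
    'ignore': [1, 1, 1, 0, 0, 0],
})
y_train = test_df['ignore'].values.astype('int')

pipeline = FeatureUnion([
    # tfidf columns come first ...
    ('text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('tfidf', TfidfVectorizer(max_df=0.5)),
    ])),
    # ... followed by the two numeric columns, passed through unchanged.
    ('non_text', ItemSelector(key=['feat_1', 'feat_2'])),
])

features = pipeline.fit_transform(test_df)
clf = BernoulliNB().fit(features, y_train)
predictions = clf.predict(features)
```

One thing to keep in mind: BernoulliNB binarizes its input at 0.0 by default, so the positive values in feat_1 and feat_2 all collapse to 1. If those magnitudes matter, a different classifier or a different encoding of the numeric columns may be worth considering.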