scikit-learn: FeatureUnion to include hand crafted features
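I am performing multi-label classification on text data.
I want to use a combination of tfidf features and custom linguistic features, similar to the example here using FeatureUnion.
I have already generated the custom linguistic features. They take the form of a dictionary in which the keys represent the labels and the values (lists) represent the features.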
custom_features_dict = {'contact': ['contact details', 'e-mail'],
                        'demographic': ['gender', 'age', 'birth'],
                        'location': ['location', 'geo']}
The training data is structured as follows:
text                                             contact  demographic  location
---                                              ---      ---          ---
'provide us with your date of birth and e-mail'  1        1            0
'contact details and location will be stored'    1        0            1
'date of birth should be before 2004'            0        1            0
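How can the dict above be incorporated into FeatureUnion? My understanding is that a user-defined function should be called which returns boolean values corresponding to the presence or absence of the string values (from custom_features_dict) in the training data.
For the given training data, this gives the following list of dicts: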
[
    {
        'contact': 1,
        'demographic': 1,
        'location': 0
    },
    {
        'contact': 1,
        'demographic': 0,
        'location': 1
    },
    {
        'contact': 0,
        'demographic': 1,
        'location': 0
    },
]
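The list above can be derived from the texts rather than typed by hand. The following is a minimal sketch, assuming plain substring matching; the helper name `keyword_flags` is hypothetical and not part of the original post.

```python
# Keyword lists per label, copied from the question.
custom_features_dict = {
    'contact': ['contact details', 'e-mail'],
    'demographic': ['gender', 'age', 'birth'],
    'location': ['location', 'geo'],
}

texts = [
    'provide us with your date of birth and e-mail',
    'contact details and location will be stored',
    'date of birth should be before 2004',
]

def keyword_flags(text, features_dict):
    """Hypothetical helper: map each label to 1 if any of its keywords occurs in text, else 0."""
    return {
        label: int(any(keyword in text for keyword in keywords))
        for label, keywords in features_dict.items()
    }

# One dict per text, in the same order as the texts.
my_features = [keyword_flags(text, custom_features_dict) for text in texts]
for row in my_features:
    print(row)
```

Because the dicts are computed from the text alone, the same function can be applied to unseen documents at prediction time.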
How can the list above be used to implement fit and transform?
The code is as follows:
import pandas as pd
from io import StringIO
from nltk.corpus import stopwords
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC
#from sklearn.metrics import accuracy_score

# TfidfVectorizer expects stop_words as a list (or the string 'english')
stop_words = stopwords.words('english')

data = StringIO(u'''text,contact,demographic,location
provide us with your date of birth and e-mail,1,1,0
contact details and location will be stored,0,1,1
date of birth should be before 2004,0,1,0''')
df = pd.read_csv(data)

custom_features_dict = {'contact': ['contact details', 'e-mail'],
                        'demographic': ['gender', 'age', 'birth'],
                        'location': ['location', 'geo']}

my_features = [
    {
        'contact': 1,
        'demographic': 1,
        'location': 0
    },
    {
        'contact': 1,
        'demographic': 0,
        'location': 1
    },
    {
        'contact': 0,
        'demographic': 1,
        'location': 0
    },
]

bow_pipeline = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer(stop_words=stop_words)),
    ]
)

manual_pipeline = Pipeline(
    steps=[
        # This needs to be fixed
        ("custom_features", my_features),
        ("dict_vect", DictVectorizer()),
    ]
)

combined_features = FeatureUnion(
    transformer_list=[
        ("bow", bow_pipeline),
        ("manual", manual_pipeline),
    ]
)

final_pipeline = Pipeline([
    ('combined_features', combined_features),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])

labels = ['contact', 'demographic', 'location']
for label in labels:
    final_pipeline.fit(df['text'], df[label])
You have to define a transformer that takes your text as input. Something like this:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

custom_features_dict = {'contact': ['contact details', 'e-mail'],
                        'demographic': ['gender', 'age', 'birth'],
                        'location': ['location', 'geo']}

# helper function which returns 1 if one of the words occurs in the text, else 0
# you can add more words or categories to custom_features_dict if you want
def is_words_present(text, listofwords):
    for word in listofwords:
        if word in text:  # substring match
            return 1
    return 0

class CustomFeatureTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, custom_feature_dict):
        self.custom_feature_dict = custom_feature_dict

    def fit(self, x, y=None):
        # nothing to learn: the keyword lists are fixed
        return self

    def transform(self, data):
        result_arr = []
        for text in data:
            arr = []
            for key in self.custom_feature_dict:
                arr.append(is_words_present(text, self.custom_feature_dict[key]))
            result_arr.append(arr)
        # one row per text, one column per key of custom_feature_dict
        return np.array(result_arr)
Note: this transformer directly generates an array for each text, like this: [1, 0, 1]. It does not generate a dictionary, so we can do without the DictVectorizer.
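One caveat of substring matching is that a short keyword such as 'age' also fires inside longer words like 'message'. If that matters, a whole-word check could be swapped in. This is only a sketch under that assumption; `contains_whole_phrase` is a hypothetical name, and it relies solely on the standard `re` module.

```python
import re

def contains_whole_phrase(text, phrases):
    """Return 1 if any phrase occurs in text as a whole word or phrase, else 0."""
    for phrase in phrases:
        # The lookarounds reject matches that sit inside a longer word.
        pattern = r'(?<!\w)' + re.escape(phrase) + r'(?!\w)'
        if re.search(pattern, text, flags=re.IGNORECASE):
            return 1
    return 0

demographic_words = ['gender', 'age', 'birth']
# 'age' only appears inside 'message', so this is not counted as a hit.
print(contains_whole_phrase('your message was stored', demographic_words))
# Here 'age' is a separate word, so it is counted.
print(contains_whole_phrase('please state your age', demographic_words))
```

Such a function could replace the keyword check inside the transformer without changing anything else in the pipeline.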
I also changed the way the multi-label classification is handled, see here:
# first, I generate a new column in the dataframe with all the labels per row:
def create_textlabels_array(row):
    arr = []
    for label in ['contact', 'demographic', 'location']:
        if row[label] == 1:
            arr.append(label)
    return arr

df['textlabels'] = df.apply(create_textlabels_array, axis=1)

# then we generate the binarized labels:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer().fit(df['textlabels'])
y = mlb.transform(df['textlabels'])
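Since the label columns in this dataframe are already 0/1 indicators, the binarized matrix should coincide with the columns themselves. The sketch below checks that on a small stand-in dataframe (values copied from the CSV in the question); passing `classes=labels` to fix the column order is my own choice, not something the answer does.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

labels = ['contact', 'demographic', 'location']

# Stand-in for the label columns of the question's dataframe.
df = pd.DataFrame({
    'contact': [1, 0, 0],
    'demographic': [1, 1, 1],
    'location': [0, 1, 0],
})

# Route 1: per-row label lists, then binarize them.
textlabels = [[label for label in labels if row[label] == 1] for _, row in df.iterrows()]
mlb = MultiLabelBinarizer(classes=labels).fit(textlabels)
y_from_mlb = mlb.transform(textlabels)

# Route 2: use the indicator columns directly.
y_direct = df[labels].to_numpy()

print((y_from_mlb == y_direct).all())
```

The MultiLabelBinarizer route remains convenient for its `inverse_transform`, which maps predictions back to label names.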
Now we can put everything together into the pipeline:
bow_pipeline = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer(stop_words=stop_words)),
    ]
)

manual_pipeline = Pipeline(
    steps=[
        ("custom_vect", CustomFeatureTransformer(custom_features_dict)),
    ]
)

combined_features = FeatureUnion(
    transformer_list=[
        ("bow", bow_pipeline),
        ("manual", manual_pipeline),
    ]
)

final_pipeline = Pipeline([
    ('combined_features', combined_features),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])

# train your pipeline
final_pipeline.fit(df['text'], y)

# let's predict something (note: there is of course very little training data in this example)
pred = final_pipeline.predict(["write an e-mail to our location please"])
print(pred)  #output: [0, 1, 1]

# map the predicted array back to the actual labels:
print(mlb.inverse_transform(pred))  #output: [('demographic', 'location')]
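If we only want to fix the part of the code marked as needing to be fixed, all we need to do is implement a new estimator that extends class sklearn.base.BaseEstimator (class TemplateClassifier is a good example here).
However, there seems to be a conceptual mistake here. The information in the list my_features appears to be the labels themselves (well, one could argue that they are very strong features...). So we should not put the labels into the feature pipeline.
As stated here: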
Transformers are usually combined with classifiers, regressors or
other estimators to build a composite estimator. The most common tool
is a Pipeline. Pipeline is often used in combination with FeatureUnion
which concatenates the output of transformers into a composite feature
space. TransformedTargetRegressor deals with transforming the target
(i.e. log-transform y). In contrast, Pipelines only transform the
observed data (X).
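To make the "concatenates the output of transformers" part of the quote concrete, here is a small sketch that checks the width of a FeatureUnion output against the widths of its parts. The choice of a character-level CountVectorizer as the second transformer is purely illustrative and not from the original post.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

texts = [
    'provide us with your date of birth and e-mail',
    'contact details and location will be stored',
    'date of birth should be before 2004',
]

union = FeatureUnion(
    transformer_list=[
        ("word_tfidf", TfidfVectorizer()),
        ("char_counts", CountVectorizer(analyzer="char")),
    ]
)
# Only X (the texts) goes through the transformers; no labels are involved.
X = union.fit_transform(texts)

# After fitting, transformer_list holds the fitted transformers.
fitted = dict(union.transformer_list)
n_word = len(fitted["word_tfidf"].vocabulary_)
n_char = len(fitted["char_counts"].vocabulary_)

# The combined matrix has one row per text and the columns of both parts side by side.
print(X.shape, n_word + n_char)
```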
That said, if you still want to put the information from that list into a transform method, it should look something like this:
def transform_str(one_line_text: str) -> dict:
    """Transform one line of text into dict features using manually extracted information."""
    # manually extracted information
    custom_features_dict = {'contact': ['contact details', 'e-mail'],
                            'demographic': ['gender', 'age', 'birth'],
                            'location': ['location', 'geo']}
    # simple tokenization; it can be improved using some text pre-processing lib
    tokenized_text = one_line_text.split(" ")
    output = dict()
    for feature, tokens in custom_features_dict.items():
        output[feature] = False
        for word in tokenized_text:
            # exact token match, so multi-word entries such as 'contact details' never match
            if word in tokens:
                output[feature] = True
    return output

def transform(text_list: list) -> list:
    output = list()
    for one_line_text in text_list:
        output.append(transform_str(one_line_text))
    return output
In this case you do not need a fit method, because the fitting was done manually.
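One way to plug such a stateless function into a pipeline, sketched here as an assumption rather than as the answer's own method, is to wrap it in scikit-learn's FunctionTransformer and follow it with DictVectorizer. The function `texts_to_dicts` and the constant `KEYWORDS` are hypothetical names, and substring matching is used so that multi-word keywords can match.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Keyword lists per label, copied from the question.
KEYWORDS = {
    'contact': ['contact details', 'e-mail'],
    'demographic': ['gender', 'age', 'birth'],
    'location': ['location', 'geo'],
}

def texts_to_dicts(text_list):
    """Hypothetical stateless step: one {label: bool} dict per input text."""
    return [
        {label: any(keyword in text for keyword in keywords)
         for label, keywords in KEYWORDS.items()}
        for text in text_list
    ]

manual_pipeline = Pipeline(
    steps=[
        # FunctionTransformer supplies a no-op fit around the function.
        ("custom_features", FunctionTransformer(texts_to_dicts)),
        # sparse=False returns a dense array, which is easier to inspect.
        ("dict_vect", DictVectorizer(sparse=False)),
    ]
)

X_manual = manual_pipeline.fit_transform([
    'provide us with your date of birth and e-mail',
    'contact details and location will be stored',
    'date of birth should be before 2004',
])
print(manual_pipeline.named_steps["dict_vect"].feature_names_)
print(X_manual)
```

A pipeline built this way can be used as the "manual" entry of the FeatureUnion, because it only looks at the text and never at the labels.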