How to store predicted classes matching the pre-vectorized X in Python Scikit-learn?
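I want to predict gender from names. Not only the name itself, but also features derived from it, such as the extracted "last name". My code flow is: import the data into df > specify the lr classifier and the dv DictVectorizer > create features with functions > run the DictVectorizer > train. I would like to do the following but cannot find any resources on how to do it.
1) I want to add the predicted classes (0 and 1) back to the original data set, or to a data set where I can see both the names and the predicted gender classes. Currently my y_test_predictions only corresponds to X_test, which is a sparse matrix.
2) How can I keep the trained classifier and use it to predict the gender for a different data set with a bunch of names? And how can I plug in a single name such as "Rick Grime" and have the classifier tell me which gender it predicts?
I have done something similar with nltk, but cannot find any example or reference for doing this in Scikit-learn.
Code: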
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import re
import random
import time
from random import randint
import csv
import sys
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import confusion_matrix as sk_confusion_matrix
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

data = pd.read_csv("file.csv", header=0, encoding="utf-8")
df = DataFrame(data)
dv = DictVectorizer()
lr = LogisticRegression()
X = df.raw_name.values
X2 = df.name.values
y = df.gender.values

def feature_full_name(nameString):
    try:
        full_name = nameString
        if len(full_name) > 1:  # reject names with only 1 character
            return full_name
        else:
            return '?'
    except Exception:
        return '?'

def feature_full_last_name(nameString):
    try:
        last_name = nameString.rsplit(None, 1)[-1]
        if len(last_name) > 1:  # reject names with only 1 character
            return last_name
        else:
            return '?'
    except Exception:
        return '?'

def feature_name_entity(nameString2):
    # number of space-separated parts in the name
    space = 0
    try:
        for i in nameString2:
            if i == ' ':
                space += 1
        return space + 1
    except Exception:
        return 0

my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
my_dict2 = [{'name-entity': feature_name_entity(feature_full_name(i))} for i in X2]
# merge the two feature dicts of each row (dict.items() can no longer be added in Python 3)
all_dict = [{**d1, **d2} for d1, d2 in zip(my_dict, my_dict2)]
newX = dv.fit_transform(all_dict)
X_train, X_test, y_train, y_test = train_test_split(newX, y, test_size=0.3)
lr.fit(X_train, y_train)
y_test_predictions = lr.predict(X_test)
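I would use some of scikit-learn's built-in tools to split the dataframe, vectorize the names, and predict the result. You can then add the predicted result back to the test dataframe. For example, using a small set of names: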
data = {'Bruce Lee': 'Male',
        'Bruce Banner': 'Male',
        'Bruce Springsteen': 'Male',
        'Bruce Willis': 'Male',
        'Sarah McLaughlin': 'Female',
        'Sarah Silverman': 'Female',
        'Sarah Palin': 'Female',
        'Sarah Hyland': 'Female'}
import pandas as pd
df = pd.DataFrame.from_dict(data, orient='index').reset_index()
df.columns = ['name', 'gender']
print(df)
name gender
0 Sarah Silverman Female
1 Sarah Palin Female
2 Bruce Springsteen Male
3 Bruce Banner Male
4 Bruce Lee Male
5 Sarah Hyland Female
6 Sarah McLaughlin Female
7 Bruce Willis Male
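Now we can use scikit-learn's CountVectorizer to count the words in the names; this produces essentially the same output as what you did above, except that it does not apply your name-length filtering and similar checks. For ease of use, we'll put it in a pipeline with a cross-validated logistic regression: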
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
clf = make_pipeline(CountVectorizer(), LogisticRegressionCV(cv=2))
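To make the first pipeline step concrete, here is a minimal sketch of what CountVectorizer produces on its own. The four-name list is just an illustrative subset of the data above, and the variable names are mine.

```python
from sklearn.feature_extraction.text import CountVectorizer

names = ["Bruce Lee", "Bruce Banner", "Sarah Palin", "Sarah Hyland"]

# Default settings: lowercase the text and keep tokens of two or more characters.
vec = CountVectorizer()
X = vec.fit_transform(names)  # sparse matrix, one row per name, one column per token

# vocabulary_ maps each token to its column index; list the tokens in column order.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(X.toarray())
```

Each row has a 1 in the column of every word that occurs in that name, so the classifier effectively learns one weight per first or last name it has seen.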
Now we can split the data into train/test sets, fit the pipeline, and assign the result:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
df_train, df_test = train_test_split(df, train_size=0.5, random_state=0)
clf.fit(df_train['name'], df_train['gender'])
df_test = df_test.copy() # so we can modify it
df_test['predicted'] = clf.predict(df_test['name'])
print(df_test)
name gender predicted
6 Sarah McLaughlin Female Female
2 Bruce Springsteen Male Male
1 Sarah Palin Female Female
7 Bruce Willis Male Male
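If you would rather keep a pre-vectorized sparse matrix, as in the question, one option is to split the row labels together with X and y: train_test_split accepts any number of equally long arrays and splits them consistently. The sketch below assumes a small stand-in dataframe with 0/1 labels; the first/last features and all variable names are illustrative, not taken from your data.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "name": ["Bruce Lee", "Bruce Banner", "Bruce Willis", "Bruce Springsteen",
             "Sarah Palin", "Sarah Hyland", "Sarah Silverman", "Sarah McLaughlin"],
    "gender": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Vectorize up front, as in the question; rows of X are in the same order as df.
features = [{"first": n.split()[0], "last": n.split()[-1]} for n in df["name"]]
X = DictVectorizer().fit_transform(features)

# Split the index labels alongside X and y so every test row can be traced back.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, df["gender"].values, df.index.values,
    test_size=0.5, random_state=0, stratify=df["gender"].values)

lr = LogisticRegression().fit(X_train, y_train)

# Look up the original rows by label and attach the predictions.
results = df.loc[idx_test].copy()
results["predicted"] = lr.predict(X_test)
print(results)
```

Here idx_test holds the dataframe labels of the rows in X_test, in the same order, so results shows name, true class, and predicted class side by side.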
Similarly, we can pass a list of names to the pipeline and get predictions:
>>> clf.predict(['Bruce Campbell', 'Sarah Roemer'])
array(['Male', 'Female'], dtype=object)
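For the second part of the question (keeping the trained classifier for later use), one common option is to serialize the whole fitted pipeline, for example with the standard-library pickle module. This is only a sketch: it uses a plain LogisticRegression instead of LogisticRegressionCV to keep the toy example small, and it pickles to an in-memory bytes object where you would normally write a file.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

names = ["Bruce Lee", "Bruce Banner", "Bruce Springsteen", "Bruce Willis",
         "Sarah McLaughlin", "Sarah Silverman", "Sarah Palin", "Sarah Hyland"]
genders = ["Male"] * 4 + ["Female"] * 4

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(names, genders)

# Serialize the fitted pipeline (vectorizer vocabulary and model weights together).
blob = pickle.dumps(clf)

# Later, possibly in another process: restore it and predict on raw strings.
restored = pickle.loads(blob)
print(restored.predict(["Bruce Campbell", "Sarah Roemer"]))
```

Because the vectorizer travels with the model, the restored object accepts plain name strings. Pickles are not guaranteed to load across scikit-learn versions, and a name whose words never appeared in training (such as "Rick Grime" with this toy data) is predicted from the intercept alone, so the result for it carries no real information.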
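If you want to do more sophisticated logic in the text vectorization, you can create a custom transformer for the input data: a web search for "scikit-learn custom transformer" should turn up a good set of examples.
Edit: here is an example that uses a custom transformer to generate dicts from the input names: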
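from sklearn.base import BaseEstimator, TransformerMixin

# BaseEstimator is added so the class also works with recent scikit-learn releases
class ExtractNames(TransformerMixin, BaseEstimator):
    """Turn each name string into a dict holding its first and last word."""
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn
    def transform(self, X):
        return [{'first': name.split()[0],
                 'last': name.split()[-1]}
                for name in X]

trans = ExtractNames()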
>>> trans.fit_transform(df['name'])
[{'first': 'Bruce', 'last': 'Springsteen'},
{'first': 'Bruce', 'last': 'Banner'},
{'first': 'Sarah', 'last': 'Hyland'},
{'first': 'Sarah', 'last': 'Silverman'},
{'first': 'Sarah', 'last': 'Palin'},
{'first': 'Bruce', 'last': 'Lee'},
{'first': 'Bruce', 'last': 'Willis'},
{'first': 'Sarah', 'last': 'McLaughlin'}]
Now you can put this in a pipeline with DictVectorizer to generate sparse features:
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(ExtractNames(), DictVectorizer())
>>> pipe.fit_transform(df['name'])
<8x10 sparse matrix of type '<class 'numpy.float64'>'
with 16 stored elements in Compressed Sparse Row format>
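The 8x10 shape follows from how DictVectorizer encodes string values: each distinct (key, value) pair becomes its own one-hot column, so 2 first names plus 8 last names give 10 columns, and each of the 8 rows has 2 non-zero entries (16 stored elements). A small sketch with three hand-written rows (illustrative data, not the full set) shows the generated columns:

```python
from sklearn.feature_extraction import DictVectorizer

rows = [{"first": "Bruce", "last": "Lee"},
        {"first": "Sarah", "last": "Palin"},
        {"first": "Bruce", "last": "Willis"}]

dv = DictVectorizer()
X = dv.fit_transform(rows)  # sparse matrix with one column per key=value pair

print(dv.feature_names_)
# ['first=Bruce', 'first=Sarah', 'last=Lee', 'last=Palin', 'last=Willis']
print(X.toarray())
```

Numeric values, by contrast, are stored as-is in a single column named after the key.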
Finally, you can make a pipeline that combines these with a cross-validated logistic regression, and proceed as above:
clf = make_pipeline(ExtractNames(), DictVectorizer(), LogisticRegressionCV(cv=2))  # cv=2 as above: the toy training set is too small for the default number of folds
clf.fit(df_train['name'], df_train['gender'])
df_test['predicted'] = clf.predict(df_test['name'])
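From here, if you wish, you can modify the ExtractNames transformer to do more sophisticated feature extraction (using some of your code above), and you end up with a pipelined implementation of your procedure that lets you simply call predict() on an input list of strings. Hope that helps!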
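As a rough illustration of that last step, the sketch below folds the question's "last-name" and "name-entity" features into a transformer. ExtractNameFeatures is a hypothetical name, the feature logic only approximates the functions in the question, and a plain LogisticRegression stands in for the cross-validated one.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class ExtractNameFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical variant of ExtractNames using the question's feature names."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        rows = []
        for name in X:
            parts = str(name).split()
            last = parts[-1] if parts else ""
            rows.append({
                "first": parts[0] if parts else "?",
                # as in the question: reject last names of a single character
                "last-name": last if len(last) > 1 else "?",
                # number of whitespace-separated words in the name
                "name-entity": len(parts),
            })
        return rows


names = ["Bruce Lee", "Bruce Banner", "Bruce Springsteen", "Bruce Willis",
         "Sarah McLaughlin", "Sarah Silverman", "Sarah Palin", "Sarah Hyland"]
genders = ["Male"] * 4 + ["Female"] * 4

clf = make_pipeline(ExtractNameFeatures(), DictVectorizer(), LogisticRegression())
clf.fit(names, genders)

# A single name is passed as a one-element list.
print(clf.predict(["Bruce Wayne"])[0])
```

All feature logic lives inside the transformer, so training data, test data, and ad hoc single names go through exactly the same code path.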