How to store predicted classes matching the pre-vectorized X in Python Scikit-learn?

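I want to predict gender from names. Not just the name itself, but also features derived from it, such as extracting the "last name" as a feature. My code flow is: load the data into df > specify the lr classifier and the dv DictVectorizer > create features with functions > run the dict vectorization > train. I would like to do the following, but cannot find any resources on how to do it.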

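1) I want to add the predicted classes (0 and 1) back to the original dataset, or to a dataset where I can see both the name and the predicted gender class. Currently my y_test_predictions only corresponds to X_test, which is a sparse matrix.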

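2) How can I keep the trained classifier and use it to predict gender for a different dataset containing a bunch of names? How can I plug in a single name such as "Rick Grime" and have the classifier tell me the gender it predicts?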

I have done something similar with nltk, but I cannot find any example or reference for doing this in Scikit-learn.

Code:

    import pandas as pd
    from pandas import DataFrame, Series
    import numpy as np
    import re
    import random
    import time
    from random import randint
    import csv
    import sys
    from sklearn.metrics import classification_report
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.metrics import confusion_matrix as sk_confusion_matrix
    from sklearn.metrics import roc_curve, auc
    import matplotlib.pyplot as plt
    from sklearn.metrics import precision_recall_curve
    from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

    data = pd.read_csv("file.csv", header=0, encoding="utf-8")
    df = DataFrame(data)
    dv = DictVectorizer()
    lr = LogisticRegression()

    X = df.raw_name.values
    X2 = df.name.values
    y = df.gender.values

    def feature_full_name(nameString):
        try:
            full_name = nameString
            if len(full_name) > 1: # not accept name with only 1 character
                return full_name
            else: return '?'
        except: return '?'

    def feature_full_last_name(nameString):
        try:
            last_name = nameString.rsplit(None, 1)[-1]
            if len(last_name) > 1: # not accept name with only 1 character
                return last_name
            else: return '?'
        except: return '?'

    def feature_name_entity(nameString2):
        space = 0
        try:
            for i in nameString2:
                if i == ' ':
                    space += 1
            return space+1
        except: return 0

    my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
    my_dict2 = [{'name-entity': feature_name_entity(feature_full_name(i))} for i in X2]


    # Merge the two per-row feature dicts into one dict per name
    # (adding .items() results together only works in Python 2).
    all_dict = []
    for last_name_feats, entity_feats in zip(my_dict, my_dict2):
        temp_dict = dict(last_name_feats)
        temp_dict.update(entity_feats)
        all_dict.append(temp_dict)

    newX = dv.fit_transform(all_dict)

    X_train, X_test, y_train, y_test = train_test_split(newX, y, test_size=0.3)

    lr.fit(X_train, y_train)

    y_test_predictions = lr.predict(X_test)

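I would use some of scikit-learn's built-in tools to split the dataframe, vectorize the names, and predict the results. You can then add the predictions back to the test dataframe. For example, take a small set of names: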

    data = {'Bruce Lee': 'Male',
            'Bruce Banner': 'Male',
            'Bruce Springsteen': 'Male',
            'Bruce Willis': 'Male',
            'Sarah McLaughlin': 'Female',
            'Sarah Silverman': 'Female',
            'Sarah Palin': 'Female',
            'Sarah Hyland': 'Female'}

    import pandas as pd
    df = pd.DataFrame.from_dict(data, orient='index').reset_index()
    df.columns = ['name', 'gender']
    print(df)
                    name  gender
    0    Sarah Silverman  Female
    1        Sarah Palin  Female
    2  Bruce Springsteen    Male
    3       Bruce Banner    Male
    4          Bruce Lee    Male
    5       Sarah Hyland  Female
    6   Sarah McLaughlin  Female
    7       Bruce Willis    Male

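Now we can use scikit-learn's CountVectorizer to count the words in the names; this produces essentially the same output as what you did above, except that it does not apply your filtering on name length and so on. For ease of use, we will put it in a pipeline with a cross-validated logistic regression: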

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.pipeline import make_pipeline

    clf = make_pipeline(CountVectorizer(), LogisticRegressionCV(cv=2))
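
To see what the vectorizer actually extracts from a name, you can call its analyzer directly. This is only a small sketch with made-up example names: with default settings, CountVectorizer lowercases the text and keeps only tokens of two or more word characters, so a single-letter initial is dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# build_analyzer() returns the preprocessing + tokenizing function
# that the vectorizer applies to every input string.
analyze = vectorizer.build_analyzer()
tokens = analyze("Sarah J Palin")
print(tokens)  # ['sarah', 'palin']

# Fitting learns one column per distinct token.
X = vectorizer.fit_transform(["Bruce Lee", "Sarah J Palin"])
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)  # ['bruce', 'lee', 'palin', 'sarah']
print(X.shape)     # (2, 4)
```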

Now we can split the data into train/test sets, fit the pipeline, and assign the results:

    # sklearn.cross_validation was removed in scikit-learn 0.20;
    # train_test_split now lives in sklearn.model_selection.
    from sklearn.model_selection import train_test_split

    df_train, df_test = train_test_split(df, train_size=0.5, random_state=0)

    clf.fit(df_train['name'], df_train['gender'])

    df_test = df_test.copy() # so we can modify it
    df_test['predicted'] = clf.predict(df_test['name'])

    print(df_test)
                    name  gender predicted
    6   Sarah McLaughlin  Female    Female
    2  Bruce Springsteen    Male      Male
    1        Sarah Palin  Female    Female
    7       Bruce Willis    Male      Male
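
For the situation in the question, where the names have already been vectorized into a sparse matrix before splitting, one possible approach is to split an array of row positions alongside X and y, so that every prediction can be traced back to its row. The following is a minimal sketch rather than the only way to do it; the toy DataFrame and the names `positions`, `pos_test` and `results` are assumptions made for illustration, and a plain LogisticRegression is used for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "name": ["Bruce Lee", "Bruce Banner", "Bruce Springsteen", "Bruce Willis",
             "Sarah McLaughlin", "Sarah Silverman", "Sarah Palin", "Sarah Hyland"],
    "gender": ["Male"] * 4 + ["Female"] * 4,
})

# Row i of the sparse matrix corresponds to df.iloc[i].
X = CountVectorizer().fit_transform(df["name"])
positions = np.arange(len(df))

# train_test_split accepts any number of arrays and splits them consistently,
# so the row positions stay aligned with X and y.
X_train, X_test, y_train, y_test, pos_train, pos_test = train_test_split(
    X, df["gender"], positions,
    test_size=0.5, random_state=0, stratify=df["gender"],
)

model = LogisticRegression().fit(X_train, y_train)

# Look up the original rows for the test split and attach the predictions.
results = df.iloc[pos_test].copy()
results["predicted"] = model.predict(X_test)
print(results)
```

The same positions can be used to write the predictions into a new column of the full DataFrame instead of a copy, leaving the training rows empty.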

Similarly, we can pass a list of names to the pipeline and get predictions:

    >>> clf.predict(['Bruce Campbell', 'Sarah Roemer'])
    array(['Male', 'Female'], dtype=object)
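
To keep a trained classifier around and reuse it on other data, one common option is to serialize the whole fitted pipeline with Python's `pickle` module (joblib is another frequently used choice). The sketch below assumes toy training data, a hypothetical file name `gender_clf.pkl` in the system temp directory, and a plain LogisticRegression swapped in for brevity. Pickles should generally be loaded with the same scikit-learn version that created them, and only from sources you trust.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

names = ["Bruce Lee", "Bruce Banner", "Bruce Springsteen", "Bruce Willis",
         "Sarah McLaughlin", "Sarah Silverman", "Sarah Palin", "Sarah Hyland"]
genders = ["Male"] * 4 + ["Female"] * 4

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(names, genders)

# Hypothetical location for the saved model.
model_path = os.path.join(tempfile.gettempdir(), "gender_clf.pkl")
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

# Later (for example in another script): load and predict on new names.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

# A single name still has to be wrapped in a list; a bare string is rejected
# by CountVectorizer.
single = restored.predict(["Rick Grime"])
print(single)
```

Because neither "rick" nor "grime" appears in this toy training vocabulary, the prediction for that name rests on the model's intercept alone, so it is not meaningful here; with a realistic training set the same call applies unchanged.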

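If you want to perform more sophisticated logic in the text vectorization, you can create a custom transformer for your input data: a web search for "scikit-learn custom transformer" should turn up a good set of examples.


Edit: here is an example of using a custom transformer to generate dicts from the input names: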

    from sklearn.base import BaseEstimator, TransformerMixin

    # Inheriting from BaseEstimator as well keeps the transformer
    # compatible with pipelines in recent scikit-learn releases.
    class ExtractNames(TransformerMixin, BaseEstimator):
        def fit(self, X, y=None):
            # Stateless transformer: nothing to learn.
            return self

        def transform(self, X):
            return [{'first': name.split()[0],
                     'last': name.split()[-1]}
                    for name in X]

    trans = ExtractNames()

    >>> trans.fit_transform(df['name'])
    [{'first': 'Bruce', 'last': 'Springsteen'},
     {'first': 'Bruce', 'last': 'Banner'},
     {'first': 'Sarah', 'last': 'Hyland'},
     {'first': 'Sarah', 'last': 'Silverman'},
     {'first': 'Sarah', 'last': 'Palin'},
     {'first': 'Bruce', 'last': 'Lee'},
     {'first': 'Bruce', 'last': 'Willis'},
     {'first': 'Sarah', 'last': 'McLaughlin'}]

Now you can put this in a pipeline with a DictVectorizer to generate sparse features:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import make_pipeline

    pipe = make_pipeline(ExtractNames(), DictVectorizer())

    >>> pipe.fit_transform(df['name'])
    <8x10 sparse matrix of type '<class 'numpy.float64'>'
        with 16 stored elements in Compressed Sparse Row format>

Finally, you can make a pipeline that combines these with a cross-validated logistic regression, and proceed as above:

    # cv=2 as before: the training set has only two names per class.
    clf = make_pipeline(ExtractNames(), DictVectorizer(), LogisticRegressionCV(cv=2))
    clf.fit(df_train['name'], df_train['gender'])
    df_test['predicted'] = clf.predict(df_test['name'])

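From here, if you like, you can modify the ExtractNames transformer to do more sophisticated feature extraction (using some of your code above), and you end up with a pipeline implementation of your procedure that lets you simply call predict() on an input list of strings. Hope that helps!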
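
As a rough illustration of that last step, the sketch below folds features similar to the ones in the question (a length-filtered last name and a count of name parts) into a transformer. The names `name_features` and `NameFeatures` are hypothetical, the part count uses `split()` rather than counting spaces, and a plain LogisticRegression is used to keep the toy example small.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def name_features(name):
    """Build a feature dict for one name (hypothetical helper)."""
    parts = str(name).split()
    first = parts[0] if parts else "?"
    last = parts[-1] if parts else ""
    return {
        "first-name": first,
        # Mirror the question's rule: ignore last names of a single character.
        "last-name": last if len(last) > 1 else "?",
        # Number of whitespace-separated parts in the name.
        "name-entity": len(parts),
    }


class NameFeatures(TransformerMixin, BaseEstimator):
    """Stateless transformer: list of names -> list of feature dicts."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [name_features(name) for name in X]


names = ["Bruce Lee", "Bruce Banner", "Bruce Springsteen", "Bruce Willis",
         "Sarah McLaughlin", "Sarah Silverman", "Sarah Palin", "Sarah Hyland"]
genders = ["Male"] * 4 + ["Female"] * 4

# DictVectorizer one-hot encodes the string features and passes the
# numeric 'name-entity' count through as a single column.
clf = make_pipeline(NameFeatures(), DictVectorizer(), LogisticRegression())
clf.fit(names, genders)

predictions = clf.predict(["Bruce Campbell", "Sarah Roemer"])
print(predictions)
```

Keeping the feature logic in a plain function makes it easy to check on individual names before it goes into the pipeline.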