Creating dummy variables using Scikit-Learn's feature_extraction

My goal is to create dummy variables from a column using scikit-learn.

I have data like the following:

INDICATOR MATCHUP 
1         [   "APPLE",   "GRAPE" ]
1         [   "APPLE",   "GRAPE" ]
0         [   "GRAPE",   "BANANA" ]
0         [   "PEAR",   "ORANGE" ]
1         [   "ORANGE",   "APPLE" ]

As a dictionary, the data looks like this:

{'INDICATOR': [1, 1, 0, 0, 1],
 'MATCHUP': ['[   "APPLE",   "GRAPE" ]',
  '[   "APPLE",   "GRAPE" ]',
  '[   "GRAPE",   "BANANA" ]',
  '[   "PEAR",   "ORANGE" ]',
  '[   "ORANGE",   "APPLE" ]']}

I want to use scikit-learn's text TfidfVectorizer. Because of the nature of the pipeline I am building, I need to use this package.

Desired end result:

INDICATOR MATCHUP                    APPLE GRAPE BANANA PEAR ORANGE
1         [   "APPLE",   "GRAPE" ]   1     1     0      0    0 
1         [   "APPLE",   "GRAPE" ]   1     1     0      0    0
0         [   "GRAPE",   "BANANA" ]  0     1     1      0    0
0         [   "PEAR",   "ORANGE" ]   0     0     0      1    1
1         [   "ORANGE",   "APPLE" ]  1     0     0      0    1

I was able to do this successfully without scikit-learn (see below), but I now need to do it with this scikit-learn function.

import ast

df.join(df['MATCHUP'].map(ast.literal_eval).explode().str.get_dummies().groupby(level=0).sum())

I can't use that approach because this will eventually go into a ColumnTransformer, so I would appreciate a solution that uses scikit-learn.

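Using scikit-learn's built-in text processors is much easier than the manual approach you built yourself. They automatically cope with the fact that your "list" column is really a column of strings that merely look like lists: the default tokenizer ignores the non-alphanumeric characters (brackets, quotes, commas) and keeps only the words. I will also show the difference between the TfidfVectorizer you said you have to use and the CountVectorizer, which produces the output you gave.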
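To see why the brackets and quotes do no harm, here is a quick sketch that runs the vectorizer's default analyzer on one of your strings. It assumes the default settings (lowercasing on, default token pattern); the variable names are just for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenizing function
# that the vectorizer applies to every document
analyze = CountVectorizer().build_analyzer()

# Brackets, quotes and commas are not word characters, so they are dropped;
# the remaining tokens are lowercased by default
tokens = analyze('[   "APPLE",   "GRAPE" ]')
print(tokens)  # ['apple', 'grape']
```

Note that the default token pattern only keeps tokens of two or more word characters, so a single-letter entry would be silently dropped.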

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer as cv
from sklearn.feature_extraction.text import TfidfVectorizer as tv

df=pd.DataFrame({'INDICATOR': [1, 1, 0, 0, 1],
 'MATCHUP': ['[   "APPLE",   "GRAPE" ]',
  '[   "APPLE",   "GRAPE" ]',
  '[   "GRAPE",   "BANANA" ]',
  '[   "PEAR",   "ORANGE" ]',
  '[   "ORANGE",   "APPLE" ]']}).set_index('INDICATOR')

df

    MATCHUP
INDICATOR   
1   [ "APPLE", "GRAPE" ]
1   [ "APPLE", "GRAPE" ]
0   [ "GRAPE", "BANANA" ]
0   [ "PEAR", "ORANGE" ]
1   [ "ORANGE", "APPLE" ]

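It is worth noting the difference between the output you gave and the normal output of either vectorizer, which is a sparse matrix. Scikit-learn does not much care which form you pass along, but I will show both the native output and a DataFrame version.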

TL;DR: initialize a vectorizer, then call .fit_transform(df['MATCHUP']) on it.

# First, apply CountVectorizer; the counts are all 0 or 1 here because no fruit repeats within a row
tf=cv()
#we will print the human-friendly version of the sparse matrix for comparison
print(tf.fit_transform(df['MATCHUP']))

  (0, 0)    1
  (0, 2)    1
  (1, 0)    1
  (1, 2)    1
  (2, 2)    1
  (2, 1)    1
  (3, 4)    1
  (3, 3)    1
  (4, 0)    1
  (4, 3)    1

# Then convert to a dense array and build a DataFrame to show how it looks
# get_feature_names_out() requires scikit-learn >= 1.0 (older releases used get_feature_names())
count_df=pd.DataFrame(tf.fit_transform(df['MATCHUP']).toarray(), columns=tf.get_feature_names_out())
print(count_df)

    apple   banana  grape   orange  pear
0   1       0       1       0       0
1   1       0       1       0       0
2   0       1       1       0       0
3   0       0       0       1       1
4   1       0       0       1       0

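We can then do the same with tf-idf to show how its output differs from the count vectors, which are neither weighted by idf nor normalized (note that the pattern of nonzero entries is the same).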

tf=tv()
print(tf.fit_transform(df['MATCHUP']))

  (0, 2)    0.7071067811865476
  (0, 0)    0.7071067811865476
  (1, 2)    0.7071067811865476
  (1, 0)    0.7071067811865476
  (2, 1)    0.830880748357988
  (2, 2)    0.5564505207186616
  (3, 3)    0.6279137616509933
  (3, 4)    0.7782829228046183
  (4, 3)    0.7694470729725092
  (4, 0)    0.6387105775654869
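To make those weights less mysterious, here is a small sketch that reproduces row 2 (GRAPE, BANANA) by hand. It assumes the TfidfVectorizer defaults, i.e. smoothed idf computed as ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row; the document frequencies are read off the data above.

```python
import math

n_docs = 5
# Number of rows each term appears in (from the five MATCHUP rows)
doc_freq = {'banana': 1, 'grape': 3}

# Smoothed idf; each term occurs once in the row, so tf * idf == idf
idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in doc_freq.items()}

# L2-normalize the row so its weights have unit length
norm = math.sqrt(sum(v ** 2 for v in idf.values()))
weights = {term: v / norm for term, v in idf.items()}
print(weights)  # banana is about 0.8309, grape about 0.5565
```

The rarer term (banana, one row) ends up with the larger weight, which is exactly the difference from the plain counts.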

tfidf_df=pd.DataFrame(tf.fit_transform(df['MATCHUP']).toarray(), columns=tf.get_feature_names_out())
print(tfidf_df)

    apple       banana      grape       orange      pear
0   0.707107    0.000000    0.707107    0.000000    0.000000
1   0.707107    0.000000    0.707107    0.000000    0.000000
2   0.000000    0.830881    0.556451    0.000000    0.000000
3   0.000000    0.000000    0.000000    0.627914    0.778283
4   0.638711    0.000000    0.000000    0.769447    0.000000

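Then, to complete the view that matches the requested output, we join the results back onto the original data (note that this step is entirely unnecessary if you use the transformer inside a ColumnTransformer/pipeline).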

print(pd.concat([df.reset_index(),count_df], axis=1))

   INDICATOR  MATCHUP                  apple  banana  grape  orange  pear
0  1          [ "APPLE", "GRAPE" ]     1      0       1      0       0
1  1          [ "APPLE", "GRAPE" ]     1      0       1      0       0
2  0          [ "GRAPE", "BANANA" ]    0      1       1      0       0
3  0          [ "PEAR", "ORANGE" ]     0      0       0      1       1
4  1          [ "ORANGE", "APPLE" ]    1      0       0      1       0

print(pd.concat([df.reset_index(),tfidf_df], axis=1))

   INDICATOR  MATCHUP                  apple     banana    grape     orange    pear
0  1          [ "APPLE", "GRAPE" ]     0.707107  0.000000  0.707107  0.000000  0.000000
1  1          [ "APPLE", "GRAPE" ]     0.707107  0.000000  0.707107  0.000000  0.000000
2  0          [ "GRAPE", "BANANA" ]    0.000000  0.830881  0.556451  0.000000  0.000000
3  0          [ "PEAR", "ORANGE" ]     0.000000  0.000000  0.000000  0.627914  0.778283
4  1          [ "ORANGE", "APPLE" ]    0.638711  0.000000  0.000000  0.769447  0.000000