Extract feature importance of ngrams in tfidfvectorizer in SVC(kernel='linear') model

I would like to know what is causing this difference in output that should be identical. It is as if my program simply ignores the sorting function and the feature_names. Sorting coef_ is essential for me to find out which features actually contribute most to the prediction. I do get individual words out of vectorizer.get_feature_names, but not when it is called inside a loop or a function definition. Does anyone know what might be going on, or does anyone have another way to extract ngram feature weights together with their names for an SVC with kernel='linear'?

My code:

import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
# ItemSelector is a custom transformer (not shown) that selects a single DataFrame column by key

## load data features with removed columns based on numeric feature selection
df = pd.read_csv('preprocessed_data_all.csv', usecols=['normalized_fixed', 'TAG', 'DEP', 'level1', 'avg_wordlength',
 'lexical_variety',
 'avg_sentlength',
 'VBD_rel_cnt',
 'VBN_rel_cnt',
 'VBG_rel_cnt',
 'MD_rel_cnt',
 'np_rel_cnt',
 'clause_rel_cnt',
 'clause_rel_word_cnt'])

df = df.sample(n=2000)

# define X and y 
X = df.drop('level1', axis=1)
y = df.level1.values

## create pipeline for word unigrams
bow_pipe = Pipeline([
    ("text", ItemSelector(key="normalized_fixed")),
    ("bow_vec", TfidfVectorizer(analyzer='word', tokenizer=word_tokenize, binary=False, lowercase=True))
])

## create pipeline for pos tags
pos_pipe = Pipeline([
    ("pos", ItemSelector(key="TAG")),
    ("pos_vec", TfidfVectorizer(analyzer='word', tokenizer=word_tokenize, binary=False, lowercase=True))
])

## create pipeline for dependency tags
dep_pipe = Pipeline([
    ("dep", ItemSelector(key="DEP")),
    ("dep_vec", TfidfVectorizer(analyzer='word', tokenizer=word_tokenize, binary=False, lowercase=True))
])

# define classifier
svm = SVC(kernel='linear', class_weight='balanced')

# define pipeline for most important unigram extraction
pipe = Pipeline([
    ("feats", FeatureUnion([
        ("bow", bow_pipe),
        ("tag", pos_pipe),
        ("dep", dep_pipe)
    ])),
    ("clf", svm)
])

pipe.fit(X, y)

# display feature importance of BOW model
levels = df['level1'].unique()

for level in levels: 
    
    featuredf = pd.DataFrame()

    labelid = list(pipe.named_steps['clf'].classes_).index(level)
    feature_names = pipe.named_steps['feats'].transformer_list[0][1].named_steps['bow_vec'].get_feature_names()
    topn = sorted(zip(pipe.named_steps['clf'].coef_[labelid], feature_names))[-10:]

    for coef, feat in topn:
        featuredf = featuredf.append(pd.Series([level, feat, coef]), ignore_index = True)

    display(featuredf)

My output:

    0   1   2
0   A1  !   (0, 1834)\t-0.07826243560812945\n (0, 4347)\t-0.07826243560812945\n (0, 4760)\t-0.223132736239871\n (0, 5498)\t-0.07140284578344763\n (0, 6756)\t-0.16195282546411804\n (0, 8637)\t-0.06337764014791308\n (0, 8763)\t-0.07826243560812945\n (0, 9044)\t-0.08060172162144445\n (0, 901)\t-0.0026223432774063423\n (0, 5906)\t-0.16675468967573015\n (0, 6796)\t-0.04403627031278603\n (0, 8495)\t-0.2603055807883978\n (0, 8498)\t-0.17305812627971506\n (0, 8735)\t-0.34489400420874144\n (0, 9484)\t-0.11083343873432677\n (0, 2637)\t-0.18040783909656172\n (0, 2737)\t-0.5380874813828527\n (0, 3129)\t-0.013035612996414479\n (0, 3773)\t-0.08449907288128825\n (0, 4437)\t-0.013035612996414479\n (0, 4438)\t-0.026071225992828958\n (0, 5924)\t-0.013035612996414479\n (0, 7269)\t-0.3730438143689519\n (0, 7737)\t-0.705047869548585\n (0, 8722)\t-0.024098248030544236\n :\t:\n (0, 2842)\t0.026095216126881034\n (0, 2945)\t-0.08649110380251428\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.026095216126881034\n (0, 4711)\t0.025288162480521178\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t0.025288162480521178\n (0, 6865)\t0.023162287148712983\n (0, 6980)\t0.022121814035341927\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t0.024071227714247634\n (0, 7606)\t-0.3512603080433229\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t-0.005938550730073638\n (0, 8445)\t-0.06546829964399035\n (0, 8634)\t0.027135689240252094\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t0.026095216126881034\n (0, 9630)\t0.022781224878066095\n (0, 3739)\t0.03267210725715032
0   1   2
0   B1  !   (0, 353)\t-0.00449726057217602\n (0, 802)\t-0.05617787611978642\n (0, 973)\t-0.10543173735834135\n (0, 1847)\t-0.007780155148241354\n (0, 1989)\t-0.003934155442206846\n (0, 2017)\t-0.005086622578660749\n (0, 2204)\t-0.031113872051853505\n (0, 2405)\t-0.09318613349857544\n (0, 3024)\t-0.005086622578660749\n (0, 3283)\t-0.10089509076272042\n (0, 4556)\t-0.00449726057217602\n (0, 5175)\t-0.005086622578660749\n (0, 5454)\t-0.32011264216698354\n (0, 5724)\t-0.003934155442206846\n (0, 6015)\t-0.005086622578660749\n (0, 6330)\t-0.005086622578660749\n (0, 6473)\t-0.004194952284695256\n (0, 6534)\t-0.19221655261459114\n (0, 6582)\t-0.031591903060786936\n (0, 7980)\t-0.32174386411546047\n (0, 7992)\t-0.004825825736172337\n (0, 9514)\t-0.17326784128005032\n (0, 9556)\t-0.08135115057424913\n (0, 9654)\t-0.004194952284695256\n (0, 9746)\t-0.24722791235969363\n :\t:\n (0, 2550)\t0.02860215372933612\n (0, 2842)\t0.018006244391458114\n (0, 2945)\t-0.025529859622348973\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.018006244391458114\n (0, 4711)\t-0.012262917681600417\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t-0.05074854733491452\n (0, 6865)\t0.023162287148712983\n (0, 6980)\t-0.003648836702785919\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t0.024071227714247634\n (0, 7606)\t-0.20173441895833755\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t0.026095216126881034\n (0, 8445)\t-0.049447249347229494\n (0, 8634)\t0.027135689240252094\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t-0.032859710207201465\n (0, 9630)\t0.022781224878066095
0   1   2
0   A2  !   (0, 1510)\t-0.047241319436499236\n (0, 4554)\t-0.09138323899895806\n (0, 5454)\t-0.0062230357565567634\n (0, 7756)\t-0.061785302573242856\n (0, 281)\t-0.01653184338009155\n (0, 351)\t-0.01653184338009155\n (0, 450)\t-0.3274464832370879\n (0, 809)\t-0.013387638815271769\n (0, 2051)\t-0.014616379782250303\n (0, 2586)\t-0.01653184338009155\n (0, 2741)\t-0.15810190062867993\n (0, 3224)\t-0.06225932224260644\n (0, 3373)\t-0.12280247879038902\n (0, 3421)\t-0.015684237235273946\n (0, 3819)\t-0.01653184338009155\n (0, 3833)\t-0.3359646619748352\n (0, 4068)\t-0.015684237235273946\n (0, 4402)\t-0.07152757844346042\n (0, 4649)\t-0.3279430542171356\n (0, 5524)\t-0.0899771265578215\n (0, 5790)\t-0.3885263430136202\n (0, 7822)\t-0.059872091526754725\n (0, 505)\t-0.0711692477199759\n (0, 5724)\t-0.16023961429736408\n (0, 6286)\t-0.049366239531379814\n :\t:\n (0, 2550)\t0.02860215372933612\n (0, 2842)\t0.026095216126881034\n (0, 2945)\t-0.08531697506522545\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.026095216126881034\n (0, 4711)\t0.007931536811222977\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t-0.10456248357681736\n (0, 6865)\t-0.03968381644376268\n (0, 6980)\t-0.11581955114710678\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t-0.28868175238220484\n (0, 7606)\t-0.03988088938576907\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t0.026095216126881034\n (0, 8445)\t0.008341811597571965\n (0, 8634)\t0.027135689240252094\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t-0.0659929953660953\n (0, 9630)\t-0.05573508827608763
0   1   2
0   B2  !   (0, 1604)\t-0.5452053299558446\n (0, 1611)\t-0.14349203210584277\n (0, 1786)\t-0.07926751381540288\n (0, 4402)\t-0.061638227000430465\n (0, 4469)\t-0.18047516283733558\n (0, 4483)\t-0.12632546444958545\n (0, 7467)\t-0.1657501793150448\n (0, 7953)\t-0.2027592110690899\n (0, 7991)\t-0.0705445978132748\n (0, 9157)\t-0.1576966747397613\n (0, 9746)\t-0.13158162095004766\n (0, 776)\t-0.07804759361864515\n (0, 1432)\t-0.04319046246215665\n (0, 1630)\t-0.06742619934269474\n (0, 1903)\t-0.03634857244837165\n (0, 2742)\t-0.04319046246215665\n (0, 2816)\t-0.15050859335152222\n (0, 3562)\t-0.03940488059869191\n (0, 4318)\t-0.04097603902861966\n (0, 4490)\t-0.04319046246215665\n (0, 5187)\t-0.27333877764907855\n (0, 5252)\t-0.04319046246215665\n (0, 5551)\t-0.22302927657831634\n (0, 5790)\t-0.18300512305356684\n (0, 5852)\t-0.029396346557071712\n :\t:\n (0, 2550)\t0.02860215372933612\n (0, 2842)\t0.026095216126881034\n (0, 2945)\t-0.05047091757718261\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.026095216126881034\n (0, 4711)\t0.025288162480521178\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t0.025288162480521178\n (0, 6865)\t0.023162287148712983\n (0, 6980)\t-0.09066783596741043\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t0.024071227714247634\n (0, 7606)\t-0.08208799126770042\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t0.026095216126881034\n (0, 8445)\t-0.011737613363530505\n (0, 8634)\t-0.014588840340588008\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t0.026095216126881034\n (0, 9630)\t-0.05074822281767788
0   1   2
0   C1  !   (0, 244)\t-0.09884319674162795\n (0, 690)\t-0.1388650034822139\n (0, 960)\t-0.10605470450461775\n (0, 1373)\t-0.29793485494660743\n (0, 1584)\t-0.15220560572907585\n (0, 1603)\t-0.15220560572907585\n (0, 1604)\t-0.2943386139167361\n (0, 1638)\t-0.15220560572907585\n (0, 2080)\t-0.1444018536776252\n (0, 2680)\t-0.22203398402397742\n (0, 2722)\t-0.1388650034822139\n (0, 2774)\t-0.15220560572907585\n (0, 2822)\t-0.13106125143076322\n (0, 3071)\t-0.1444018536776252\n (0, 3265)\t-0.2691405776324631\n (0, 3627)\t-0.1444018536776252\n (0, 4014)\t-0.15220560572907585\n (0, 4073)\t-0.15220560572907585\n (0, 4247)\t-0.15220560572907585\n (0, 4659)\t-0.3056381346962476\n (0, 4726)\t-0.3044112114581517\n (0, 4868)\t-0.15220560572907585\n (0, 5014)\t-0.1388650034822139\n (0, 5074)\t-0.1444018536776252\n (0, 5450)\t-0.1865505674300888\n :\t:\n (0, 2550)\t0.02860215372933612\n (0, 2842)\t0.026095216126881034\n (0, 2945)\t0.020274287275611008\n (0, 3325)\t0.02860215372933612\n (0, 3495)\t0.02860215372933612\n (0, 3539)\t0.027135689240252094\n (0, 4142)\t0.02860215372933612\n (0, 4305)\t0.026095216126881034\n (0, 4711)\t0.025288162480521178\n (0, 5173)\t0.05720430745867224\n (0, 5745)\t0.08580646118800836\n (0, 6561)\t0.025288162480521178\n (0, 6865)\t0.023162287148712983\n (0, 6980)\t-0.09559883514855932\n (0, 7349)\t0.02860215372933612\n (0, 7498)\t0.02860215372933612\n (0, 7573)\t0.024071227714247634\n (0, 7606)\t-0.13840215323517818\n (0, 8034)\t0.02860215372933612\n (0, 8304)\t0.026095216126881034\n (0, 8445)\t0.02183231985676501\n (0, 8634)\t0.027135689240252094\n (0, 9268)\t0.02860215372933612\n (0, 9471)\t0.026095216126881034\n (0, 9630)\t-0.09844846169130346
0   1   2
0   C2  !   (0, 1510)\t-0.05959414411701482\n (0, 1925)\t-0.07930619936349027\n (0, 4554)\t-0.12239420823695288\n (0, 6337)\t-0.10751794751817559\n (0, 6919)\t-0.11905736382524099\n (0, 7940)\t-0.4509514511406135\n (0, 8674)\t-0.06634760477509609\n (0, 8876)\t-0.09955165597499022\n (0, 281)\t-0.11414308129642091\n (0, 351)\t-0.11414308129642091\n (0, 450)\t-0.6510470682457199\n (0, 2051)\t-0.2941575097638065\n (0, 2586)\t-0.11414308129642091\n (0, 2741)\t-0.2826990863413035\n (0, 3224)\t-0.19484491340690077\n (0, 3421)\t-0.15593785912375202\n (0, 3819)\t-0.11414308129642091\n (0, 3833)\t-0.5959237864651308\n (0, 4068)\t-0.17821843998174858\n (0, 7822)\t-0.18737390753407893\n (0, 505)\t-0.17144684026449575\n (0, 1605)\t-0.1761432003933918\n (0, 2087)\t-0.30641937334729397\n (0, 2435)\t-0.01959392434397809\n (0, 2544)\t-0.04205087201608662\n :\t:\n (0, 1543)\t-0.10270397939145476\n (0, 4091)\t0.03510805887202453\n (0, 4621)\t0.03774871894954156\n (0, 5548)\t0.11591078340050759\n (0, 7216)\t0.10996790758187949\n (0, 8462)\t0.11591078340050759\n (0, 6902)\t0.0953038712770817\n (0, 275)\t0.05201608398317632\n (0, 2309)\t-0.11518506286788033\n (0, 5602)\t0.130229016246211\n (0, 6856)\t0.028341275217697887\n (0, 9697)\t0.130229016246211\n (0, 5898)\t0.11137290054484401\n (0, 5921)\t0.09019079917603834\n (0, 6930)\t-0.41244426698962444\n (0, 7468)\t0.11137290054484401\n (0, 8776)\t0.08870699446777193\n (0, 981)\t0.1483908905002444\n (0, 2084)\t0.1483908905002444\n (0, 3159)\t0.02642985841317767\n (0, 3306)\t-0.02982547695328551\n (0, 3508)\t0.055716356911763666\n (0, 8305)\t0.025416449215215475\n (0, 8431)\t0.02642985841317767\n (0, 8682)\t0.027858178455881833

What the output should look like:

bs obećao -4.50534985071
bs pošto -4.50534985071
bs prava -4.50534985071
bs predstavlja -4.50534985071
bs prošlosedmičnom -4.50534985071
bs sjeveru -4.50534985071
bs taj -4.50534985071
bs vladavine -4.50534985071
bs će -4.50534985071
bs da -4.0998847426

pt teve -4.63472898823
pt tive -4.63472898823
pt todas -4.63472898823
pt vida -4.63472898823
pt de -4.22926388012
pt foi -4.22926388012
pt mais -4.22926388012
pt me -4.22926388012
pt as -3.94158180767
pt que -3.94158180767

This is related to the second answer to How to get most informative features for scikit-learn classifier for different class?, and the exact same problem is raised in the last comment under that second answer:

Amazing @alvas I tried the above function but the output looks like this:POS aaeguno móvil (0, 60) -0.0375375709849 (0, 300) -0.0375375709849 (0, 3279) -0.0375375709849 instead of returning the class, followed by the word and the float. Any idea of why this is happening?. Thanks! – newWithPython Mar 15 '15 at 0:45

But nobody ever replied to that comment, and my reputation is too low to ask for more information over there.
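My best guess so far, which I have not been able to confirm, is that the same thing is happening to me: when the pipeline is fitted on the sparse matrices produced by TfidfVectorizer, SVC(kernel='linear') exposes coef_ as a scipy sparse matrix, so coef_[labelid] is a 1 x n_features sparse row, and zip() then pairs that entire row with the first feature name instead of pairing individual weights with individual names, which would explain why whole sparse-matrix reprs show up in column 2 of my output above. On top of that, with more than two classes SVC trains one-vs-one classifiers, so as far as I understand the rows of coef_ do not line up with classes_ at all. A minimal, untested sketch of how I imagine densifying the row inside the loop would look (pipe, labelid and level refer to my code above):

import numpy as np
from scipy import sparse

clf = pipe.named_steps['clf']
feature_names = pipe.named_steps['feats'].transformer_list[0][1].named_steps['bow_vec'].get_feature_names()

row = clf.coef_[labelid]            # 1 x n_features; sparse when the model was fitted on sparse input
if sparse.issparse(row):
    row = row.toarray().ravel()     # flatten into a plain 1-D array of weights
else:
    row = np.asarray(row).ravel()

# zip() truncates to len(feature_names), i.e. the "bow" columns,
# which come first in the FeatureUnion
topn = sorted(zip(row, feature_names))[-10:]
for coef, feat in topn:
    print(level, feat, coef)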

This has already cost me a week and I really cannot spend much more time on it. It is the last piece of my thesis; it will not be perfect, but I just need to finish it and graduate. Any help would be greatly appreciated! Also let me know what I could add to make this question clearer; it is only my second or third question on this platform.

It turns out that using sklearn's LinearSVC() produces the correct output, so SVC(kernel='linear') apparently needs a different method for extracting ngram importances. I have simply switched to LinearSVC, since it improved my model overall anyway.
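In case it helps anyone else, below is roughly what my working extraction looks like with LinearSVC (a sketch of my setup, not a polished solution). LinearSVC trains one-vs-rest by default, so coef_ is a dense ndarray with one row per entry in classes_, and the original loop works almost unchanged; bow_pipe, pos_pipe, dep_pipe, X and y are the same as in my code above:

from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline, FeatureUnion

svm = LinearSVC(class_weight='balanced')

pipe = Pipeline([
    ("feats", FeatureUnion([
        ("bow", bow_pipe),
        ("tag", pos_pipe),
        ("dep", dep_pipe)
    ])),
    ("clf", svm)
])
pipe.fit(X, y)

clf = pipe.named_steps['clf']
# on newer scikit-learn versions this is get_feature_names_out()
feature_names = pipe.named_steps['feats'].transformer_list[0][1].named_steps['bow_vec'].get_feature_names()

for labelid, level in enumerate(clf.classes_):
    # coef_ is a dense (n_classes, n_features) array here, one row per class
    weights = clf.coef_[labelid]
    # zip() stops at len(feature_names), i.e. the "bow" block that comes first in the FeatureUnion
    topn = sorted(zip(weights, feature_names))[-10:]
    for coef, feat in topn:
        print(level, feat, coef)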