Associate Class With the Vectorizing Words

I'm doing TensorFlow classification on a list of tweets and their classes. The problem is that after splitting the tweets into words and vectorizing them with TF-IDF, the number of words is greater than the number of classes.

(DataFrame "example", imported from CSV):

   Class                 Tweet
0   1   ضميان قرب شفتك سيد الخود اخاف اموت فراق ما ابت...
1   5   بعد مرور اسبوع عاد صاحب المزرعه ليقول للديك : ...
2   1   انا لو ابتل على الطبخ والموالح ابرك لي من الحل...
3   1   انا اكثر انسان يصلح يقدم محاضرات عن "كيف تيأس ...
4   1   الاغنيه تخلص بس لمن اغنيها انا لا، ابتل اعيد و...
5   1   اللهم أهدني سُقيا من سمائك أبتل بها ولا أزل.

(Code converting the words to TF-IDF):

def text_cleaning(mess):
    # remove punctuation, then drop stopwords
    delpunc = [c for c in mess if c not in string.punctuation]
    delpunc = ''.join(delpunc)
    return [word for word in delpunc.split() if word.lower() not in stopwords]

# ==== Vectorization TF ====
bagow_transformer = CountVectorizer(analyzer=text_cleaning).fit(tweet['Tweet'][:10])
tweet_bagow = bagow_transformer.transform(tweet['Tweet'][:10])

# ==== Vectorization TF-IDF =====
tfidf_transformer = TfidfTransformer().fit(tweet_bagow)
tweet_tfidf = tfidf_transformer.transform(tweet_bagow)

If I print(tweet_tfidf), the output is:

To clarify the output format:

( Tweet ID, Word ID ) Word Weight

  (0, 141)  0.35476981351536396      
  (0, 91)   0.3015867532506004       
  (0, 84)   0.3015867532506004       
  (0, 82)   0.3015867532506004       
  (0, 77)   0.35476981351536396      
  (0, 76)   0.3015867532506004       
  (0, 69)   0.3015867532506004       
  (0, 36)   0.3015867532506004       
  (0, 25)   0.3015867532506004       
  (0, 11)   0.3015867532506004      
  (0, 5)    0.14366697931897693      
  (1, 142)  0.335452510590434        
  (1, 129)  0.335452510590434        
  (1, 125)  0.335452510590434       
  (1, 103)  0.2851652809360297       
  (1, 42)   0.335452510590434        
  (1, 41)   0.335452510590434        
  (1, 18)   0.335452510590434        
  (1, 14)   0.335452510590434        
  (1, 6)    0.335452510590434        
  (1, 5)    0.13584427723416684      
  (2, 119)  0.2504289625926897
  (2, 118)  0.2504289625926897
  (2, 117)  0.2504289625926897
  (2, 93)   0.2504289625926897
  : :
  (8, 62)   0.1770906272241602
  (8, 55)   0.3541812544483204
  (8, 51)   0.3541812544483204
  (8, 48)   0.1770906272241602
  (8, 43)   0.1770906272241602
  (8, 40)   0.1770906272241602
  (8, 39)   0.1770906272241602
  (8, 37)   0.1770906272241602
  (8, 35)   0.1770906272241602
  (8, 32)   0.1770906272241602
  (8, 24)   0.1770906272241602
  (8, 21)   0.1770906272241602
  (8, 5)    0.07171431872090847
  (9, 123)  0.29928865657458936
  (9, 114)  0.29928865657458936
  (9, 105)  0.29928865657458936
  (9, 100)  0.29928865657458936
  (9, 89)   0.29928865657458936
  (9, 59)   0.29928865657458936
  (9, 49)   0.29928865657458936
  (9, 20)   0.29928865657458936
  (9, 17)   0.29928865657458936
  (9, 15)   0.29928865657458936
  (9, 10)   0.29928865657458936
  (9, 5)    0.12119942451824135

type(tweet_tfidf) is:

scipy.sparse.csr.csr_matrix

In TensorFlow you should have training texts and training classes. I have the training texts, but I don't have training classes. I want a DataFrame that associates each word weight with the correct class, for example:

( Tweet ID, Word ID ) ... Word Weight ... Class

  (0, 141)  0.35476981351536396      1
  (0, 91)   0.3015867532506004       1
  (0, 84)   0.3015867532506004       1
  (0, 82)   0.3015867532506004       1
  (0, 77)   0.35476981351536396      1
  (0, 76)   0.3015867532506004       1
  (0, 69)   0.3015867532506004       1
  (0, 36)   0.3015867532506004       1
  (0, 25)   0.3015867532506004       1
  (0, 11)   0.3015867532506004       1
  (0, 5)    0.14366697931897693      1
  (1, 142)  0.335452510590434        5
  (1, 129)  0.335452510590434        5
  (1, 125)  0.335452510590434        5
  (1, 103)  0.2851652809360297       5
  (1, 42)   0.335452510590434        5
  (1, 41)   0.335452510590434        5
  (1, 18)   0.335452510590434        5
  (1, 14)   0.335452510590434        5
  (1, 6)    0.335452510590434        5
  (1, 5)    0.13584427723416684      5

This needs a bit of manipulation. You need -

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import string
import numpy as np

tweet = pd.read_csv('sample.csv', encoding="ISO-8859-1")
mess = ''
stopwords = []

def text_cleaning(mess):
    delpunc = [c for c in mess if c not in string.punctuation]
    delpunc = ''.join(delpunc)
    return [word for word in delpunc.split() if word.lower() not in stopwords]

# ==== Vectorization TF ====
bagow_transformer = CountVectorizer(analyzer=text_cleaning).fit(tweet['Tweet'][:10])
tweet_bagow = bagow_transformer.transform(tweet['Tweet'][:10])

# ==== Vectorization TF-IDF =====
tfidf_transformer = TfidfTransformer().fit(tweet_bagow)
tweet_tfidf = tfidf_transformer.transform(tweet_bagow)

ind_mapping = dict(zip(tweet.index, tweet.Class))
print(ind_mapping)

import scipy
I, J, V = scipy.sparse.find(tweet_tfidf)
print(pd.DataFrame([ [i,j,v,ind_mapping[i]] for i,j,v in zip(I,J,V)], columns=['row_index', 'column_index', 'tf_idf', 'class']))

Output

     row_index  column_index    tf_idf  class
0           5             0  0.339570      1
1           5             1  0.339570      1
2           5             2  0.339570      1
3           0             3  0.333333      1
4           2             4  0.283865      1
5           4             4  0.268247      1
6           2             5  0.346171      1
7           0             6  0.333333      1
8           1             7  0.353553      5
9           4             8  0.327125      1
10          4             9  0.327125      1
11          3            10  0.339570      1
12          4            11  0.327125      1
13          2            12  0.346171      1
14          0            13  0.333333      1
15          2            14  0.346171      1
16          5            15  0.339570      1
17          1            16  0.353553      5
18          0            17  0.333333      1
19          3            18  0.278453      1
20          4            18  0.268247      1
21          3            19  0.339570      1
22          4            20  0.327125      1
23          1            21  0.353553      5
24          5            22  0.339570      1
25          4            23  0.327125      1
26          3            24  0.339570      1
27          5            25  0.339570      1
28          0            26  0.333333      1
29          5            27  0.339570      1
30          0            28  0.333333      1
31          1            29  0.353553      5
32          1            30  0.353553      5
33          2            31  0.346171      1
34          3            32  0.339570      1
35          0            33  0.333333      1
36          0            34  0.333333      1
37          3            35  0.339570      1
38          4            36  0.327125      1
39          1            37  0.353553      5
40          4            38  0.327125      1
41          2            39  0.346171      1
42          2            40  0.346171      1
43          1            41  0.353553      5
44          0            42  0.333333      1
45          3            43  0.339570      1
46          1            44  0.353553      5
47          2            45  0.283865      1
48          5            45  0.278453      1
49          4            46  0.327125      1
50          2            47  0.346171      1
51          5            48  0.339570      1
52          3            49  0.339570      1
53          3            50  0.339570      1

Explanation

Create a mapping from index to class -
ind_mapping = dict(zip(tweet.index, tweet.Class))
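As a quick sanity check with the Class values from the question's example DataFrame (a toy stand-in here, not the real CSV), the mapping is simply index → class; note that `tweet['Class'].to_dict()` produces the same dictionary in one step:

```python
import pandas as pd

# Toy stand-in using the Class column from the example DataFrame
tweet = pd.DataFrame({'Class': [1, 5, 1, 1, 1, 1]})

# Map each row index to its class label
ind_mapping = dict(zip(tweet.index, tweet.Class))
```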

Get the row_index, column_index, and tf_idf values -

import scipy
I, J, V = scipy.sparse.find(tweet_tfidf)
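scipy.sparse.find works on any sparse format and returns three parallel arrays: one (row, column, value) triple per non-zero entry. A tiny self-contained illustration (toy matrix, standing in for tweet_tfidf):

```python
import numpy as np
import scipy.sparse

# 2x3 toy sparse matrix with three non-zero entries
m = scipy.sparse.csr_matrix(np.array([[0.0, 0.5, 0.0],
                                      [0.3, 0.0, 0.7]]))

# I = row indices, J = column indices, V = values, all aligned
I, J, V = scipy.sparse.find(m)
triples = sorted(zip(I.tolist(), J.tolist(), V.tolist()))
print(triples)  # [(0, 1, 0.5), (1, 0, 0.3), (1, 2, 0.7)]
```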

Convert the values and the mapping into a DataFrame -

print(pd.DataFrame([ [i,j,v,ind_mapping[i]] for i,j,v in zip(I,J,V)], columns=['row_index', 'column_index', 'tf_idf', 'class']))
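An equivalent route, if you prefer staying inside pandas, is to go through the COO view of the matrix and map the classes onto the row index with Series.map. A sketch with toy stand-ins for tweet_tfidf and tweet['Class'] (assumed shapes, not the real data):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-ins for tweet_tfidf and tweet['Class']
tfidf = csr_matrix(np.array([[0.0, 0.5, 0.0],
                             [0.3, 0.0, 0.7]]))
classes = pd.Series([1, 5])

coo = tfidf.tocoo()  # exposes .row, .col, .data as parallel arrays
df = pd.DataFrame({'row_index': coo.row,
                   'column_index': coo.col,
                   'tf_idf': coo.data})
df['class'] = df['row_index'].map(classes)  # look up class by tweet row
print(df)
```

For actually training a classifier, though, you usually don't need the exploded triples at all: the rows of tweet_tfidf already line up with tweet.index, so tweet_tfidf (or its dense .toarray()) and tweet['Class'][:10].values can be fed directly as X and y.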