Associate Class With the Vectorized Words
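I am doing TensorFlow classification on a list of tweets and their classes. The problem is that after splitting the tweets into words and vectorizing them with TF-IDF, the number of words is greater than the number of classes.
(DataFrame "example", imported from a CSV):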
Class Tweet
0 1 ضميان قرب شفتك سيد الخود اخاف اموت فراق ما ابت...
1 5 بعد مرور اسبوع عاد صاحب المزرعه ليقول للديك : ...
2 1 انا لو ابتل على الطبخ والموالح ابرك لي من الحل...
3 1 انا اكثر انسان يصلح يقدم محاضرات عن "كيف تيأس ...
4 1 الاغنيه تخلص بس لمن اغنيها انا لا، ابتل اعيد و...
5 1 اللهم أهدني سُقيا من سمائك أبتل بها ولا أزل.
(Code that converts the words to TF-IDF):
mess = ''
def text_cleaning(mess):
    # Remove punctuation, then split into words and drop stop words.
    delpunc = [c for c in mess if c not in string.punctuation]
    delpunc = ''.join(delpunc)
    return [word for word in delpunc.split() if word.lower() not in stopwords]
# ==== Vectorization TF ====
bagow_transformer = CountVectorizer(analyzer=text_cleaning).fit(tweet['Tweet'][:10])
tweet_bagow = bagow_transformer.transform(tweet['Tweet'][:10])
# ==== Vectorization TF-IDF =====
tfidf_transformer = TfidfTransformer().fit(tweet_bagow)
tweet_tfidf = tfidf_transformer.transform(tweet_bagow)
If I run print(tweet_tfidf), the output is (column labels added):
( Tweet ID, Word ID ) Word Weight
(0, 141) 0.35476981351536396
(0, 91) 0.3015867532506004
(0, 84) 0.3015867532506004
(0, 82) 0.3015867532506004
(0, 77) 0.35476981351536396
(0, 76) 0.3015867532506004
(0, 69) 0.3015867532506004
(0, 36) 0.3015867532506004
(0, 25) 0.3015867532506004
(0, 11) 0.3015867532506004
(0, 5) 0.14366697931897693
(1, 142) 0.335452510590434
(1, 129) 0.335452510590434
(1, 125) 0.335452510590434
(1, 103) 0.2851652809360297
(1, 42) 0.335452510590434
(1, 41) 0.335452510590434
(1, 18) 0.335452510590434
(1, 14) 0.335452510590434
(1, 6) 0.335452510590434
(1, 5) 0.13584427723416684
(2, 119) 0.2504289625926897
(2, 118) 0.2504289625926897
(2, 117) 0.2504289625926897
(2, 93) 0.2504289625926897
: :
(8, 62) 0.1770906272241602
(8, 55) 0.3541812544483204
(8, 51) 0.3541812544483204
(8, 48) 0.1770906272241602
(8, 43) 0.1770906272241602
(8, 40) 0.1770906272241602
(8, 39) 0.1770906272241602
(8, 37) 0.1770906272241602
(8, 35) 0.1770906272241602
(8, 32) 0.1770906272241602
(8, 24) 0.1770906272241602
(8, 21) 0.1770906272241602
(8, 5) 0.07171431872090847
(9, 123) 0.29928865657458936
(9, 114) 0.29928865657458936
(9, 105) 0.29928865657458936
(9, 100) 0.29928865657458936
(9, 89) 0.29928865657458936
(9, 59) 0.29928865657458936
(9, 49) 0.29928865657458936
(9, 20) 0.29928865657458936
(9, 17) 0.29928865657458936
(9, 15) 0.29928865657458936
(9, 10) 0.29928865657458936
(9, 5) 0.12119942451824135
type(tweet_tfidf)
is:
scipy.sparse.csr.csr_matrix
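In TensorFlow you should have both the training text and the training class. I have the training text, but I don't have the training class.
I want a DataFrame in which each word weight is associated with the correct class, for example: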
( Tweet ID, Word ID ) ... Word Weight ... Class
(0, 141) 0.35476981351536396 1
(0, 91) 0.3015867532506004 1
(0, 84) 0.3015867532506004 1
(0, 82) 0.3015867532506004 1
(0, 77) 0.35476981351536396 1
(0, 76) 0.3015867532506004 1
(0, 69) 0.3015867532506004 1
(0, 36) 0.3015867532506004 1
(0, 25) 0.3015867532506004 1
(0, 11) 0.3015867532506004 1
(0, 5) 0.14366697931897693 1
(1, 142) 0.335452510590434 5
(1, 129) 0.335452510590434 5
(1, 125) 0.335452510590434 5
(1, 103) 0.2851652809360297 5
(1, 42) 0.335452510590434 5
(1, 41) 0.335452510590434 5
(1, 18) 0.335452510590434 5
(1, 14) 0.335452510590434 5
(1, 6) 0.335452510590434 5
(1, 5) 0.13584427723416684 5
This takes a little manipulation. You need:
import string

import pandas as pd
import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

tweet = pd.read_csv('sample.csv', encoding="ISO-8859-1")
stopwords = []  # fill in with the stop words to drop

def text_cleaning(mess):
    # Remove punctuation, then split into words and drop stop words.
    delpunc = ''.join(c for c in mess if c not in string.punctuation)
    return [word for word in delpunc.split() if word.lower() not in stopwords]

# ==== Vectorization TF ====
bagow_transformer = CountVectorizer(analyzer=text_cleaning).fit(tweet['Tweet'][:10])
tweet_bagow = bagow_transformer.transform(tweet['Tweet'][:10])

# ==== Vectorization TF-IDF =====
tfidf_transformer = TfidfTransformer().fit(tweet_bagow)
tweet_tfidf = tfidf_transformer.transform(tweet_bagow)

# Map each row index to its class.
ind_mapping = dict(zip(tweet.index, tweet.Class))
print(ind_mapping)

# Row indices, column indices and values of the non-zero entries.
I, J, V = scipy.sparse.find(tweet_tfidf)
print(pd.DataFrame([[i, j, v, ind_mapping[i]] for i, j, v in zip(I, J, V)],
                   columns=['row_index', 'column_index', 'tf_idf', 'class']))
Output:
row_index column_index tf_idf class
0 5 0 0.339570 1
1 5 1 0.339570 1
2 5 2 0.339570 1
3 0 3 0.333333 1
4 2 4 0.283865 1
5 4 4 0.268247 1
6 2 5 0.346171 1
7 0 6 0.333333 1
8 1 7 0.353553 5
9 4 8 0.327125 1
10 4 9 0.327125 1
11 3 10 0.339570 1
12 4 11 0.327125 1
13 2 12 0.346171 1
14 0 13 0.333333 1
15 2 14 0.346171 1
16 5 15 0.339570 1
17 1 16 0.353553 5
18 0 17 0.333333 1
19 3 18 0.278453 1
20 4 18 0.268247 1
21 3 19 0.339570 1
22 4 20 0.327125 1
23 1 21 0.353553 5
24 5 22 0.339570 1
25 4 23 0.327125 1
26 3 24 0.339570 1
27 5 25 0.339570 1
28 0 26 0.333333 1
29 5 27 0.339570 1
30 0 28 0.333333 1
31 1 29 0.353553 5
32 1 30 0.353553 5
33 2 31 0.346171 1
34 3 32 0.339570 1
35 0 33 0.333333 1
36 0 34 0.333333 1
37 3 35 0.339570 1
38 4 36 0.327125 1
39 1 37 0.353553 5
40 4 38 0.327125 1
41 2 39 0.346171 1
42 2 40 0.346171 1
43 1 41 0.353553 5
44 0 42 0.333333 1
45 3 43 0.339570 1
46 1 44 0.353553 5
47 2 45 0.283865 1
48 5 45 0.278453 1
49 4 46 0.327125 1
50 2 47 0.346171 1
51 5 48 0.339570 1
52 3 49 0.339570 1
53 3 50 0.339570 1
Explanation
Create a mapping from row index to class:
ind_mapping = dict(zip(tweet.index, tweet.Class))
Get the row_index, column_index and tf_idf values:
import scipy.sparse
I, J, V = scipy.sparse.find(tweet_tfidf)
Convert the values and the mapping into a DataFrame:
print(pd.DataFrame([[i, j, v, ind_mapping[i]] for i, j, v in zip(I, J, V)], columns=['row_index', 'column_index', 'tf_idf', 'class']))
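The same table can probably be built without the Python-level loop and the dictionary. The sketch below is an alternative under stated assumptions, not the answer's own method: it converts the matrix to COO format and indexes a label array by row position. The toy DataFrame, the default analyzer, and the names labels and long_table are illustrative only. Using positions rather than tweet.index also avoids a mismatch if the DataFrame index is not 0..n-1.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical stand-in for the tweets; texts and labels are invented.
toy = pd.DataFrame({
    "Class": [1, 5, 1],
    "Tweet": ["red apple pie", "green apple", "red wine"],
})

counts = CountVectorizer().fit_transform(toy["Tweet"])
tfidf = TfidfTransformer().fit_transform(counts)

# Labels by position, so row i of tfidf pairs with labels[i].
labels = toy["Class"].to_numpy()
assert tfidf.shape[0] == len(labels)

# COO format exposes the row, column and value of every stored entry.
coo = tfidf.tocoo()
long_table = pd.DataFrame({
    "row_index": coo.row,
    "column_index": coo.col,
    "tf_idf": coo.data,
    "class": labels[coo.row],   # array indexing instead of a dict lookup
})
print(long_table)
```

The shape check in the middle is worth noting: the TF-IDF matrix has one row per tweet, so it already lines up with the Class column. A classifier that takes one feature vector per tweet can likely use the matrix and the label array directly, and the long table is mainly useful for inspection.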