How to transform the data and calculate the TFIDF value?
My data format is:
datas = [[1,2,4,6,7],[2,3],[5,6,8,3,5],[2],[93,23,4,5,11,3,5,2],...]
Each element of datas is a sentence, and each number is a word. I want to get the TFIDF value of every number. How can I do this with sklearn or in some other way?
My code:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
datas = [[1,2,4,6,7],[2,3],[5,6,8,3,5],[2],[93,23,4,5,11,3,5,2]]
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(datas))
print(tfidf)
My code doesn't work. The error is:
Traceback (most recent call last):
  File "C:/Users/zhuowei/Desktop/OpenNE-master/OpenNE-master/src/openne/buildTree.py", line 103, in <module>
    X = vectorizer.fit_transform(datas)
  File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'int' object has no attribute 'lower'
You are using CountVectorizer, which expects an iterable of strings. Something like:
datas = ['First sentence',
         'Second sentence',
         ...
         'Yet another sentence']
But your data is a list of lists, which is why the error occurs. You need to turn the inner lists into strings for CountVectorizer to work. You can do that like this:
datas = [' '.join(map(str, x)) for x in datas]
This will make datas look like this:
['1 2 4 6 7', '2 3', '5 6 8 3 5', '2', '93 23 4 5 11 3 5 2']
Now CountVectorizer can work with this form. But even then you will not get the results you want, because of the default token_pattern in CountVectorizer:
token_pattern : string, default: '(?u)\b\w\w+\b'
    Regular expression denoting what constitutes a "token", only used if
    analyzer == 'word'. The default regexp selects tokens of 2 or more
    alphanumeric characters (punctuation is completely ignored and always
    treated as a token separator).
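To see the effect, here is a quick check on the joined strings from above (a minimal sketch; it only inspects the fitted vocabulary):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['1 2 4 6 7', '2 3', '5 6 8 3 5', '2', '93 23 4 5 11 3 5 2']

# With the default token_pattern, tokens shorter than two characters are dropped,
# so every single-digit "word" silently disappears from the vocabulary.
default_vec = CountVectorizer()
default_vec.fit(docs)
print(sorted(default_vec.vocabulary_))  # ['11', '23', '93']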
To make it treat your numbers as words, you need to change the pattern so that it also accepts single characters as tokens:
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
Then it should work. But note that your numbers have now become strings.
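Putting both fixes together, a minimal end-to-end sketch; the final loop is just one illustrative way to read each number's TF-IDF weights back out per column:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

datas = [[1, 2, 4, 6, 7], [2, 3], [5, 6, 8, 3, 5], [2], [93, 23, 4, 5, 11, 3, 5, 2]]

# 1. Turn each inner list of numbers into a whitespace-separated string.
docs = [' '.join(map(str, x)) for x in datas]

# 2. Count tokens, allowing single-character tokens such as '2'.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs)

# 3. Convert the raw counts into TF-IDF weights.
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)

# Rows of `tfidf` are sentences; columns are the (string) numbers in vocabulary_.
for token, col in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print(token, tfidf[:, col].toarray().ravel())

Equivalently, TfidfVectorizer(token_pattern=r"(?u)\b\w+\b") combines the counting and the TF-IDF weighting in a single step.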