What is the tfidf matrix giving, ideally?

When I run tfidf for a set of documents, it returns me a tfidf matrix that looks like this:

  (1, 12)   0.656240233446
  (1, 11)   0.754552023393
  (2, 6)    1.0
  (3, 13)   1.0
  (4, 2)    1.0
  (7, 9)    1.0
  (9, 4)    0.742540927053
  (9, 5)    0.66980069547
  (11, 19)  0.735138466738
  (11, 7)   0.677916982176
  (12, 18)  1.0
  (13, 14)  0.697455191865
  (13, 11)  0.716628394177
  (14, 5)   1.0
  (15, 8)   1.0
  (16, 17)  1.0
  (18, 1)   1.0
  (19, 17)  1.0
  (22, 13)  1.0
  (23, 3)   1.0
  (25, 6)   1.0
  (26, 19)  0.476648253537
  (26, 7)   0.879094103268
  (28, 10)  0.532672175403
  (28, 7)   0.523456282204

I would like to know what this is, and I am unable to understand how it is produced. When I was in debug mode I came across indices, indptr and data; these things seem to correlate with the data given. What are they? There is a lot of confusion with the numbers: if I say the first element in the brackets is the document, going by my guess, then I don't see the 0th, 5th and 6th documents. Please help me figure out how it works here. I do know the general working of tfidf from the wiki, taking the log of the inverse document frequency and so on. I just want to know what these 3 different kinds of numbers are and what they refer to.
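For reference, here is a minimal sketch with a made-up toy corpus (not my real documents) of where those data, indices and indptr attributes come from; the matrix returned by TfidfVectorizer is a scipy.sparse CSR matrix:

# Minimal sketch with an invented corpus, just to see what
# data / indices / indptr correspond to in the debugger.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]   # hypothetical documents
vec = TfidfVectorizer()
X = vec.fit_transform(docs)    # X is a scipy.sparse CSR matrix

print(X)            # the "(row, col)  value" triples, like the output above
print(X.data)       # flat array of the non-zero tf-idf values
print(X.indices)    # column (term) index of each value in X.data
print(X.indptr)     # start/end offsets of each row's slice in X.data / X.indices
print(X.toarray())  # full dense documents x terms matrix, zeros included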

The source code is:

# Imports and module-level globals assumed here (not shown in the original snippet):
import re
import nltk
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem.snowball import SnowballStemmer
import joblib   # on older scikit-learn: from sklearn.externals import joblib

_stemmer = SnowballStemmer("english")   # assumed stemmer
_path = ""                              # assumed folder containing the documents
_totalvocab_stemmed = []
_totalvocab_tokenized = []

#This contains the list of file names
_filenames = []
#This contains the list of contents/text in the files
_contents = []
#This is a dict of filename:content
_file_contents = {}
class KmeansClustering():

    def kmeansClusters(self):
        global _report
        self.num_clusters = 5
        km = KMeans(n_clusters=self.num_clusters)
        vocab_frame = TokenizingAndPanda().createPandaVocabFrame()
        self.tfidf_matrix, self.terms, self.dist = TfidfProcessing().getTfidFPropertyData()
        km.fit(self.tfidf_matrix)
        self.clusters = km.labels_.tolist()
        joblib.dump(km, 'doc_cluster2.pkl')
        km = joblib.load('doc_cluster2.pkl')

class TokenizingAndPanda():

    def tokenize_only(self,text):
        '''
        This function tokenizes the text.
        :param text: the text that you want to tokenize
        :return: the filtered tokens
        '''
        # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        return filtered_tokens

    def tokenize_and_stem(self,text):
        '''
        This function tokenizes the text and stems each token.
        :param text: the text that you want to tokenize and stem
        :return: the stemmed tokens
        '''
        # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        stems = [_stemmer.stem(t) for t in filtered_tokens]
        return stems

    def getFilnames(self):
        '''
        Reads all the file names from the global _path and stores them in _filenames.
        :return: None
        '''
        global _path
        global _filenames
        path = _path
        _filenames = FileAccess().read_all_file_names(path)


    def getContentsForFilenames(self):
        global _contents
        global _file_contents
        for filename in _filenames:
            content = FileAccess().read_the_contents_from_files(_path, filename)
            _contents.append(content)
            _file_contents[filename] = content

    def createPandaVocabFrame(self):
        global _totalvocab_stemmed
        global _totalvocab_tokenized
        #Enable this if you want to load the filenames and contents from a file structure.
        # self.getFilnames()
        # self.getContentsForFilenames()

        # for name, i in _file_contents.items():
        #     print(name)
        #     print(i)
        for i in _contents:
            allwords_stemmed = self.tokenize_and_stem(i)
            _totalvocab_stemmed.extend(allwords_stemmed)

            allwords_tokenized = self.tokenize_only(i)
            _totalvocab_tokenized.extend(allwords_tokenized)
        vocab_frame = pd.DataFrame({'words': _totalvocab_tokenized}, index=_totalvocab_stemmed)
        print(vocab_frame)
        return vocab_frame


class TfidfProcessing():

    def getTfidFPropertyData(self):
        tfidf_vectorizer = TfidfVectorizer(max_df=0.4, max_features=200000,
                                           min_df=0.02, stop_words='english',
                                           use_idf=True, tokenizer=TokenizingAndPanda().tokenize_and_stem, ngram_range=(1, 1))
        # print(_contents)
        tfidf_matrix = tfidf_vectorizer.fit_transform(_contents)
        terms = tfidf_vectorizer.get_feature_names()
        dist = 1 - cosine_similarity(tfidf_matrix)

        return tfidf_matrix, terms, dist

The result of applying tfidf to your data is usually a 2D matrix A, where A_ij is the normalized frequency of the j-th term (word) in the i-th document. What you see in your output is a sparse representation of this matrix; in other words, only the non-zero elements are printed out, so:

(1, 12) 0.656240233446

means that word 12 (according to some vocabulary built by sklearn) has a normalized frequency of 0.656240233446 in document 1. The "missing" entries are zeros, which means, for example, that word 3 cannot be found in document 1 (since there is no (1, 3) entry), and so on.
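If it helps, here is a tiny sketch (with a made-up corpus, not your data) of how to translate these (row, col) pairs back into (document, word) pairs using the vocabulary the vectorizer learned:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple banana", "banana cherry", "cherry apple apple"]   # invented corpus
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
terms = vec.get_feature_names()   # get_feature_names_out() on newer scikit-learn

# Walk the non-zero entries and map each column index back to its word.
coo = X.tocoo()
for doc, col, value in zip(coo.row, coo.col, coo.data):
    print("doc %d, term %r, tf-idf %.3f" % (doc, terms[col], value))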

The fact that some documents are missing is a result of your particular code/data (which you did not include). Maybe you set the vocabulary manually, or limited the maximum number of features considered? There are many parameters in TfidfVectorizer that could cause this, but without your exact code (and some sample data) nothing more can be said. For example, setting min_df could cause it (since it drops very rare words), and similarly max_features (the same effect).
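For illustration, here is a small sketch (the corpus and threshold are invented) of how min_df alone can leave a document with an all-zero row; such a row simply never shows up when the sparse matrix is printed, which is why some document indices are missing from your output:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "rareword",                      # its only term is dropped by min_df
    "common words appear here",
    "common words appear here too",
]
vec = TfidfVectorizer(min_df=2)      # keep only terms that occur in >= 2 documents
X = vec.fit_transform(docs)

print(vec.get_feature_names())       # "rareword" and "too" are gone from the vocabulary
print(X.toarray())                   # row 0 is all zeros, so print(X) shows no (0, j) entries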