TensorFlow VocabularyProcessor
I am following the WildML blog on text classification using TensorFlow. I am not able to understand the purpose of max_document_length in this code statement:

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

Also, how can I extract the vocabulary from vocab_processor?
I have figured out how to extract the vocabulary from the VocabularyProcessor object. This works well for me.
import numpy as np
from tensorflow.contrib import learn

x_text = ['This is a cat', 'This must be boy', 'This is a a dog']

max_document_length = max([len(x.split(" ")) for x in x_text])

## Create the VocabularyProcessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

## Transform the documents using the vocabulary.
x = np.array(list(vocab_processor.fit_transform(x_text)))

## Extract the word:id mapping from the object.
vocab_dict = vocab_processor.vocabulary_._mapping

## Sort the vocabulary dictionary on the basis of values (id).
## Both statements perform the same task:
# import operator; sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key=lambda x: x[1])

## Treat the ids as indexes into a list and create a list of words in ascending order of id.
## The word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])

print(vocabulary)
print(x)
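For the three example sentences above, the printed output should look roughly like the following (ids are assigned in order of first appearance, and id 0 is reserved by default for the unknown/padding token <UNK>, so treat this as an illustration rather than a guaranteed result):

['<UNK>', 'This', 'is', 'a', 'cat', 'must', 'be', 'boy', 'dog']
[[1 2 3 4 0]
 [1 5 6 7 0]
 [1 2 3 3 8]]

The shorter documents are padded with 0 up to max_document_length, which is 5 here (the length of the longest document, 'This is a a dog').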
"not able to understand the purpose of max_document_length"

VocabularyProcessor maps your text documents into vectors, and you need those vectors to be of a consistent length.
Your input data records may not (or probably will not) all have the same length. For example, if you are working with sentences for sentiment analysis, they will be of various lengths.

You provide this parameter to VocabularyProcessor so that it can adjust the length of the output vectors. According to the documentation,
max_document_length: Maximum length of documents. if documents are longer, they will be trimmed, if shorter - padded.
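A minimal sketch of that trimming/padding behavior (the two documents here are made up for illustration):

import numpy as np
from tensorflow.contrib import learn

docs = ['one two three four', 'one two']

## Force every output vector to have exactly 3 ids.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length=3)
x = np.array(list(vocab_processor.fit_transform(docs)))

print(x)
## Expected along the lines of:
## [[1 2 3]   <- 'four' was trimmed off
##  [1 2 0]]  <- padded with 0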
Have a look at the source code:
def transform(self, raw_documents):
  """Transform documents to word-id matrix.

  Convert words to ids with vocabulary fitted with fit or the one
  provided in the constructor.

  Args:
    raw_documents: An iterable which yield either str or unicode.

  Yields:
    x: iterable, [n_samples, max_document_length]. Word-id matrix.
  """
  for tokens in self._tokenizer(raw_documents):
    word_ids = np.zeros(self.max_document_length, np.int64)
    for idx, token in enumerate(tokens):
      if idx >= self.max_document_length:
        break
      word_ids[idx] = self.vocabulary_.get(token)
    yield word_ids
Note the line word_ids = np.zeros(self.max_document_length). Each row in the raw_documents variable will be mapped to a vector of length max_document_length.
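As a follow-up: the processor also keeps a reverse (id -> word) mapping, which avoids reaching into the private _mapping attribute. A small sketch, assuming the vocab_processor and x built in the snippet above and the reverse method from the contrib source:

## Map a row of ids back to tokens; id 0 reverses to the '<UNK>' padding token.
row = x[0]
tokens = [vocab_processor.vocabulary_.reverse(word_id) for word_id in row]
print(tokens)  ## e.g. ['This', 'is', 'a', 'cat', '<UNK>']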