text2vec 的 vocab_vectorizer 输出是函数本身
text2vec's vocab_vectorizer ouput is the function itself
我正在尝试 运行 通过 this 页面上 text2vec
的示例。但是,每当我尝试查看 vocab_vectorizer
函数返回的内容时,它只是函数本身的输出。在我多年的 R 编码中,我以前从未见过这个,但它也感觉很时髦,可以扩展到这个功能之外。有什么指点吗?
> library(data.table)
> data("movie_review")
> setDT(movie_review)
> setkey(movie_review, id)
> set.seed(2016L)
> all_ids <- movie_review$id
> train_ids <- sample(all_ids, 4000)
> test_ids <- setdiff(all_ids, train_ids)
> train <- movie_review[J(train_ids)]
> test <- movie_review[J(test_ids)]
>
> prep_fun <- tolower
> tok_fun <- word_tokenizer
>
> it_train <- itoken(train$review,
+ preprocessor = prep_fun,
+ tokenizer = tok_fun,
+ ids = train$id,
+ progressbar = FALSE)
> vocabulary <- create_vocabulary(it_train)
>
> vec <- text2vec::vocab_vectorizer(vocabulary = vocabulary)
> vec
function (iterator, grow_dtm, skip_grams_window_context, window_size,
weights, binary_cooccurence = FALSE)
{
vocab_corpus_ptr = cpp_vocabulary_corpus_create(vocabulary$term,
attr(vocabulary, "ngram")[[1]], attr(vocabulary, "ngram")[[2]],
attr(vocabulary, "stopwords"), attr(vocabulary, "sep_ngram"))
setattr(vocab_corpus_ptr, "ids", character(0))
setattr(vocab_corpus_ptr, "class", "VocabCorpus")
corpus_insert(vocab_corpus_ptr, iterator, grow_dtm, skip_grams_window_context,
window_size, weights, binary_cooccurence)
}
<bytecode: 0x7f9c2e3f7380>
<environment: 0x7f9c18970970>
>
vocab_vectorizer 的输出应该是一个函数。我 运行 文档中示例中的函数如下:
data("movie_review")
N = 100
vectorizer = hash_vectorizer(2 ^ 18, c(1L, 2L))
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
tokenizer = word_tokenizer, n_chunks = 10)
hash_dtm = create_dtm(it, vectorizer)
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
tokenizer = word_tokenizer, n_chunks = 10)
v = create_vocabulary(it, c(1L, 1L) )
vectorizer = vocab_vectorizer(v)
vocab_vectorizer的输出:
> vectorizer
function (iterator, grow_dtm, skip_grams_window_context, window_size,
weights, binary_cooccurence = FALSE)
{
vocab_corpus_ptr = cpp_vocabulary_corpus_create(vocabulary$term,
attr(vocabulary, "ngram")[[1]], attr(vocabulary,
"ngram")[[2]], attr(vocabulary, "stopwords"),
attr(vocabulary, "sep_ngram"))
setattr(vocab_corpus_ptr, "ids", character(0))
setattr(vocab_corpus_ptr, "class", "VocabCorpus")
corpus_insert(vocab_corpus_ptr, iterator, grow_dtm, skip_grams_window_context,
window_size, weights, binary_cooccurence)
}
<bytecode: 0x00000147ada65218>
<environment: 0x00000147b2a6dc38>
文档中提到"It supposed to be used only as argument to create_dtm, create_tcm, create_vocabulary".
最后,当我 运行 create_dtm(it, vectorizer) 时,我得到了输出
> create_dtm(it, vectorizer)
100 x 5356 sparse Matrix of class "dgCMatrix"
[[ suppressing 52 column names ‘0.3’, ‘02’, ‘10,000,000’ ... ]]
[[ suppressing 52 column names ‘0.3’, ‘02’, ‘10,000,000’ ... ]]
1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . ......
5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . ......
6 . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . ......
10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
..............................
........suppressing 5304 columns and 81 rows in show(); maybe adjust 'options(max.print= *, width = *)'
..............................
[[ suppressing 52 column names ‘0.3’, ‘02’, ‘10,000,000’ ... ]]
92 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
93 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
94 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . ......
96 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . ......
97 . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
98 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
我希望这能回答你。
我正在尝试 运行 通过 this 页面上 text2vec
的示例。但是,每当我尝试查看 vocab_vectorizer
函数返回的内容时,它只是函数本身的输出。在我多年的 R 编码中,我以前从未见过这个,但它也感觉很时髦,可以扩展到这个功能之外。有什么指点吗?
> library(data.table)
> data("movie_review")
> setDT(movie_review)
> setkey(movie_review, id)
> set.seed(2016L)
> all_ids <- movie_review$id
> train_ids <- sample(all_ids, 4000)
> test_ids <- setdiff(all_ids, train_ids)
> train <- movie_review[J(train_ids)]
> test <- movie_review[J(test_ids)]
>
> prep_fun <- tolower
> tok_fun <- word_tokenizer
>
> it_train <- itoken(train$review,
+ preprocessor = prep_fun,
+ tokenizer = tok_fun,
+ ids = train$id,
+ progressbar = FALSE)
> vocabulary <- create_vocabulary(it_train)
>
> vec <- text2vec::vocab_vectorizer(vocabulary = vocabulary)
> vec
function (iterator, grow_dtm, skip_grams_window_context, window_size,
weights, binary_cooccurence = FALSE)
{
vocab_corpus_ptr = cpp_vocabulary_corpus_create(vocabulary$term,
attr(vocabulary, "ngram")[[1]], attr(vocabulary, "ngram")[[2]],
attr(vocabulary, "stopwords"), attr(vocabulary, "sep_ngram"))
setattr(vocab_corpus_ptr, "ids", character(0))
setattr(vocab_corpus_ptr, "class", "VocabCorpus")
corpus_insert(vocab_corpus_ptr, iterator, grow_dtm, skip_grams_window_context,
window_size, weights, binary_cooccurence)
}
<bytecode: 0x7f9c2e3f7380>
<environment: 0x7f9c18970970>
>
vocab_vectorizer 的输出应该是一个函数。我 运行 文档中示例中的函数如下:
data("movie_review")
N = 100
vectorizer = hash_vectorizer(2 ^ 18, c(1L, 2L))
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
tokenizer = word_tokenizer, n_chunks = 10)
hash_dtm = create_dtm(it, vectorizer)
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
tokenizer = word_tokenizer, n_chunks = 10)
v = create_vocabulary(it, c(1L, 1L) )
vectorizer = vocab_vectorizer(v)
vocab_vectorizer的输出:
> vectorizer
function (iterator, grow_dtm, skip_grams_window_context, window_size,
weights, binary_cooccurence = FALSE)
{
vocab_corpus_ptr = cpp_vocabulary_corpus_create(vocabulary$term,
attr(vocabulary, "ngram")[[1]], attr(vocabulary,
"ngram")[[2]], attr(vocabulary, "stopwords"),
attr(vocabulary, "sep_ngram"))
setattr(vocab_corpus_ptr, "ids", character(0))
setattr(vocab_corpus_ptr, "class", "VocabCorpus")
corpus_insert(vocab_corpus_ptr, iterator, grow_dtm, skip_grams_window_context,
window_size, weights, binary_cooccurence)
}
<bytecode: 0x00000147ada65218>
<environment: 0x00000147b2a6dc38>
文档中提到"It supposed to be used only as argument to create_dtm, create_tcm, create_vocabulary".
最后,当我 运行 create_dtm(it, vectorizer) 时,我得到了输出
> create_dtm(it, vectorizer)
100 x 5356 sparse Matrix of class "dgCMatrix"
[[ suppressing 52 column names ‘0.3’, ‘02’, ‘10,000,000’ ... ]]
[[ suppressing 52 column names ‘0.3’, ‘02’, ‘10,000,000’ ... ]]
1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . ......
5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . ......
6 . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . ......
10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
..............................
........suppressing 5304 columns and 81 rows in show(); maybe adjust 'options(max.print= *, width = *)'
..............................
[[ suppressing 52 column names ‘0.3’, ‘02’, ‘10,000,000’ ... ]]
92 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
93 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
94 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . ......
96 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . ......
97 . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
98 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
99 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
我希望这能回答你。