Can I use a different corpus for fasttext build_vocab than train in Gensim Fasttext?
I'm curious to know whether there is any implication of using a different source when calling build_vocab
and train
on a Gensim FastText
model. Will this impact the contextual representation of the word embeddings?
My intention for doing this is that there is a specific set of words whose vector representations I am interested in when calling model.wv.most_similar
. I only want words defined in this vocab list to be returned, rather than every possible word in the training corpus. I would then use these results to decide whether to group the words as related to each other, based on a similarity threshold.
Following is the code snippet I am using; I'd appreciate your thoughts on any concerns or implications with this approach.
- vocab.txt contains a list of the unique words of interest
- corpus.txt contains the full conversation text (i.e. chat messages), where each line represents one paragraph/sentence per chat
A follow-up question to this: what values should I set for total_examples
and total_words
during training in this case?
from gensim.models.fasttext import FastText
model = FastText(min_count=1, vector_size=300)
corpus_path = f'data/{client}-corpus.txt'
vocab_path = f'data/{client}-vocab.txt'
# Unsure if below counts should be based on the training corpus or vocab
corpus_count = get_lines_count(corpus_path)
total_words = get_words_count(corpus_path)
# build the vocabulary
model.build_vocab(corpus_file=vocab_path)
# train the model
model.train(corpus_file=corpus_path, epochs=100,
            total_examples=corpus_count, total_words=total_words)
# save the model
model.save(f'models/gensim-fastext-model-{client}')
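For reference, get_lines_count and get_words_count are just simple helpers that count lines and whitespace-separated tokens in a file; a simplified sketch of them (assuming plain UTF-8 text) would be:
def get_lines_count(path):
    # number of lines, i.e. sentences/paragraphs, in the file
    with open(path, encoding='utf-8') as f:
        return sum(1 for _ in f)

def get_words_count(path):
    # total number of whitespace-separated tokens in the file
    with open(path, encoding='utf-8') as f:
        return sum(len(line.split()) for line in f)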
In case anyone has a similar question, I'm posting the reply I received when I asked this in the Gensim Discussion Group, for reference:
You can try it, but I wouldn't expect it to work well for most
purposes.
The build_vocab()
call establishes the known vocabulary of the
model, & caches some stats about the corpus.
If you then supply another corpus – & especially one with more words
– then:
- You'll want your
train()
parameters to reflect the actual size of your training corpus. You'll want to provide a true total_examples
and total_words
count that are accurate for the training-corpus.
- Every word in the training corpus that's not in the known vocabulary is ignored completely, as if it wasn't even there. So you might as
well filter your corpus down to just the words-of-interest first, then
use that same filtered corpus for both steps. Will the example texts
still make sense? Will that be enough data to train meaningful,
generalizable word-vectors for just the words-of-interest, alongside
other words-of-interest, without the full texts? (You could look at
your pre-filtered corpus to get a sense of that.) I'm not sure - it
could depend on how severely trimming to just the words-of-interest
changed the corpus. In particular, to train high-dimensional dense
vectors – as with
vector_size=300
– you need a lot of varied data.
Such pre-trimming might thin the corpus so much as to make the
word-vectors for the words-of-interest far less useful.
You could certainly try it both ways – pre-filtered to just your
words-of-interest, or with the full original corpus – and see which
works better on downstream evaluations.
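A minimal sketch of the pre-filtering idea above (assuming whitespace-tokenized text and the vocab/corpus paths from the question; not code from the reply itself):
from gensim.models.fasttext import FastText

# vocab_path, corpus_path and client as defined in the question's snippet
with open(vocab_path, encoding='utf-8') as f:
    words_of_interest = set(line.strip() for line in f if line.strip())

# keep only the words-of-interest in each chat line; drop lines that end up empty
filtered_path = f'data/{client}-filtered-corpus.txt'
with open(corpus_path, encoding='utf-8') as src, \
     open(filtered_path, 'w', encoding='utf-8') as dst:
    for line in src:
        kept = [w for w in line.split() if w in words_of_interest]
        if kept:
            dst.write(' '.join(kept) + '\n')

# use the same filtered corpus for both steps; reuse the counts gensim cached at build_vocab time
model = FastText(vector_size=300)
model.build_vocab(corpus_file=filtered_path)
model.train(corpus_file=filtered_path, epochs=model.epochs,
            total_examples=model.corpus_count, total_words=model.corpus_total_words)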
More generally, if the concern is training time with the full corpus,
there are likely other ways to get an adequate model in an acceptable
amount of time.
If using corpus_file
mode, you can increase workers
to equal the
local CPU core count for a nearly-linear speedup from number of cores.
(In traditional corpus_iterable
mode, max throughput is usually
somewhere in the 6-12 workers
threads, as long as you have that many
cores.)
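A quick sketch of the corpus_file + workers suggestion (the worker count here simply matches the local core count; adjust for your machine):
import os
from gensim.models.fasttext import FastText

model = FastText(vector_size=300, workers=os.cpu_count())  # one worker thread per local core
model.build_vocab(corpus_file=corpus_path)
model.train(corpus_file=corpus_path, epochs=model.epochs,
            total_examples=model.corpus_count, total_words=model.corpus_total_words)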
min_count=1
is usually a bad idea for these algorithms: they tend to
train faster, in less memory, leaving better vectors for the remaining
words when you discard the lowest-frequency words, as the default
min_count=5
does. (It's possible FastText
can eke a little bit of
benefit out of lower-frequency words via their contribution to
character-n-gram-training, but I'd only ever lower the default
min_count
if I could confirm it was actually improving relevant
results.)
If your corpus is so large that training time is a concern, often a
more-aggressive (smaller) sample
parameter value not only speeds
training (by dropping many redundant high-frequency words), but often
improves final word-vector quality for downstream purposes as well (by
letting the rarer words have relatively more influence on the model in
the absence of the downsampled words).
And again if the corpus is so large that training time is a concern,
then epochs=100
is likely overkill. I believe the GoogleNews
vectors were trained using only 3 passes – over a gigantic corpus. A
sufficiently large & varied corpus, with plenty of examples of all
words all throughout, could potentially train in 1 pass – because each
word-vector can then get more total training-updates than many epochs
with a small corpus. (In general larger epochs
values are more often
used when the corpus is thin, to eke out something – not on a corpus
so large you're considering non-standard shortcuts to speed the
steps.)
-- Gordon
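Putting that advice together, a configuration to try instead of min_count=1 / epochs=100 might look like the following (the sample and epochs values are illustrative assumptions, not numbers from the reply):
import os
from gensim.models.fasttext import FastText

model = FastText(
    vector_size=300,
    min_count=5,            # gensim default: discard the lowest-frequency words
    sample=1e-4,            # more aggressive downsampling than the 1e-3 default (illustrative)
    workers=os.cpu_count(),
)
model.build_vocab(corpus_file=corpus_path)
model.train(corpus_file=corpus_path, epochs=5,   # gensim default, far fewer than 100
            total_examples=model.corpus_count, total_words=model.corpus_total_words)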