ConceptNet Numberbatch (multilingual) OOV words

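I am working on a text classification problem (on a French corpus) and I am experimenting with different word embeddings. I was very interested in what ConceptNet has to offer, so I decided to give it a try.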

I could not find a tutorial dedicated to my particular task, so I followed the advice from their blog:

How do I use ConceptNet Numberbatch?

To make it as straightforward as possible:

Work through any tutorial on machine learning for NLP that uses semantic vectors. Get to the part where they tell you to use word2vec. (A particularly enlightened tutorial may tell you to use GloVe 1.2.)

Get the ConceptNet Numberbatch data, and use it instead. Get better results that also generalize to other languages.

Below you can find my approach (note that 'numberbatch.txt' is the file containing the recommended multilingual version, ConceptNet Numberbatch 19.08):

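import numpy as np

embeddings_index = {}

# Read the file as UTF-8 so that accented French words decode correctly.
with open('numberbatch.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        if len(values) == 2:
            continue  # skip a word2vec-style "<rows> <dims>" header, if present
        word = values[0]  # first field is the key, the rest is the vector
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))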

I started by testing whether a single word exists:

word = 'fille'
missingWords = 0
if word not in embeddings_index:
    missingWords += 1
print(missingWords)

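To my surprise, a simple word like 'fille' (French for 'girl') is not found. I then wrote a function to print all the OOV words in my corpus. I was even more surprised when I analyzed the results: more than 22k words were not found (including 'nous' (we), 'être' (to be), etc.).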

I also tried the approach proposed for OOV words on the GitHub page (same result):

Out-of-vocabulary strategy

ConceptNet Numberbatch is evaluated with an out-of-vocabulary strategy that helps its performance in the presence of unfamiliar words. The strategy is implemented in the ConceptNet code base. It can be summarized as follows:

Given an unknown word whose language is not English, try looking up the equivalently-spelled word in the English embeddings (because English words tend to end up in text of all languages).

Given an unknown word, remove a letter from the end, and see if that is a prefix of known words. If so, average the embeddings of those known words.

If the prefix is still unknown, continue removing letters from the end until a known prefix is found. Give up when a single character remains.
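
For reference, the quoted strategy can be sketched roughly as follows. This is only an illustration, not the implementation in the ConceptNet code base: the function name oov_vector is hypothetical, and the vectors are assumed, purely for the example, to sit in a dict keyed by (language, word) tuples.

```python
import numpy as np

def oov_vector(word, lang, vectors):
    """Return a vector for `word`, or None if every fallback fails.

    `vectors` is assumed to map (language, word) tuples to NumPy arrays.
    """
    if (lang, word) in vectors:
        return vectors[(lang, word)]

    # Step 1: for a non-English word, try the identically spelled English word.
    if lang != 'en' and ('en', word) in vectors:
        return vectors[('en', word)]

    # Steps 2 and 3: remove letters from the end until what is left is a
    # prefix of known words in the same language, then average those words.
    # Stop once only a single character remains.
    prefix = word[:-1]
    while len(prefix) > 1:
        matches = [vec for (l, w), vec in vectors.items()
                   if l == lang and w.startswith(prefix)]
        if matches:
            return np.mean(matches, axis=0)
        prefix = prefix[:-1]
    return None
```

The linear scan over all keys keeps the sketch short; for a vocabulary with millions of entries, a sorted key list or a trie would be a more realistic choice for the prefix search.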

Is there something wrong with my approach?

Did you take the format of ConceptNet Numberbatch into account? As shown on the project's GitHub, it looks like this:

/c/en/absolute_value -0.0847 -0.1316 -0.0800 -0.0708 -0.2514 -0.1687 -...

/c/en/absolute_zero 0.0056 -0.0051 0.0332 -0.1525 -0.0955 -0.0902 0.07...

This format means that fille will not be found, but /c/fr/fille will.
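
A minimal sketch of a lookup that respects this key format is shown here. The helper names to_uri and find_missing are made up for the example, and the normalization (lowercasing, spaces to underscores) is a simplification of what the ConceptNet code base does, so treat it as an assumption rather than the official procedure. The small dict stands in for the embeddings_index loaded from the file.

```python
import numpy as np

def to_uri(word, lang):
    """Build a key such as /c/fr/fille (simplified normalization, see above)."""
    return '/c/{}/{}'.format(lang, word.strip().lower().replace(' ', '_'))

def find_missing(words, lang, embeddings_index):
    """Return the words whose /c/<lang>/<word> key is not in the index."""
    return [w for w in words if to_uri(w, lang) not in embeddings_index]

# Tiny stand-in for the index loaded from numberbatch.txt.
embeddings_index = {
    '/c/fr/fille': np.zeros(3, dtype='float32'),
    '/c/fr/nous': np.zeros(3, dtype='float32'),
}

missing = find_missing(['fille', 'Nous', 'xyzzy'], 'fr', embeddings_index)
print(missing)  # ['xyzzy']
```

With the real file, the same check would be run against the full embeddings_index, passing 'fr' as the language for a French corpus.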