Tensorflow.js tokenizer

Question

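I am new to machine learning and Tensorflow. Since I don't know Python, I decided to use the JavaScript version (which may be more of a wrapper).

The problem is that I am trying to build a model that processes natural language. So the first step is to tokenize the text in order to feed the data to the model. I did a lot of research, but most people use the Python version of Tensorflow, with methods such as tf.keras.preprocessing.text.Tokenizer, and I couldn't find anything similar in tensorflow.js. I am stuck at this step and don't know how to convert text into a vector that can be fed to the model. Please help :)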

Answer 1

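There are many ways to convert text to vectors, depending on the use case. The most intuitive one uses term frequency: given the vocabulary of the corpus (all possible words), every text document is represented as a vector in which each entry is the number of times the corresponding word occurs in that document.

With this vocabulary: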

["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"]

the following text:

["machine", "is", "a", "field", "machine", "is", "is"]

will be converted to this vector:

[2, 0, 3, 1, 0, 1, 0, 0, 0]
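
The example above assumes the text has already been split into words and that a vocabulary is available. A minimal sketch of that earlier step could look like the following. The helper names `tokenize` and `buildVocabulary` are hypothetical (they are not part of tensorflow.js), and the splitting rule is a deliberately naive assumption that only suits simple English text.

```javascript
// Hypothetical helper: lowercase the text and split it on anything that is
// not a letter, digit, or apostrophe. Real tokenizers handle many more cases.
const tokenize = (sentence) =>
  sentence.toLowerCase().split(/[^a-z0-9']+/).filter((w) => w.length > 0);

// Hypothetical helper: collect the unique words of a corpus in first-seen order.
const buildVocabulary = (documents) => {
  const seen = new Set();
  for (const doc of documents) {
    for (const word of tokenize(doc)) seen.add(word);
  }
  return [...seen];
};

const corpus = ["Machine learning is a new field in computer science."];
const vocab = buildVocabulary(corpus);
console.log(vocab);
```

For this one-sentence corpus the resulting array contains the same nine words as the vocabulary shown above.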

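One drawback of this technique is that the vector has the same size as the corpus vocabulary, so it may contain a lot of zeros. That is why other techniques exist. Still, this bag-of-words representation is widely used, and there is a slightly different version of it that uses TF-IDF weighting.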

const vocabulary = ["machine", "learning", "is", "a", "new", "field", "in", "computer", "science"];
const text = ["machine", "is", "a", "field", "machine", "is", "is"];
// For each vocabulary word, count how many times it occurs in the text.
const parse = (t) => vocabulary.map((w) => t.reduce((count, token) => (token === w ? count + 1 : count), 0));
console.log(parse(text)); // [2, 0, 3, 1, 0, 1, 0, 0, 0]
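
The TF-IDF variant mentioned earlier can be sketched in the same style. This is only an illustration under stated assumptions: it uses the plain log(N / df) form of the inverse document frequency (libraries often use smoothed variants), and the function names and the two-document corpus are made up for the example.

```javascript
// Term frequency: how many times each vocabulary word occurs in one document.
const termFrequency = (vocab, doc) =>
  vocab.map((word) => doc.filter((token) => token === word).length);

// Inverse document frequency, using the plain log(N / df) variant.
// Words that occur in no document get 0 to avoid dividing by zero.
const inverseDocumentFrequency = (vocab, docs) =>
  vocab.map((word) => {
    const df = docs.filter((doc) => doc.includes(word)).length;
    return df === 0 ? 0 : Math.log(docs.length / df);
  });

// TF-IDF: weight each raw count by how rare the word is across documents.
const tfIdf = (vocab, docs) => {
  const idf = inverseDocumentFrequency(vocab, docs);
  return docs.map((doc) =>
    termFrequency(vocab, doc).map((count, i) => count * idf[i])
  );
};

// Made-up corpus of two already-tokenized documents.
const vocab = ["machine", "learning", "is", "a", "field"];
const docs = [
  ["machine", "is", "a", "field", "machine", "is", "is"],
  ["learning", "is", "a", "field"],
];
const weights = tfIdf(vocab, docs);
console.log(weights);
```

Words that appear in every document ("is", "a", "field" here) end up with weight 0, while "machine" and "learning" keep a positive weight because each occurs in only one document.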

There is also the following module that might help you achieve what you want

Answer 2

Well, I ran into this problem and handled it with the following steps:

After tokenizer.fit_on_texts([data]), print tokenizer.word_index in your Python code.
Copy the word_index output and save it as a JSON file.
Refer to this JSON object to generate the tokenized word, like so:
function getTokenisedWord(seedWord) {
  const _token = word2index[seedWord.toLowerCase()];
  return tf.tensor1d([_token]);
}
Feed the model:
const seedWordToken = getTokenisedWord('Hello');
model.predict(seedWordToken).data().then(predictions => {
  const resultIdx = tf.argMax(predictions).dataSync()[0];
  console.log('Predicted Word ::', index2word[resultIdx]);
});
index2word is the reverse mapping of the word2index JSON object.
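
To go from a single word to a whole sentence, the same word2index lookup can be applied to every word and the result padded to a fixed length. The sketch below is plain JavaScript and makes several assumptions: the small `word2index` object stands in for the JSON exported from Python, `textToSequence` is a hypothetical helper, unknown words are simply skipped, and zeros are added at the front, which is meant to mirror the Keras defaults as far as I know.

```javascript
// Assumption: a tiny stand-in for the word_index JSON exported from Python.
// Keras assigns indices starting at 1, so 0 is free for padding here.
const word2index = { hello: 1, world: 2, machine: 3, learning: 4 };

// Reverse mapping: index -> word.
const index2word = Object.fromEntries(
  Object.entries(word2index).map(([word, index]) => [index, word])
);

// Hypothetical helper: map a sentence to indices, skipping unknown words,
// then keep the last maxLen indices and left-pad with zeros.
// Assumes maxLen >= 1.
const textToSequence = (sentence, maxLen) => {
  const indices = sentence
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => Object.prototype.hasOwnProperty.call(word2index, w))
    .map((w) => word2index[w]);
  const trimmed = indices.slice(-maxLen);
  return new Array(maxLen - trimmed.length).fill(0).concat(trimmed);
};

const sequence = textToSequence("Hello machine learning fans", 5);
console.log(sequence); // [0, 0, 1, 3, 4]
console.log(index2word[sequence[4]]); // learning
```

Depending on the input shape the model expects, the resulting array could then be wrapped in a tensor (for example with tf.tensor2d([sequence])) before being passed to model.predict.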
