TensorFlow Keras texts_to_sequences returns a list of lists

I am running into a problem with texts_to_sequences in tf.keras.

test_data = 'The invention relates to the fields of biotechnology, virology, epidemiology and public health, and is method for obtaining of new inactivated vaccine against coronavirus COVID-19. The essence matter of invention is COVID-19 virus SARS-CoV-2/KZ_Almaty/04.2020 strain isolated on the territory of the Republic of Kazakhstan. The strain of COVID-19 virus according to the optimal cultivation conditions is produced in the Vero cell culture system, inactivated by formaldehyde, clarified by low-speed centrifugation, purified and concentrated by diafiltration on diafiltration unit of Millipore Pellicon Cassette system. Sterilizing filtration is carried out through cascades of filters with a pore diameter of 0.45/0.22 μm. 2 % aluminum hydroxide gel Algidrogel, 85 is added in the obtained virus pool (viral concentrate) to final concentration of 0.5 mg/0.5 ml and bottled in glass vials. The vaccine obtained in this way is safe at intraperitoneal introduction to white mice and intravenously - to rabbits. The vaccine provides 80 % protection against COVID-19 infection for at least 6 months after two vaccinations. The vaccine keeps its properties for 12 months at 4-6°C.'

I have this string of test data, and I am trying to predict its class with a model I have already trained. The problem shows up when I call texts_to_sequences:

test = tf.keras.preprocessing.text.text_to_word_sequence(test_data)
test = token.texts_to_sequences(test)
print(test)

Somehow it returns a list of lists instead of a single list of word tokens.

[[1], [7726], [1], [13], [7726], [1], [2997], [1], [1], [7509], [1], [1], [1], [1], [4842], [1], [7167], [1], [1], [1], [1], [1], [4842], [1], [1], [1], [8383], [1], [1], [1], [1], [1], [1], [7167], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [5979], [1], [6054], [1], [13], [1], [1], [7509], [1], [13], [1], [1], [1], [1], [1], [14214], [1], [1], [1], [1], [689], [1], [1], [1], [4842], [1], [7167], [1], [1], [1], [1], [1], [1], [1], [7167], [1], [1], [7509], [1], [9204], [1], [1], [1], [1], [1], [7167], [1], [1], [1], [4842], [1], [7167], [1], [1], [5979], [1], [1], [7167], [1], [1], [1], [6054], [1], [1], [1], [7509], [1], [1], [4842], [1], [7167], [6054], [1], [1], [1], [1]]

Then I pad the result:

test = pad_sequences(test, maxlen=max_length, padding='post')
test

So with max_length set to 200, the padded output looks like this:

array([[   1,    0,    0, ...,    0,    0,    0],
       [7726,    0,    0, ...,    0,    0,    0],
       [   1,    0,    0, ...,    0,    0,    0],
       ...,
       [   1,    0,    0, ...,    0,    0,    0],
       [   1,    0,    0, ...,    0,    0,    0],
       [   1,    0,    0, ...,    0,    0,    0]], dtype=int32)

It should be a single array of length 200.

I did some testing, and the problem seems to come from texts_to_sequences, which returns this incorrect list.

Any idea what the cause might be? Should I change the input to texts_to_sequences, or is there another solution?

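If you are already using the Tokenizer class, you should not call text_to_word_sequence as well, because the Tokenizer already does what text_to_word_sequence does, namely tokenization. texts_to_sequences expects a list of texts and returns one sequence per text, so when you pass it a list of individual words, each word is treated as its own text and you get one single-element sequence per word. Try something like this: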

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=300, filters=' ', oov_token='UNK')
test_data = 'The invention relates to the fields of biotechnology, virology, epidemiology and public health, and is method for obtaining of new inactivated vaccine against coronavirus COVID-19. The essence matter of invention is COVID-19 virus SARS-CoV-2/KZ_Almaty/04.2020 strain isolated on the territory of the Republic of Kazakhstan. The strain of COVID-19 virus according to the optimal cultivation conditions is produced in the Vero cell culture system, inactivated by formaldehyde, clarified by low-speed centrifugation, purified and concentrated by diafiltration on diafiltration unit of Millipore Pellicon Cassette system. Sterilizing filtration is carried out through cascades of filters with a pore diameter of 0.45/0.22 μm. 2 % aluminum hydroxide gel Algidrogel, 85 is added in the obtained virus pool (viral concentrate) to final concentration of 0.5 mg/0.5 ml and bottled in glass vials. The vaccine obtained in this way is safe at intraperitoneal introduction to white mice and intravenously - to rabbits. The vaccine provides 80 % protection against COVID-19 infection for at least 6 months after two vaccinations. The vaccine keeps its properties for 12 months at 4-6°C.'
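# Wrap the string in a list: the Tokenizer works on a list of texts
test = [test_data]
tokenizer.fit_on_texts(test)
# One text in, one sequence out
test = tokenizer.texts_to_sequences(test)
# Pad (or truncate) that single sequence to length 200
test = tf.keras.preprocessing.sequence.pad_sequences(test, maxlen=200, padding='post')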

print(test.shape)
# (1, 200)
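
To make the difference concrete, here is a minimal pure-Python sketch of the "one sequence per input element" behavior described above. It is a toy stand-in, not the Keras implementation: the helper toy_texts_to_sequences, the vocabulary in word_index, and the OOV id of 1 are all made up for illustration.

```python
# Toy stand-in for Tokenizer.texts_to_sequences (hypothetical helper, not Keras code).
# Like the real method, it treats every element of `texts` as one document
# and returns one list of ids per element.
def toy_texts_to_sequences(texts, word_index, oov_id=1):
    sequences = []
    for text in texts:
        words = text.lower().split()
        sequences.append([word_index.get(w, oov_id) for w in words])
    return sequences


# Made-up vocabulary, purely for illustration
word_index = {"the": 2, "vaccine": 3, "is": 4, "safe": 5}
sentence = "The vaccine is safe"

# Passing pre-split words: each word is treated as its own document.
per_word = toy_texts_to_sequences(sentence.split(), word_index)

# Passing the sentence wrapped in a list: one document, one sequence.
per_doc = toy_texts_to_sequences([sentence], word_index)

print(per_word)  # [[2], [3], [4], [5]]
print(per_doc)   # [[2, 3, 4, 5]]
```

The first result has the same shape as the output in the question (one inner list per word), while the second is the single sequence that padding can then turn into one row of length 200.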