Why are there different size limits in Watson NLC for training (1024 chars) and for production (2048 chars)?
IBM Watson Natural Language Classifier (NLC) limits the text values in the training set to 1024 characters:
https://console.bluemix.net/docs/services/natural-language-classifier/using-your-data.html#training-limits.
However, the trained model can then classify texts of up to 2048 characters each:
https://console.bluemix.net/apidocs/natural-language-classifier#classify-a-phrase.
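This difference confuses me a bit: as far as I know, the same preprocessing should be applied in the training phase and in the production phase, so if I have to cap the training data at 1024 characters, I would do the same in production.
Is my reasoning correct? Should I cap the text in production at 1024 characters (as I think I should) or at 2048 characters (perhaps because 1024 characters is too few)?
Thanks in advance!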
I recently had the same question, and an answer in an article cleared it up:
Currently, the limits are set at 1024 for training and 2048 for
testing/classification. The 1024 limit may require some curation of
the training data prior to training. Most organizations who require
larger character limits for their data end up chunking their input
text into 1024 chunks. Additionally, in use cases with data similar to
the Airbnb reviews, the primary category can typically be assessed
within the first 2048 characters since there is often a lot of noise
in lengthy reviews.
Here is the link to the article.
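To make the chunking idea from the quote concrete, here is a minimal Python sketch. It assumes the limits are counted the way Python's `len()` counts characters, which may not match how the service counts them. The helper names `chunk_text` and `truncate_for_classification` are hypothetical and not part of any Watson SDK.

```python
TRAIN_LIMIT = 1024      # training limit quoted above
CLASSIFY_LIMIT = 2048   # classification limit quoted above


def chunk_text(text, limit=TRAIN_LIMIT):
    """Split text into chunks of at most `limit` characters.

    Breaks on whitespace where possible; a single word longer than
    `limit` is hard-split. Runs of whitespace collapse to one space.
    """
    chunks = []
    current = ""
    for word in text.split():
        # Hard-split any word that cannot fit in a chunk by itself.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def truncate_for_classification(text, limit=CLASSIFY_LIMIT):
    """Keep only the first `limit` characters before classifying."""
    return text[:limit]
```

Whether to classify chunks of up to 1024 characters (matching the training data) or a single text of up to 2048 characters is a judgment call that depends on your data; the sketch only shows how either cap could be enforced before sending text to the service.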