Rasa NLU - Understanding Training Data

Question

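I'm having trouble understanding training data in Rasa NLU. Let's say I want training data in which someone is informing the bot of the animal they want to purchase. I will use the Markdown format for clarity:

Let's say the user is answering a question: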

"What kind of animal would you like to buy?"

There are only so many different ways of saying you want to buy something. Take the example below:

##intent:inform
- [cat](animal)
- buy [cat](animal)
- I would like to buy a [cat](animal)

Would I need to repeat this for every type of animal I intend to handle? Like below?

##intent:inform
- [cat](animal)
- [dog](animal)
- [parrot](animal)
- buy [cat](animal)
- buy [dog](animal)
- buy [parrot](animal)
- I would like to buy a [cat](animal)
- I would like to buy a [dog](animal)
- I would like to buy a [parrot](animal)

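Additionally, I've noticed that in Rasa's restaurant bot they sometimes repeat the same example over and over again, sometimes as many as seven times, like below: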

##intent:inform
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- buy [cat](animal)
- I would like to buy a [cat](animal)

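Why is that necessary? What effect does it have on understanding? How does the same word appearing more times in the same position indicate that it is an appropriate response, especially if you have something like the following, where different values of the same entity are repeated the same number of times?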

##intent:inform
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- [cat](animal)
- buy [cat](animal)
- I would like to buy a [cat](animal)
- [dog](animal)
- [dog](animal)
- [dog](animal)
- [dog](animal)
- [dog](animal)
- buy [dog](animal)
- I would like to buy a [dog](animal)

Thanks, any advice would be appreciated.

Answer 1

Would I need to repeat this for every type of animal I intended to handle? Like below?

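No, you don't need to specify every animal. But try to add a few different animals for each intent. For example, your training samples could contain something like this: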

 - [cat](animal)
 - [dog](animal)
 - [parrot](animal)
 - buy [cat](animal)
 - I would like to buy a [parrot](animal)

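When you have some context, e.g. "I would like to buy a [animal]", there is no need to define every animal. The similarity algorithm finds the item based on the other keywords. But when there is less context (a single word), you need to define the different kinds of user input.

Rasa uses a StarSpace classifier. It is recommended to use ~10-25 user samples per intent to get reasonable responses from the chatbot.

You can also modify the Rasa classifier to add word-vector features (Word2vec or GloVe). In that case some generality is added to the model, and similar concepts such as dog-cat will be detected more easily.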
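
As a rough way to check the ~10-25 samples guideline above, the following sketch counts the examples under each intent header in Markdown-format training data. The function name `count_examples_per_intent` and the threshold of 10 are assumptions made for illustration; this is not part of Rasa's own tooling.

```python
from collections import Counter

def count_examples_per_intent(markdown_text):
    """Count example lines under each '##intent:<name>' header."""
    counts = Counter()
    current_intent = None
    for raw_line in markdown_text.splitlines():
        line = raw_line.strip()
        if line.startswith("##"):
            header = line.lstrip("#").strip()
            # Only intent sections are counted; any other section resets the state.
            if header.startswith("intent:"):
                current_intent = header[len("intent:"):].strip()
            else:
                current_intent = None
        elif line.startswith("- ") and current_intent is not None:
            counts[current_intent] += 1
    return counts

training_data = """
##intent:inform
- [cat](animal)
- buy [dog](animal)
- I would like to buy a [parrot](animal)
"""

counts = count_examples_per_intent(training_data)
for intent, n in counts.items():
    # The threshold of 10 is the lower end of the range suggested above.
    status = "ok" if n >= 10 else "consider adding more examples"
    print(f"{intent}: {n} examples ({status})")
```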
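
To illustrate why word vectors make related words such as dog and cat easier to detect, here is a minimal sketch using cosine similarity. The three-dimensional vectors are made up for illustration only; real Word2vec or GloVe vectors have hundreds of dimensions and are learned from large corpora.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, NOT taken from any real embedding model.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "buy": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_buy = cosine_similarity(toy_vectors["cat"], toy_vectors["buy"])

# With these toy values, the two animals are closer to each other than to the verb.
print(f"cat vs dog: {cat_dog:.2f}")
print(f"cat vs buy: {cat_buy:.2f}")
```

With features like these, a model that has only seen "cat" in training has a better chance of treating "dog" similarly, because the two inputs look alike in vector space.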

Answer 2

There are only so many different ways of saying you want to buy something.

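You might be surprised:

Can I buy a dog?
I would like to buy a dog.
I really want a dog
I'd be so happy if I had a dog.
I'm looking for a pet, maybe a dog.
buy dog
adopt dog
get a dog
bring home dog

And I'm sure the list could go on with more examples. That being said, Rasa NLU should be able to learn and generalize from a handful of examples. There are some exceptions: adopt, for instance, probably doesn't have a strong relationship to buy, so it may be important to include it as an example.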

Would I need to repeat this for every type of animal I intended to handle? Like below?

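No, it's not necessary. Each animal value is an entity, and by default Rasa uses a CRF for entity recognition, which is what you are talking about here. A CRF is more about the structure of the sentence than the value of the words. You can see the features the CRF looks at in the docs and code: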

  # Available features are:
  # ``low``, ``title``, ``suffix5``, ``suffix3``, ``suffix2``,
  # ``suffix1``, ``pos``, ``pos2``, ``prefix5``, ``prefix2``,
  # ``bias``, ``upper`` and ``digit``
  features: [["low", "title"], ["bias", "suffix3"], ["upper", "pos", "pos2"]]
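
To make the point about structure concrete, the sketch below computes a few of the listed features by hand for one token and its neighbours. It assumes the three lists in `features` apply to the previous token, the current token, and the next token respectively, and it skips `pos`/`pos2` because they need a part-of-speech tagger. The helper `window_features` is hypothetical and is not Rasa's implementation.

```python
def window_features(tokens, i):
    """Hand-computed features for tokens[i] and its immediate neighbours."""
    feats = {
        "bias": True,                 # constant feature for the current token
        "suffix3": tokens[i][-3:],    # last three characters of the current token
    }
    if i > 0:
        feats["prev:low"] = tokens[i - 1].lower()
        feats["prev:title"] = tokens[i - 1].istitle()
    if i < len(tokens) - 1:
        feats["next:upper"] = tokens[i + 1].isupper()
    return feats

cat_tokens = "I would like to buy a cat".split()
parrot_tokens = "I would like to buy a parrot".split()

cat_feats = window_features(cat_tokens, 6)
parrot_feats = window_features(parrot_tokens, 6)

# Collect the feature names whose values differ between the two entity tokens.
differing = sorted(k for k in cat_feats if cat_feats[k] != parrot_feats.get(k))
print(differing)
```

In this sketch the two entity tokens differ only in their suffix feature, while everything derived from the surrounding words is identical, which is why an unseen animal in a familiar sentence pattern can still be tagged.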

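That said, using different values for entities can be a good way to get additional training data. You can use tools like chatito to generate the training data from patterns. But be careful about repeating patterns, as you can overfit the model so that it is unable to generalize beyond the patterns you trained on.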
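
The idea of generating training data from patterns can be sketched in plain Python. This does not use chatito or its syntax; the `{animal}` placeholder convention and the `expand` helper are assumptions made for illustration.

```python
from itertools import product

def expand(templates, values, entity="animal"):
    """Fill every template with every entity value, annotated in Rasa Markdown style."""
    placeholder = "{" + entity + "}"
    lines = []
    for template, value in product(templates, values):
        annotated = f"[{value}]({entity})"
        lines.append("- " + template.replace(placeholder, annotated))
    return lines

templates = ["{animal}", "buy {animal}", "I would like to buy a {animal}"]
animals = ["cat", "dog", "parrot"]

lines = expand(templates, animals)
print("##intent:inform")
print("\n".join(lines))
```

Expanding every template with every value grows the data quickly, so in practice it may be worth sampling only a subset of the combinations to limit the overfitting risk mentioned above.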

they sometimes repeat the same example over and over again

Did you see this in a Rasa dataset? Here is the default restaurant bot training data, and I don't see any duplicates.

Repeating a sentence over and over again reinforces to the model that those formats/words are important; this is a form of oversampling. This can be a good thing if you have very little training data or highly unbalanced training data. It can be a bad thing if you want to handle a lot of different ways to buy a pet, as it can overfit the model, as I mentioned above.
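
To see how much of an intent's data consists of repeats, a small sketch like the following can help. It uses the repeated example list from the question; the variable names are illustrative only.

```python
from collections import Counter

examples = [
    "[cat](animal)",
    "[cat](animal)",
    "[cat](animal)",
    "[cat](animal)",
    "[cat](animal)",
    "buy [cat](animal)",
    "I would like to buy a [cat](animal)",
]

counts = Counter(examples)

# Keep only examples that occur more than once.
repeated = {text: n for text, n in counts.items() if n > 1}

for text, n in repeated.items():
    share = n / len(examples)
    print(f"{text!r} appears {n} times ({share:.0%} of the examples)")
```

A single pattern making up most of an intent's examples is the kind of imbalance that pushes the model toward that one form.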

machine-learning

rasa-nlu

rasa-core