Google mT5-small configuration error because the number of attention heads is not a divisor of the model dimension
The configuration file of the HuggingFace google/mt5-small model (https://huggingface.co/google/mt5-small) defines
{
  ...
  "d_model": 512,
  ...
  "num_heads": 6,
  ...
}
Link to the configuration file: https://huggingface.co/google/mt5-small/resolve/main/config.json
Questions:
As far as I understand, the number of attention heads should be a divisor of the model dimension. This is clearly not the case in this configuration file.
Am I misunderstanding how self-attention is applied in mT5?
When I use the AllenNLP model (https://github.com/allenai/allennlp-models/blob/main/allennlp_models/generation/models/t5.py) as a sequence-to-sequence model, I get an error message.
Summary:
allennlp.common.checks.ConfigurationError: The hidden size (512) is not a multiple of the number of attention heads (6)
Full traceback:
Traceback (most recent call last):
File "/snap/pycharm-professional/269/plugins/python/helpers/pydev/pydevd.py", line 1500, in _exec
runpy._run_module_as_main(module_name, alter_argv=False)
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/__main__.py", line 50, in <module>
run()
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/__main__.py", line 46, in run
main(prog="allennlp")
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/commands/__init__.py", line 123, in main
args.func(args)
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/commands/train.py", line 112, in train_model_from_args
train_model_from_file(
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/commands/train.py", line 178, in train_model_from_file
return train_model(
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/commands/train.py", line 254, in train_model
model = _train_worker(
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/commands/train.py", line 490, in _train_worker
train_loop = TrainModel.from_params(
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/common/from_params.py", line 652, in from_params
return retyped_subclass.from_params(
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/common/from_params.py", line 686, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/commands/train.py", line 766, in from_partial_objects
model_ = model.construct(
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/common/lazy.py", line 82, in construct
return self.constructor(**contructor_kwargs)
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/common/lazy.py", line 66, in constructor_to_use
return self._constructor.from_params( # type: ignore[union-attr]
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/common/from_params.py", line 652, in from_params
return retyped_subclass.from_params(
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/common/from_params.py", line 686, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp_models/generation/models/t5.py", line 32, in __init__
self.t5 = T5Module.from_pretrained_module(
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/modules/transformer/transformer_module.py", line 251, in from_pretrained_module
model = cls._from_config(config, **kwargs)
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/modules/transformer/t5.py", line 852, in _from_config
return cls(
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/modules/transformer/t5.py", line 783, in __init__
self.encoder: T5EncoderStack = encoder.construct(
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/common/lazy.py", line 82, in construct
return self.constructor(**contructor_kwargs)
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/modules/transformer/t5.py", line 600, in basic_encoder
self_attention=block_self_attention.construct(
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/common/lazy.py", line 82, in construct
return self.constructor(**contructor_kwargs)
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/common/lazy.py", line 66, in constructor_to_use
return self._constructor.from_params( # type: ignore[union-attr]
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/common/from_params.py", line 686, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/modules/transformer/attention_module.py", line 471, in __init__
super().__init__(
File "/home/lars/anaconda3/envs/mare2/lib/python3.9/site-packages/allennlp/modules/transformer/attention_module.py", line 91, in __init__
raise ConfigurationError(
allennlp.common.checks.ConfigurationError: The hidden size (512) is not a multiple of the number of attention heads (6)
This is a very good question, and it illustrates a common misconception about Transformers that stems from an (unfortunate) formulation in the original Transformers paper. Specifically, the authors write the following in Section 3.2.2:
In this work, we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model / h = 64. [...]
Note that the equality d_k = d_v = d_model / h is not strictly necessary; the only important part is that you match the final hidden representation (d_model) after the feed-forward portion of each layer. Specifically for mt5-small, the authors actually use an inner dimension of 384, which is simply the product of the parameters d_kv * num_heads = 64 * 6.
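To see this directly, here is a minimal check of the relevant config values (a sketch assuming the Hugging Face transformers library is installed):

from transformers import AutoConfig

# Pull the mt5-small config from the Hub and inspect the three dimensions
# involved: the model dimension, the number of heads, and the per-head size d_kv.
config = AutoConfig.from_pretrained("google/mt5-small")
inner_dim = config.d_kv * config.num_heads
print(config.d_model, config.num_heads, config.d_kv, inner_dim)
# Expected output: 512 6 64 384 -> the attention projections map 512 -> 384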
Now, the problem is that many libraries make a similar assumption about an enforced relationship between d_kv and d_model, because it saves some implementation work that most people would not use anyway. I suspect (not being too familiar with AllenNLP) that they made a similar assumption here, which is why you cannot load the model.
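As a rough illustration (only a sketch of the idea, not AllenNLP's actual code), a library that derives the per-head size from d_model instead of reading d_kv from the config effectively performs a check like this:

# Hypothetical sketch: assumes d_kv == d_model / num_heads, which mt5-small violates.
d_model, num_heads = 512, 6
if d_model % num_heads != 0:
    raise ValueError(
        f"The hidden size ({d_model}) is not a multiple of the "
        f"number of attention heads ({num_heads})"
    )
attention_head_size = d_model // num_heads  # only reached when the check passes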
Also, to clarify this point, here is a glimpse of the modules of a loaded mt5-small:
T5Block(
  (layer): ModuleList(
    (0): T5LayerSelfAttention(
      (SelfAttention): T5Attention(
        (q): Linear(in_features=512, out_features=384, bias=False)
        (k): Linear(in_features=512, out_features=384, bias=False)
        (v): Linear(in_features=512, out_features=384, bias=False)
        (o): Linear(in_features=384, out_features=512, bias=False)
      )
      (layer_norm): T5LayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (1): T5LayerFF(
      (DenseReluDense): T5DenseGatedGeluDense(
        (wi_0): Linear(in_features=512, out_features=1024, bias=False)
        (wi_1): Linear(in_features=512, out_features=1024, bias=False)
        (wo): Linear(in_features=1024, out_features=512, bias=False)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (layer_norm): T5LayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
)
You can get the full model layout by simply calling list(model.modules()).
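For reference, here is a minimal way to reproduce the listing above (a sketch assuming transformers is installed and the checkpoint can be downloaded):

from transformers import AutoModel

# Load the pretrained checkpoint; list(model.modules()) walks every submodule,
# including the T5Block shown above.
model = AutoModel.from_pretrained("google/mt5-small")
modules = list(model.modules())
print(modules[0])  # the top-level module prints the complete layout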