ML agent not learning a relatively 'simple' task
I'm trying to create a simple ML agent (a ball) that learns to move toward a target and collide with it.
Unfortunately, the agent doesn't seem to be learning; it just keeps moving around to what look like random positions. After 5M steps the mean reward is still stuck at -1.
Any suggestions on what I'm doing wrong?
[Image: TensorFlow cumulative reward graph]
My observations are here:
/// <summary>
/// Observations:
/// 1: Distance to nearest target
/// 3: Vector to nearest target
/// 3: Target position
/// 3: Agent position
/// 1: Agent velocity X
/// 1: Agent velocity Y
/// = 12 observations in total
/// </summary>
/// <param name="sensor"></param>
public override void CollectObservations(VectorSensor sensor)
{
    //If nearest target is null, observe an empty array and return early
    if (target == null)
    {
        sensor.AddObservation(new float[12]);
        return;
    }

    float distanceToTarget = Vector3.Distance(target.transform.position, this.transform.position);
    //Distance to nearest target (1 observation)
    sensor.AddObservation(distanceToTarget);

    //Vector to nearest target (3 observations)
    Vector3 toTarget = target.transform.position - this.transform.position;
    sensor.AddObservation(toTarget.normalized);

    //Target position (3 observations)
    sensor.AddObservation(target.transform.localPosition);

    //Current position (3 observations)
    sensor.AddObservation(this.transform.localPosition);

    //Agent velocities (2 observations)
    sensor.AddObservation(rigidbody.velocity.x);
    sensor.AddObservation(rigidbody.velocity.y);
}
My YAML configuration file:
behaviors:
  PlayerAgent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 512 #128
      buffer_size: 2048
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2 #0.2
      lambd: 0.99
      num_epoch: 3 #3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 32 #256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 64
        learning_rate: 3.0e-4
    #keep_checkpoints: 5
    #checkpoint_interval: 500000
    max_steps: 5000000
    time_horizon: 64
    summary_freq: 10000
    threaded: true
    framework: tensorflow
[Image: Unity Inspector component config]
Rewards (all on the agent script):
private void Update()
{
    //If the agent falls off the screen, give a negative reward and end the episode
    if (this.transform.position.y < 0)
    {
        AddReward(-1.0f);
        EndEpisode();
    }

    if (target != null)
    {
        Debug.DrawLine(this.transform.position, target.transform.position, Color.green);
    }
}

private void OnCollisionEnter(Collision collidedObj)
{
    //If the agent collides with the goal, provide a reward
    if (collidedObj.gameObject.CompareTag("Goal"))
    {
        AddReward(1.0f);
        Destroy(target);
        EndEpisode();
    }
}

public override void OnActionReceived(float[] vectorAction)
{
    if (!target)
    {
        //Place and assign the target
        envController.PlaceTarget();
        target = envController.ProvideTarget();
    }

    Vector3 controlSignal = Vector3.zero;
    controlSignal.x = vectorAction[0];
    controlSignal.z = vectorAction[1];
    rigidbody.AddForce(controlSignal * moveSpeed, ForceMode.VelocityChange);

    // Apply a tiny negative reward every step to encourage action
    if (this.MaxStep > 0) AddReward(-1f / this.MaxStep);
}
How hard would you say your environment is? If the target is only rarely reached, the agent can't learn. In that case you need to add some intermediate (shaping) reward whenever the agent moves in the right direction. That lets the agent learn even when the terminal reward is sparse.
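For example, a small shaping reward based on how much closer the agent got to the target each step usually helps. Below is only a sketch built on the fields already used in the question (target, envController, moveSpeed, the float[] action API); the previousDistance field and the 0.01f scale are illustrative additions, not part of the original script:

//Goes in the same Agent subclass as the code above.
//Hypothetical additions: previousDistance, shapingScale.
private float previousDistance;
private const float shapingScale = 0.01f; //keep small so the terminal +1 still dominates

public override void OnActionReceived(float[] vectorAction)
{
    if (!target)
    {
        //Place and assign the target, and remember how far away it starts
        envController.PlaceTarget();
        target = envController.ProvideTarget();
        previousDistance = Vector3.Distance(target.transform.position, transform.position);
    }

    // ... existing movement code from the question ...

    float distance = Vector3.Distance(target.transform.position, transform.position);
    //Positive when the agent moved closer this step, negative when it moved away
    AddReward(shapingScale * (previousDistance - distance));
    previousDistance = distance;
}

Keep the shaping term small: summed over an episode it should stay well below the +1 terminal reward, otherwise the agent gets paid for hovering near the target instead of actually hitting it.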
The way you've designed the rewards may also be open to reward hacking. If the agent can't find the target to collect the larger reward, the most efficient strategy is to fall off the platform as fast as possible so it stops paying the small per-step penalty.
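A sketch of one way to close that escape route (the -2.0f value is illustrative; it just needs to outweigh the total step penalty of -1 plus any shaping reward the agent could collect on the way down):

//Inside Update(): make falling off strictly worse than sitting through the whole per-step penalty
if (this.transform.position.y < 0)
{
    AddReward(-2.0f);
    EndEpisode();
}

Alternatively, drop the per-step penalty entirely once you have distance-based shaping, since the shaping term already pushes the agent to reach the target quickly.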