The code above is sufficient for understanding and studying the algorithm, but it would need further modification before being applied directly to a real project.
(1) High-level policy:
- The high-level Actor selects a subgoal that serves as the input to the low-level policy.
Each level has its own Actor-Critic network: the high-level policy provides subgoals to the low level, while the low level executes concrete actions based on those subgoals.
Background: hierarchical approaches have the potential to solve sequential decision-making tasks with higher sample efficiency than non-hierarchical ones, because a hierarchy can decompose a task into a set of subtasks that each require only short sequences of decisions.
(Figure: an episode trajectory from a simple toy example.)
5. Advantages and Challenges of HAC
(1) Advantages:
- Suited to long-horizon tasks: thanks to its hierarchical structure, HAC can handle tasks with long time horizons and performs especially well on multi-step decision problems. The algorithm uses a two-level Actor-Critic architecture to learn both the policies and the value functions, and it reduces the difficulty of learning by decomposing the task into subtasks.
- Easy to extend: HAC can be extended to a multi-level policy architecture, adding more levels to cope with more complex tasks (see the sketch below).
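As one illustration of that extensibility, here is a hypothetical way to build a stack of k policies, reusing the ActorCritic class defined in the code section later in this article. This is only a sketch under the assumption that every level shares the same goal dimension; it is not part of the article's implementation.

# Hypothetical k-level construction (sketch only, not the article's code):
# the top level sees only the state, intermediate levels see state + subgoal,
# and only the bottom level (i == 0) outputs primitive actions.
def build_levels(num_levels, state_dim, goal_dim, action_dim):
    nets = []
    for i in range(num_levels):
        in_dim = state_dim + (goal_dim if i < num_levels - 1 else 0)
        out_dim = action_dim if i == 0 else goal_dim
        nets.append(ActorCritic(in_dim, out_dim))
    return nets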
(2) Low-level policy:
- The low-level Actor takes the subgoal set by the high level as input and selects the concrete action for the current time step.
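To make this division of labor concrete, here is a minimal sketch of a single episode under a two-level hierarchy. high_policy, low_policy and goal_horizon are placeholder names introduced for illustration; the full trainable version appears in the code section below.

# Minimal sketch of one episode with a two-level hierarchy
# (high_policy / low_policy are placeholder callables, not the article's classes).
def hierarchical_episode(env, high_policy, low_policy, goal_horizon=10):
    state, _ = env.reset()
    goal = high_policy(state)                # high level proposes a subgoal
    done, steps = False, 0
    while not done:
        action = low_policy(state, goal)     # low level acts toward that subgoal
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        steps += 1
        if steps % goal_horizon == 0:        # periodically the high level
            goal = high_policy(state)        # replaces the subgoal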
2. Network Structure of HAC
HAC's network structure uses a two-level Actor-Critic architecture:
- High-level Actor-Critic: responsible for selecting a subgoal based on the current state, operating over a longer time horizon (e.g., 10 steps).
- Low-level Actor-Critic: responsible for selecting a concrete action at every time step, conditioned on the current state and the subgoal provided by the high level.
Environment requirements
pip install gym torch numpy
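One assumption worth stating explicitly: the training and test code below uses the newer Gym API (gym 0.26+ or gymnasium), in which the reset and step calls return

state, info = env.reset()                                           # (observation, info)
next_state, reward, terminated, truncated, info = env.step(action)  # 5-tuple

With older gym versions, the unpacking in the training and test loops would need to be adjusted accordingly.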
Algorithm training code
"""《Hierarchical Actor-Critic (HAC) 算法项目》 时间:2024.10.11 环境:CartPole 作者:不去幼儿园"""import gymimport torchimport torch.nn as nnimport torch.optim as optimimport numpy as npimport random# 超参数GAMMA = 0.99LEARNING_RATE = 0.001EPSILON_DECAY = 0.995MIN_EPSILON = 0.1NUM_EPISODES = 500HIGH_LEVEL_UPDATE_FREQUENCY = 10 # 高层更新频率# Actor-Critic 网络class ActorCritic(nn.Module): def __init__(self, state_dim, action_dim): super(ActorCritic, self).__init__() self.actor = nn.Sequential( nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, action_dim), nn.Softmax(dim=-1) ) self.critic = nn.Sequential( nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, 1) ) def forward(self, state): action_probs = self.actor(state) state_value = self.critic(state) return action_probs, state_value# Hierarchical Actor-Critic 智能体class HACAgent: def __init__(self, state_dim, action_dim, goal_dim): # 高层 Actor-Critic self.high_level_net = ActorCritic(state_dim, goal_dim) # 低层 Actor-Critic self.low_level_net = ActorCritic(state_dim + 1, action_dim) # 优化器 self.high_level_optimizer = optim.Adam(self.high_level_net.parameters(), lr=LEARNING_RATE) self.low_level_optimizer = optim.Adam(self.low_level_net.parameters(), lr=LEARNING_RATE) self.epsilon = 1.0 def select_high_level_goal(self, state, epsilon): if random.random() < epsilon: return random.choice([0, 1]) # 随机选择目标 else: state = torch.FloatTensor(state).unsqueeze(0) action_probs, _ = self.high_level_net(state) return torch.argmax(action_probs).item() def select_low_level_action(self, state, goal, epsilon): if random.random() < epsilon: return random.choice([0, 1]) # 随机选择动作 else: state_goal = torch.cat((torch.FloatTensor(state).unsqueeze(0), torch.FloatTensor([[goal]])), dim=-1) action_probs, _ = self.low_level_net(state_goal) return torch.argmax(action_probs).item() # 修改后的 update_high_level 方法 def update_high_level(self, state, goal, reward, next_state): state = torch.FloatTensor(state).unsqueeze(0) next_state = torch.FloatTensor(next_state).unsqueeze(0) # 分别提取动作概率和状态值 _, state_value = self.high_level_net(state) _, next_state_value = self.high_level_net(next_state) # 确保仅对状态值使用 detach next_state_value = next_state_value.detach() # 计算TD目标值 target_value = reward + GAMMA * next_state_value loss_critic = nn.functional.mse_loss(state_value, target_value) # 计算高层策略损失 action_probs, _ = self.high_level_net(state) log_prob = torch.log(action_probs[0, goal]) advantage = target_value - state_value loss_actor = -log_prob * advantage loss = loss_critic + loss_actor self.high_level_optimizer.zero_grad() loss.backward() self.high_level_optimizer.step() # 修改后的 update_low_level 方法 def update_low_level(self, state, goal, action, reward, next_state): state_goal = torch.cat((torch.FloatTensor(state).unsqueeze(0), torch.FloatTensor([[goal]])), dim=-1) next_state_goal = torch.cat((torch.FloatTensor(next_state).unsqueeze(0), torch.FloatTensor([[goal]])), dim=-1) # 分别提取动作概率和状态值 _, state_value = self.low_level_net(state_goal) _, next_state_value = self.low_level_net(next_state_goal) # 确保仅对状态值使用 detach next_state_value = next_state_value.detach() # 计算TD目标值 target_value = reward + GAMMA * next_state_value loss_critic = nn.functional.mse_loss(state_value, target_value) # 计算低层策略损失 action_probs, _ = self.low_level_net(state_goal) log_prob = torch.log(action_probs[0, action]) advantage = target_value - state_value loss_actor = -log_prob * advantage loss = loss_critic + loss_actor self.low_level_optimizer.zero_grad() loss.backward() self.low_level_optimizer.step() def train(self, env, num_episodes): goal_dim = 2 # 
目标维度设定为2,直接设定 for episode in range(num_episodes): state, _ = env.reset() # 修改后的reset返回值 goal = self.select_high_level_goal(state, self.epsilon) # 高层选择目标 done = False episode_reward = 0 steps = 0 while not done: steps += 1 action = self.select_low_level_action(state, goal, self.epsilon) # 低层选择动作 next_state, reward, done, _, _ = env.step(action) # 修改后的step返回值 # 更新低层 self.update_low_level(state, goal, action, reward, next_state) # 每隔 HIGH_LEVEL_UPDATE_FREQUENCY 更新一次高层 if steps % HIGH_LEVEL_UPDATE_FREQUENCY == 0: new_goal = self.select_high_level_goal(next_state, self.epsilon) self.update_high_level(state, goal, reward, next_state) goal = new_goal state = next_state episode_reward += reward self.epsilon = max(MIN_EPSILON, self.epsilon * EPSILON_DECAY) print(f"Episode {episode + 1}: Total Reward: {episode_reward}")# 创建 CartPole 环境用于训练,不渲染动画env = gym.make('CartPole-v1') # 不设置 render_modstate_dim = env.observation_space.shape[0]action_dim = env.action_space.ngoal_dim = 2 # 目标维度设为2(简单划分目标)agent = HACAgent(state_dim, action_dim, goal_dim)agent.train(env, NUM_EPISODES)
Algorithm test code
# Test the Hierarchical Actor-Critic agent and show the animation
def test_hac_agent(agent, env, num_episodes=5):
    for episode in range(num_episodes):
        state, _ = env.reset()  # new Gym API: reset() returns (obs, info)
        goal = agent.select_high_level_goal(state, epsilon=0.0)  # greedy high-level subgoal
        done = False
        total_reward = 0
        env.render()
        while not done:
            env.render()
            action = agent.select_low_level_action(state, goal, epsilon=0.0)  # greedy low-level action
            # new Gym API: step() returns (obs, reward, terminated, truncated, info)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            state = next_state
            total_reward += reward
        print(f"Test Episode {episode + 1}: Total Reward: {total_reward}")
    env.close()

# Create the CartPole environment for testing, with animation
env = gym.make('CartPole-v1', render_mode='human')
test_hac_agent(agent, env)
[Notice] Notes:
- High-level (Manager) network: the high level is responsible for setting subgoals and updating its policy according to those goals.
The external reward is the reward obtained directly from the environment.
(3) Intrinsic reward mechanism
The intrinsic reward is used to evaluate how well the low-level policy performs at time step t, i.e., how close it comes to the subgoal set by the high level.
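The precise form of the intrinsic reward is a design choice; note that the CartPole code in this article simply reuses the environment reward for both levels. A common goal-conditioned variant, shown here purely as an illustration with a hypothetical distance threshold, is:

import numpy as np

# Illustrative intrinsic reward (not used in the article's code):
# 0 if the reached state is close enough to the subgoal, otherwise -1.
def intrinsic_reward(next_state, goal, threshold=0.1):
    distance = np.linalg.norm(np.asarray(next_state) - np.asarray(goal))
    return 0.0 if distance <= threshold else -1.0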
- Low-level policy: responsible for executing the goal set by the high level in the more concrete action space; it takes an actual action at every step and learns how to approach the subgoal specified by the high level.
3. Key Formulas of HAC
HAC uses the standard Actor-Critic update rules, while separating goal selection and action selection into different levels.
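Concretely, the one-step updates implemented in the code section of this article are the standard ones (writing V for the critic and π for the actor of a given level):

TD target:    y_t = r_t + γ · V(s_{t+1})
Critic loss:  L_critic = (V(s_t) − y_t)²
Actor loss:   L_actor = −log π(a_t | s_t) · (y_t − V(s_t))

The high level applies the same update to a policy π(g_t | s_t) over subgoals, updated only every few steps, while the low level conditions its policy and value on both the state and the current subgoal, π(a_t | s_t, g_t).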
References:
Levy, Andrew, et al. "Learning Multi-Level Hierarchies with Hindsight." arXiv preprint arXiv:1712.00948 (2017).
🔥 For more articles on hierarchical reinforcement learning, see:
【RL Latest Tech】Hierarchical Reinforcement Learning (Hierarchical RL)
If anything in this article is inappropriate or incorrect, your understanding and corrections are welcome. Some of the text and images come from the internet and their original sources cannot be verified; if any dispute arises, please contact the author to have them removed.
(1) High-level goal generation
At time step t, the high-level policy selects a subgoal g_t based on the current state and hands it to the low-level policy; the low-level policy then produces the next action a_t conditioned on that subgoal.
High-level policy update: once its time horizon ends, the high-level policy is updated based on the reward signal from the environment.
Video games: especially games that require multi-step strategies, such as exploration games. In HAC:
- High level (Manager): responsible for setting subgoals.
Training and testing:
- Training: in the train method, the high level sets a new goal every fixed number of steps, while the low level continuously executes actions and updates its policy.
- Testing: test_hac_agent runs the trained agent greedily (epsilon = 0) in a rendered environment and reports the total reward per episode. ✨
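If you want to reuse the trained agent outside this script (as noted earlier, the code needs some adaptation for real projects), the networks can be saved and restored with standard PyTorch calls; the file names below are arbitrary examples.

# Persist the trained high- and low-level networks (example file names)
torch.save(agent.high_level_net.state_dict(), "hac_high_level.pth")
torch.save(agent.low_level_net.state_dict(), "hac_low_level.pth")

# Later, load them back into a freshly constructed HACAgent before testing
agent.high_level_net.load_state_dict(torch.load("hac_high_level.pth"))
agent.low_level_net.load_state_dict(torch.load("hac_low_level.pth"))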