Playing CartPole with Policy Gradients: Reinforcement Learning Instead of a PID Controller for Balancing the Pole




 

In the CartPole game, a cart balances a freely swinging pole on top of it. Whenever the pole starts to tip toward one side, the cart moves to keep it dynamically upright. The policy function here is a simple two-layer neural network. The input state has 4 components: cart position, cart velocity, pole angle, and pole angular velocity; the output action is either move left or move right. Experimenting with the inputs, at least 3 of the 4 state components are needed for the pole to stay up for a while; with only 2 the agent never learns anything useful, and with all 4 it learns a very stable policy.
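To make the state and action description concrete, this is what CartPole-v1 reports for its spaces (a quick check only; the training code below prints the same information):

import gym

env = gym.make('CartPole-v1')
print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): 0 = push the cart left, 1 = push the cart right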

 

 

The policy-gradient implementation, using PyTorch to build a simple neural network:

import gym
import torch
import torch.nn as nn
import torch.optim as optim
import pygame
import sys
from collections import deque
import numpy as np

# Policy network definition
class PolicyNetwork(nn.Module):
    def __init__(self):
        super(PolicyNetwork, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 10),  # 4 state inputs, 10 hidden units
            nn.Tanh(),
            nn.Linear(10, 2),  # probabilities for the 2 actions
            nn.Softmax(dim=-1)
        )

    def forward(self, x):
        # x columns: cart position, cart velocity, pole angle, pole angular velocity
        selected_values = x[:, [0, 1, 2, 3]]  # use all 4 state components (index fewer here to test smaller inputs)
        return self.fc(selected_values)
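# Shape check: a (1, 4) state tensor yields a (1, 2) tensor of action
# probabilities that sums to 1, e.g. PolicyNetwork()(torch.zeros(1, 4)).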

# Training function
def train(policy_net, optimizer, trajectories):
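    # REINFORCE update: accumulate, over every step t of every trajectory,
    #   loss += -log pi(a_t | s_t) * (G_t - baseline)
    # where G_t is the discounted return-to-go computed in main() and the
    # baseline is the mean return of the last 10 episodes, then take one
    # gradient step on the summed loss.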
    policy_net.zero_grad()
    loss = 0
    print(trajectories[0])
    for trajectory in trajectories:
        
        # Advantage = return-to-go minus the baseline (variance reduction)
        returns = torch.tensor(trajectory["returns"]).float() - torch.tensor(trajectory["step_mean_reward"]).float()
        log_probs = trajectory["log_prob"]
        loss += -(log_probs * returns).sum()  # REINFORCE policy-gradient loss
    loss.backward()
    optimizer.step()
    return loss.item()

# Main function: training loop
def main():
    env = gym.make('CartPole-v1')
    policy_net = PolicyNetwork()
    optimizer = optim.Adam(policy_net.parameters(), lr=0.01)

    print(env.action_space)
    print(env.observation_space)
    pygame.init()
    screen = pygame.display.set_mode((600, 400))
    clock = pygame.time.Clock()

    rewards_one_episode= []
    for episode in range(10000):
        
        state = env.reset()
        done = False
        trajectories = []
        state = state[0]
        step = 0
        torch.save(policy_net, 'policy_net_full.pth')  # save the full model so play() can reload it
        while not done:
            state_tensor = torch.tensor(state).float().unsqueeze(0)
            probs = policy_net(state_tensor)
            action = torch.distributions.Categorical(probs).sample().item()
            log_prob = torch.log(probs.squeeze(0)[action])
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # episode ends if the pole falls or the 500-step limit is hit

            # print(episode)
            trajectories.append({"state": state, "action": action, "reward": reward, "log_prob": log_prob})
            state = next_state

            for event in pygame.event.get():
                if event.type == pygame.QUIT:
                    pygame.quit()
                    sys.exit()
            step +=1
            
            # Render only once the policy is already decent (last episode's return > 99)
            if rewards_one_episode and rewards_one_episode[-1] > 99:
                screen.fill((255, 255, 255))
                cart_x = int(state[0] * 100 + 300)
                pygame.draw.rect(screen, (0, 0, 255), (cart_x, 300, 50, 30))
                # print(state)
                pygame.draw.line(screen, (255, 0, 0), (cart_x + 25, 300), (cart_x + 25 - int(50 * np.sin(state[2])), 300 - int(50 * np.cos(state[2]))), 2)
                pygame.display.flip()
                clock.tick(200)
                

        print(f"第{episode}回合",f"运行{step}步后挂了")
        # 为策略梯度计算累积回报
        returns = 0
        
        
        for traj in reversed(trajectories):
            returns = traj["reward"] + 0.99 * returns
            traj["returns"] = returns
            if rewards_one_episode:
                # print(rewards_one_episode[:10])
                traj["step_mean_reward"] = np.mean(rewards_one_episode[-10:])
            else:
                traj["step_mean_reward"] = 0
        rewards_one_episode.append(returns)
        # print(rewards_one_episode[:10])
        train(policy_net, optimizer, trajectories)

def play():

    env = gym.make('CartPole-v1')
    policy_net = PolicyNetwork()
    pygame.init()
    screen = pygame.display.set_mode((600, 400))
    clock = pygame.time.Clock()

    state = env.reset()
    done = False
    trajectories = deque()
    state = state[0]
    step = 0
    policy_net = torch.load('policy_net_full.pth')  # load the full model saved during training
    while not done:
        state_tensor = torch.tensor(state).float().unsqueeze(0)
        probs = policy_net(state_tensor)
        action = torch.distributions.Categorical(probs).sample().item()
        log_prob = torch.log(probs.squeeze(0)[action])
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # episode ends if the pole falls or the 500-step limit is hit

        # print(episode)
        trajectories.append({"state": state, "action": action, "reward": reward, "log_prob": log_prob})
        state = next_state

        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                pygame.quit()
                sys.exit()

        
        # Render the environment state
        screen.fill((255, 255, 255))
        cart_x = int(state[0] * 100 + 300)
        pygame.draw.rect(screen, (0, 0, 255), (cart_x, 300, 50, 30))
        # print(state)
        pygame.draw.line(screen, (255, 0, 0), (cart_x + 25, 300), (cart_x + 25 - int(50 * np.sin(state[2])), 300 - int(50 * np.cos(state[2]))), 2)
        pygame.display.flip()
        clock.tick(60)
        step +=1

    print(f"运行{step}步后挂了")



if __name__ == '__main__':
    main()   # train
    # play() # run the trained policy
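For comparison, the PID control mentioned in the title balances the pole with a hand-tuned feedback law rather than a learned policy. Below is a minimal sketch of such a controller acting on the pole angle only; the gains kp, ki and kd are hypothetical values chosen for illustration and are not part of the original code.

import gym

# Hypothetical PID gains on the pole angle; they would need manual tuning.
kp, ki, kd = 10.0, 0.1, 2.0

env = gym.make('CartPole-v1')
state, _ = env.reset()
integral, prev_error = 0.0, 0.0
done, steps = False, 0
while not done:
    error = state[2]                  # pole angle; the target is 0 (upright)
    integral += error
    derivative = error - prev_error
    prev_error = error
    control = kp * error + ki * integral + kd * derivative
    action = 1 if control > 0 else 0  # push right when the pole leans right
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    steps += 1
print(f"PID controller kept the pole up for {steps} steps")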

Results: the training process is not very stable. Sometimes the agent still has not learned after many episodes, while other times it learns a good policy within just a few dozen episodes.
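One common way to make REINFORCE training less erratic, offered here only as a suggestion and not something the code above does, is to normalize the per-step advantages before computing the loss. A minimal sketch, reusing the trajectory keys from the training code:

import torch

def normalize_advantages(trajectories):
    # Collect each step's advantage (return-to-go minus baseline), then rescale
    # the whole batch to zero mean and unit variance before the policy-gradient loss.
    adv = torch.tensor([t["returns"] - t["step_mean_reward"] for t in trajectories]).float()
    adv = (adv - adv.mean()) / (adv.std(unbiased=False) + 1e-8)
    for t, a in zip(trajectories, adv):
        t["advantage"] = a
    return trajectories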

 
