SARSA Reinforcement Learning



Prerequisite: Q-Learning technique
The SARSA algorithm is a slight variation of the popular Q-Learning algorithm. For a learning agent in any reinforcement learning algorithm, the policy can be of two types:

  1. On-policy: here the learning agent learns the value function according to the current action derived from the policy currently in use.
  2. Off-policy: here the learning agent learns the value function according to the action derived from another policy.

The Q-Learning technique is an off-policy technique and uses the greedy approach to learn the Q-value. The SARSA technique, on the other hand, is an on-policy technique and uses the action taken by the current policy to learn the Q-value.
This difference is visible in the update statements of the two techniques:

  1. Q-Learning: Q(s_{t},a_{t}) = Q(s_{t},a_{t}) + \alpha \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1},a) - Q(s_{t},a_{t}) \right)
  2. SARSA: Q(s_{t},a_{t}) = Q(s_{t},a_{t}) + \alpha \left( r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_{t},a_{t}) \right)

Here, the update equation for SARSA depends on the current state, the current action, the reward obtained, the next state and the next action. This observation is what gives the technique its name: SARSA stands for State Action Reward State Action, symbolizing the tuple (s, a, r, s', a').
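To make the difference concrete, both one-step updates can be written as single NumPy statements. This is only an illustrative sketch; the names Q, alpha, gamma, s, a, r, s2 and a2 are placeholders and are not part of the tutorial code that follows.

Python3

# Sketch of the two one-step updates, assuming Q is an (n_states, n_actions) array,
# (s, a, r, s2) is the observed transition and a2 is the action actually chosen next.

# Q-Learning (off-policy): bootstraps from the greedy action in the next state
Q[s, a] += alpha * (r + gamma * np.max(Q[s2, :]) - Q[s, a])

# SARSA (on-policy): bootstraps from the action a2 the current policy actually takes
Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])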
The following Python code demonstrates how to implement the SARSA algorithm, using OpenAI's gym module to load the environment.
Step 1: Importing the required libraries

Python3
import numpy as np
import gym



Step 2: Building the environment
Here, we will use the 'FrozenLake-v0' environment that comes preloaded with gym. A description of the environment is available in the gym documentation.

Python3

#Building the environment
env = gym.make('FrozenLake-v0')

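After the environment is built, its discrete state and action spaces can be inspected. FrozenLake-v0 is a 4x4 grid, so there are 16 states and 4 actions (left, down, right, up); the quick check below is not part of the original tutorial.

Python3

# Inspecting the sizes of the discrete spaces that will shape the Q-matrix
print(env.observation_space.n)  # 16 states for the 4x4 FrozenLake grid
print(env.action_space.n)       # 4 actions: left, down, right, up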
Step 3: Initializing the different parameters

Python3

#Defining the different parameters
epsilon = 0.9
total_episodes = 10000
max_steps = 100
alpha = 0.85
gamma = 0.95
 
#Initializing the Q-matrix
Q = np.zeros((env.observation_space.n, env.action_space.n))

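Here epsilon is the exploration rate of the ε-greedy policy (with probability 0.9 a random action is taken), alpha is the learning rate and gamma is the discount factor. The tutorial keeps epsilon fixed; a common variation, not used in the code below, is to decay it over episodes so the agent explores less as the Q-matrix improves. A minimal sketch with hypothetical min_epsilon and decay_rate values:

Python3

# Illustrative epsilon-decay schedule (not part of the tutorial's code)
min_epsilon = 0.05   # hypothetical floor on exploration
decay_rate = 0.001   # hypothetical decay speed

def decayed_epsilon(episode):
    # Exponentially decay exploration from the initial epsilon towards min_epsilon
    return min_epsilon + (epsilon - min_epsilon) * np.exp(-decay_rate * episode)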
Step 4: Defining the utility functions used in the learning process

Python3

#Function to choose the next action
def choose_action(state):
    action=0
    if np.random.uniform(0, 1) < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(Q[state, :])
    return action
 
#Function to learn the Q-value
def update(state, state2, reward, action, action2):
    predict = Q[state, action]
    target = reward + gamma * Q[state2, action2]
    Q[state, action] = Q[state, action] + alpha * (target - predict)

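choose_action implements the ε-greedy policy over the Q-matrix, and update applies the SARSA rule to a single (s, a, r, s', a') tuple. A tiny usage sketch with made-up values (state 0, next state 4, reward 0.0), purely for illustration:

Python3

# Illustrative single SARSA step with made-up values
s = 0                      # current state
a = choose_action(s)       # epsilon-greedy action for s
s2, r = 4, 0.0             # pretend env.step(a) returned this next state and reward
a2 = choose_action(s2)     # action the current policy would take in s2
update(s, s2, r, a, a2)    # one SARSA update of Q[s, a]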
Step 5: Training the learning agent

Python3

# Accumulator for the reward collected across all episodes
total_reward = 0

# Starting the SARSA learning
for episode in range(total_episodes):
    t = 0
    state1 = env.reset()
    action1 = choose_action(state1)

    while t < max_steps:
        # Visualizing the training
        env.render()

        # Taking the action and getting the next state, reward and done flag
        state2, reward, done, info = env.step(action1)

        # Choosing the next action with the current epsilon-greedy policy
        action2 = choose_action(state2)

        # Learning the Q-value with the SARSA update
        update(state1, state2, reward, action1, action2)

        state1 = state2
        action1 = action2

        # Updating the respective values
        t += 1
        total_reward += reward

        # If at the end of the episode
        if done:
            break

In the output above, the red mark identifies the agent's current position in the environment, while the direction given in brackets indicates the move the agent will make next. Note that the agent stays in place if the move would take it out of bounds.
Step 6: Evaluating the performance

Python3

# Evaluating the performance as the average reward per episode
# (for FrozenLake this is the fraction of episodes that reached the goal)
print("Performance : ", total_reward / total_episodes)

# Visualizing the Q-matrix
print(Q)
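Because the training loop keeps exploring with epsilon = 0.9, the average reward printed above understates what the learned Q-matrix can achieve. A common follow-up, not part of the original tutorial, is to run a few extra episodes with the purely greedy policy and report the success rate; eval_episodes below is an illustrative choice.

Python3

# Evaluating the learned Q-matrix with a greedy (exploitation-only) policy
eval_episodes = 100
successes = 0

for _ in range(eval_episodes):
    state = env.reset()
    for _ in range(max_steps):
        action = np.argmax(Q[state, :])               # always exploit the Q-matrix
        state, reward, done, info = env.step(action)
        if done:
            successes += reward                       # FrozenLake gives reward 1 only at the goal
            break

print("Greedy success rate:", successes / eval_episodes)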