Reinforcement Learning algorithms (SARSA, Q-Learning, DQN) for classical and MuJoCo environments, tested with OpenAI Gym.
SARSA Cart Pole
SARSA (State-Action-Reward-State-Action) is a simple on-policy reinforcement learning algorithm in which the agent learns towards the optimal policy while following its current (epsilon-greedy) policy, which generates both the action taken in the current state and the action used for the update in the next state.
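A minimal sketch of the tabular SARSA update, assuming Q is a 2-D NumPy array indexed by a discrete state index and an action index; the function names and hyperparameter values are illustrative, not taken from this repo's code.

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: a_next is chosen by the same epsilon-greedy
    policy that is being learned, not by a greedy max."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```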
Implemented SARSA for the Cart Pole problem, a classical environment provided by OpenAI gym.
Problem Goal: The Cart Pole problem has 4 state variables at every time step,
[the position of the cart on the horizontal axis, the cart's velocity on that same axis, the pole's angular position on the cart, the angular velocity of the pole on the cart],
and there are 2 actions the cart can take [going to the left, going to the right].
The main goal is to balance the pole on the cart for as long as possible by taking appropriate actions at every timestep.
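The interaction with the environment follows the classic Gym loop, sketched below; the actions here are random placeholders rather than a learned policy.

```python
import gym

env = gym.make('CartPole-v0')
state = env.reset()                       # 4-dimensional observation
done, total_reward = False, 0
while not done:
    action = env.action_space.sample()    # 0 = push left, 1 = push right
    state, reward, done, info = env.step(action)
    total_reward += reward
print('Episode return:', total_reward)
```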
Implementation
Discretized the 4 state variables into [2, 2, 8, 4] buckets respectively, keeping a specific range of values for each variable (a short sketch follows this list).
Used a decaying exploration rate to reduce random exploration towards the end of the episodes.
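A sketch of the discretization and the epsilon decay described above. The bucket counts [2, 2, 8, 4] match the text; the clipping ranges and decay constants are example choices, not necessarily the ones used in this repo.

```python
import numpy as np

buckets = (2, 2, 8, 4)
# Clipped ranges for [cart position, cart velocity, pole angle, pole angular velocity]
lows  = np.array([-2.4, -3.0, -0.21, -2.0])
highs = np.array([ 2.4,  3.0,  0.21,  2.0])

def discretize(obs):
    """Map a continuous 4-d observation to a single discrete state index."""
    ratios = (np.clip(obs, lows, highs) - lows) / (highs - lows)
    idx = (ratios * (np.array(buckets) - 1)).round().astype(int)
    return int(np.ravel_multi_index(tuple(idx), buckets))

def decayed_epsilon(episode, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Exploration rate that shrinks towards the end of training."""
    return max(eps_min, eps_start * decay ** episode)
```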
Q-Learning (SARSAMAX) Cart Pole
Q-Learning is a simple off-policy reinforcement learning algorithm in which the agent follows its current (epsilon-greedy) policy to generate the action from the current state, but bootstraps its update from the action with the maximum Q-value in the next state, which is why it is also called SARSAMAX.
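A minimal sketch of the Q-Learning (SARSAMAX) update under the same tabular assumptions as the SARSA sketch above; unlike SARSA, the bootstrap term uses the max over next actions rather than the action actually taken.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: bootstrap from the greedy action in s_next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```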
Implemented Q-learning for the Cart Pole problem, a classical environment provided by OpenAI gym.
Implementation
Discretized the 4 state variables into [2, 2, 8, 4] buckets respectively, keeping a specific range of values for each variable.
Used a decaying exploration rate to reduce random exploration towards the end of the episodes.
Results: The graph indicates that the cart is able to balance the pole for the required amount of time fairly consistently within roughly 2000 episodes for both algorithms.
SARSA Mountain Car
Problem Goal: The Mountain Car problem has 2 state variables at every time step,
[the position of the car, the car's velocity],
and there are 3 actions the car can take [going to the left, no action, going to the right].
The main goal is to make the car reach the goal (up-hill) by taking appropriate actions at every timestep.
Implementation
Discretized the 2 state variables into [20, 20] buckets respectively, keeping a specific range of values for each variable.
Used a decaying exploration rate to reduce random exploration towards the end of the episodes.
Used a gradually increasing learning rate: as the exploration rate decreases, confidence in the value estimates increases, so more learning happens towards the end of the episodes (a sketch of both schedules follows this list).
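A sketch of the opposing schedules described above: exploration decays while the learning rate grows over episodes. The bounds and growth/decay constants here are illustrative placeholders, not the repo's tuned values.

```python
def epsilon_schedule(episode, eps_start=1.0, eps_min=0.01, decay=0.999):
    """Exploration rate shrinks towards the end of training."""
    return max(eps_min, eps_start * decay ** episode)

def alpha_schedule(episode, alpha_start=0.1, alpha_max=0.5, growth=1.001):
    """Learning rate grows as the policy becomes more exploitative."""
    return min(alpha_max, alpha_start * growth ** episode)
```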
Q-Learning (SARSAMAX) Mountain Car
Implementation
Discretized the 2 state variables into [20, 20] buckets respectively, keeping a specific range of values for each variable.
Used a decaying exploration rate to reduce random exploration towards the end of the episodes.
Used a gradually increasing learning rate: as the exploration rate decreases, confidence in the value estimates increases, so more learning happens towards the end of the episodes.
Results: The graph indicates that the car is able to reach the goal fairly consistently within roughly 3000 episodes for both algorithms.
SARSA Mountain Car with Backward View (Eligibility Traces)
Implementation
Discretized the 2 state variables into [65, 65] buckets respectively, keeping a specific range of values for each variable.
Used Eligibility Traces, tuning the value of lambda (a sketch of the backward-view update follows this list).
Used a decaying exploration rate to reduce random exploration towards the end of the episodes.
Used a gradually increasing learning rate: as the exploration rate decreases, confidence in the value estimates increases, so more learning happens towards the end of the episodes.
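A minimal sketch of the backward-view SARSA(lambda) update, assuming a tabular Q and an eligibility-trace table E of the same shape (2-D arrays indexed by a flat state index and an action); alpha, gamma and lambda are placeholder values.

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.99, lam=0.9):
    """One backward-view update: bump the trace of the visited pair, then
    move every recently visited state-action pair towards the TD error."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    E[s, a] += 1.0           # accumulating eligibility trace
    Q += alpha * delta * E   # credit assignment over the whole trace
    E *= gamma * lam         # decay all traces
    return Q, E
```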
Deep Q-Learning Cart Pole
Implementation
Created 2 deep networks that take the state as input and output the Q-values for the available actions: one online network and one target network that provides the target Q-values.
Used Experience Replay, which performs offline updates by sampling batches from memory and training the network to minimise the MSE loss.
Used a decaying exploration rate to reduce random exploration towards the end of the episodes.
Specifically for this problem, gave a reward of -10 if the done flag is True and the timestep count is not 200, so the agent learns to avoid early termination and keeps balancing the pole on the cart (a condensed sketch of these pieces follows this list).
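A condensed sketch of the DQN pieces described above: an online and a target network, a replay buffer trained with an MSE loss, epsilon decay, and the -10 penalty on early termination. It is written with PyTorch purely for illustration; the repo's actual framework, architecture, and hyperparameters may differ, and the target-network sync interval here is an assumed placeholder.

```python
import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn

def make_net(n_states=4, n_actions=2):
    return nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

online_net, target_net = make_net(), make_net()
target_net.load_state_dict(online_net.state_dict())
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-3)
memory = deque(maxlen=10000)          # experience replay buffer
gamma, batch_size = 0.99, 64

def train_step():
    """Sample a batch from memory and minimise the MSE TD error."""
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)
    s, a, r, s2, done = map(np.array, zip(*batch))
    s  = torch.as_tensor(s,  dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    q = online_net(s).gather(1, torch.as_tensor(a).long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        mask = 1 - torch.as_tensor(done, dtype=torch.float32)
        target = torch.as_tensor(r, dtype=torch.float32) + \
                 gamma * target_net(s2).max(1).values * mask
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

env = gym.make('CartPole-v0')
epsilon = 1.0
for episode in range(500):
    state, t, done = env.reset(), 0, False
    while not done:
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = int(online_net(torch.as_tensor(state, dtype=torch.float32)).argmax())
        next_state, reward, done, _ = env.step(action)
        t += 1
        if done and t < 200:
            reward = -10          # penalty for dropping the pole early
        memory.append((state, action, reward, next_state, float(done)))
        state = next_state
        train_step()
    epsilon = max(0.01, epsilon * 0.995)      # decaying exploration
    if episode % 10 == 0:                     # periodic target-network sync (assumed interval)
        target_net.load_state_dict(online_net.state_dict())
```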