Reinforcement Learning Roadmap
Written on Aug 29, 2025
My journey into machine learning began with a 12-week syllabus that ChatGPT prepared for me. That led me to the Coursera Deep Learning Specialization, which I just completed. The specialization was excellent, and I now have a solid grasp of the foundations of deep learning. What began as a 12-week effort to get a basic understanding of machine learning has led me down a deep rabbit hole.
Somewhere along the way, while surveying the various branches of machine learning, I stumbled across reinforcement learning. Today's LLMs are truly impressive and you can do great things with them, but they are not the holy grail of artificial general intelligence (AGI). Reinforcement learning appeals to me because it is loosely modeled on how humans learn, it can run on far more modest computing resources, and it aims squarely at AGI. Some of the world-leading research comes out of Richard Sutton's group at the University of Alberta. The Alberta Plan lays out a program to advance the field over the next 5–10 years, which lines up well with my intention to embark on a doctoral degree. To that end, I'm going to dive deeper into reinforcement learning as a potential research topic.
In a happy coincidence, the University of Alberta offers a Coursera specialization on reinforcement learning. My goal over the next four months is to complete the Hugging Face Deep Reinforcement Learning Course and the University of Alberta Reinforcement Learning Specialization, and to publish my work publicly on this blog and on the Open Science Framework.
4-Month RL + Hugging Face + OSF Roadmap (~10 hrs/week)
Cadence: ~10 hrs/week × 16 weeks
Stack: Python, NumPy, PyTorch, Gymnasium, Stable-Baselines3, Hugging Face Hub, Weights & Biases, GitLab CI/CD, Docker, OSF.io
Table of Contents
Phase 1: Foundations + OSF Setup (Weeks 1–4)
| Week | UAlberta / S&B Tasks | HF Units | OSF Deliverables | Hours |
| --- | --- | --- | --- | --- |
| 1 | C1 M1–M3: Bandits, MDPs; S&B Ch.1–3 | U0: Intro, setup | OSF Project + README.md + License | 7+3 |
| 2 | C1 M4–M5: DP (Policy/Value Iteration); S&B Ch.4 | U1: Q-Learning basics | OSF: Initial notebooks + figures | 7+3 |
| 3 | C2 M1–M3: MC + TD(0); S&B Ch.5–6 | U2: Q-Learning labs | OSF: Bandits vs MDPs report | 7+3 |
| 4 | n-step TD; S&B Ch.7; Random Walk experiments | (Optional HF review) | OSF: DP & n-step TD plots | 8+2 |
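Week 1's bandit material is a good first coding target. Below is a minimal sketch of the ε-greedy agent with incremental sample-average updates from S&B Ch.2; the arm means and all parameter values are made up for illustration.

```python
import random

def epsilon_greedy_bandit(arm_means, steps=10_000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a Gaussian k-armed bandit.

    Uses incremental sample-average value estimates (S&B Ch.2):
        Q[a] <- Q[a] + (r - Q[a]) / N[a]
    """
    rng = random.Random(seed)
    k = len(arm_means)
    Q = [0.0] * k   # estimated value of each arm
    N = [0] * k     # pull counts
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                   # explore
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit (greedy)
        r = rng.gauss(arm_means[a], 1.0)           # noisy reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                  # incremental mean
    return Q, N

# Arm 2 has the highest true mean, so it should end up pulled most often.
Q, N = epsilon_greedy_bandit([0.1, 0.5, 0.9])
```

With ε = 0.1 the agent spends roughly 90% of its pulls on the arm it currently believes is best, so `N[2]` dominates once the estimates settle.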
Phase 2: Control & Approximation (Weeks 5–8)
| Week | UAlberta / S&B Tasks | HF Units | OSF Deliverables | Hours |
| --- | --- | --- | --- | --- |
| 5 | C2 M4–M5: SARSA, Q-Learning, Dyna-Q; Ch.6, Ch.8 | U1 labs: CartPole Q-Learn | OSF: Baselines table + ε-decay plots | 6+4 |
| 6 | C3 M1–M3: Func. Approx I: Tile coding, semi-grad TD; Ch.9 | U2: DQN intro, envs | OSF: Features notebook, sweep plots | 6+4 |
| 7 | C3 M4: Func. Approx II: Control; Ch.10 | U3: DQN Atari hands-on | OSF: Semi-grad SARSA ablation plots | 6+4 |
| 8 | Off-Policy + Eligibility Traces; Ch.11–12 | (Optional HF review) | OSF: λ vs perf. report + configs | 7+3 |
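The heart of Week 5 is the Q-learning control loop. Here is a minimal tabular sketch on a toy five-state chain MDP of my own invention (not a course environment); the hyperparameters are illustrative defaults.

```python
import random

# Toy 5-state chain: start at state 0, actions are 0 (left) / 1 (right);
# reaching state 4 gives reward +1 and ends the episode.
N_STATES, GOAL = 5, 4

def env_step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy behavior policy, breaking ties randomly
            if rng.random() < epsilon or Q[s][0] == Q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2, r, done = env_step(s, a)
            # Q-learning (off-policy) update: bootstrap from the greedy action
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

Q = q_learning()
# The learned greedy policy moves right in every non-terminal state.
```

With γ = 0.9, the optimal action values for "right" are 1.0, 0.9, 0.81, 0.729 for states 3 down to 0, and the learned table converges to roughly those numbers.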
Phase 3: Capstone + Pre-Registration (Weeks 9–12)
| Week | UAlberta / S&B Tasks | HF Units | OSF Deliverables | Hours |
| --- | --- | --- | --- | --- |
| 9 | C4 M1–M4: Capstone Design (env, metrics, methods) | U3: Advanced DQN | OSF: Pre-registration (methods, metrics) | 6+4 |
| 10 | C4 M5: Capstone Build: training runs, ≥3 seeds, dashboards | U4: Policy Gradients | OSF: Code + Dockerfile upload | 6+4 |
| 11 | C4 M6: Capstone Analysis: HP sweeps, ablations, error bars | U4: PPO intro | OSF: Methodology report + plots | 6+4 |
| 12 | Capstone Final: results, slides, v1.0 repo tag | U5: PPO/LunarLander | OSF: v1.0 reproducibility package | 6+4 |
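The capstone weeks call for running ≥3 seeds and reporting error bars. A minimal sketch of the aggregation step, with made-up return curves standing in for real training logs:

```python
import math

def aggregate_runs(runs):
    """Aggregate per-seed return curves: mean and standard error per logged step.

    `runs` is a list of equal-length return curves, one per seed.
    """
    n = len(runs)
    means, stderrs = [], []
    for step_vals in zip(*runs):
        m = sum(step_vals) / n
        var = sum((v - m) ** 2 for v in step_vals) / (n - 1)  # sample variance
        means.append(m)
        stderrs.append(math.sqrt(var / n))                    # SE of the mean
    return means, stderrs

# Three hypothetical seeds of a 4-point learning curve.
runs = [
    [10.0, 30.0, 55.0, 80.0],
    [12.0, 28.0, 60.0, 85.0],
    [11.0, 32.0, 50.0, 75.0],
]
means, stderrs = aggregate_runs(runs)
# Plot means with error bars of mean ± stderr at each logged step.
```

Reporting the standard error (rather than a single run's curve) is what makes the pre-registered comparisons in Weeks 11–12 defensible.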
Phase 4: Deep RL + Paper Publication (Weeks 13–16)
| Week | UAlberta / S&B Tasks | HF Units | OSF Deliverables | Hours |
| --- | --- | --- | --- | --- |
| 13 | DQN + Atari variants (Double, Dueling) | U2/U3 review | OSF: Baseline table (DQN vs variants) | 5+5 |
| 14 | C3 M5: Policy Gradients → PPO experiments | U4/U8 PPO hands-on | OSF: PPO vs PG plots + sample efficiency | 5+5 |
| 15 | SAC / TD3 on continuous control tasks | U5/U6: Unity + A2C | OSF: Off-policy DRL results, configs | 5+5 |
| 16 | Paper writing, final upload, OSF DOI release | U7/U8: Multi-agent | OSF: Final paper PDF + blog post + DOI | 5+5 |
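Week 13's DQN-versus-Double-DQN comparison hinges on a single line: which network picks the next action. A sketch of the two target computations, using toy Python lists in place of network outputs (the values are invented to show the divergence):

```python
GAMMA = 0.99  # illustrative discount factor

def dqn_target(reward, done, target_q_next):
    """Standard DQN: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + GAMMA * max(target_q_next)

def double_dqn_target(reward, done, online_q_next, target_q_next):
    """Double DQN: the online network selects the action, the target
    network evaluates it, which reduces overestimation bias."""
    if done:
        return reward
    a_star = max(range(len(online_q_next)), key=lambda a: online_q_next[a])
    return reward + GAMMA * target_q_next[a_star]

# Toy next-state values where the two networks disagree on the best action.
online_q = [1.0, 2.0]   # online net prefers action 1
target_q = [5.0, 0.5]   # target net overestimates action 0
# dqn_target(0.0, False, target_q)                   -> 0.99 * 5.0 = 4.95
# double_dqn_target(0.0, False, online_q, target_q)  -> 0.99 * 0.5 = 0.495
```

When the target network overestimates an action, plain DQN propagates that overestimate into the bootstrap target; Double DQN only propagates it if the online network also prefers that action.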
Coverage Maps
Coursera Module Coverage (100%)
| Course / Module | Scheduled Week(s) |
| --- | --- |
| **C1. Fundamentals of RL** | |
| M1 Welcome | W1 (quick pass) |
| M2 Intro to Sequential Decision-Making (bandits) | W1 |
| M3 Markov Decision Processes | W1 |
| M4 Value Functions & Bellman Equations | W1–W2 |
| M5 Dynamic Programming | W2 |
| **C2. Sample-based Learning Methods** | |
| M1 Welcome | W3 (quick pass) |
| M2 Monte Carlo (pred & control) | W3 (Blackjack MC) |
| M3 TD for Prediction | W3 (TD(0) on FrozenLake) |
| M4 TD for Control (SARSA, Q-Learning) | W5 (CartPole SARSA/QL) |
| M5 Planning, Learning & Acting (Dyna) | W5 (Dyna-Q maze task) |
| **C3. Prediction & Control w/ Function Approximation** | |
| M1 Welcome | W6 (quick pass) |
| M2 On-policy Prediction w/ Approx | W6 (semi-grad TD) |
| M3 Constructing Features | W6–W7 (tile coding / NN features) |
| M4 Control w/ Approx | W7 (semi-grad SARSA control) |
| M5 Policy Gradient | W14 (REINFORCE → PPO) |
| **C4. Capstone** | |
| M1 Welcome | W9 (kickoff) |
| M2 Formalize Word Problem as MDP | W9 (Capstone design doc) |
| M3 Choosing the Right Algorithm | W9 (algo selection) |
| M4 Identify Key Performance Parameters | W9–W10 (HParam grid design) |
| M5 Implement Your Agent | W10 (Capstone build & test) |
| M6 Submit Your Parameter Study! | W11–W12 (runs + plots + analysis) |
Hugging Face Course Coverage (100%)
| Hugging Face Unit | Week(s) Mapped | Notes |
| --- | --- | --- |
| Unit 0: Welcome to the Course | W1 (quick pass) | Course intro + environment setup |
| Unit 1: Introduction to Deep RL | W2 | LunarLander hands-on, model upload |
| Bonus Unit 1: RL with Huggy the Doggo | W2 (optional) | Fun supplementary content |
| Unit 2: Introduction to Q-Learning | W3 | FrozenLake, Taxi implementations |
| Unit 3: Deep Q-Learning with Atari Games | W6–W7 | Space Invaders DQN via SB3 Zoo |
| Bonus Unit 2: Hyperparameter Tuning with Optuna | W7 (optional) | Hyperparameter tuning add-on |
| Unit 4: Policy Gradient with PyTorch | W14 | REINFORCE on CartPole / PixelCopter |
| Unit 5: Introduction to Unity ML-Agents | W15 | Unity envs (Snowball, Pyramid) |
| Unit 6: Actor-Critic Methods with Robotics Envs | W15 (optional) | Robotics or PyBullet experiments |
| Unit 7: Multi-Agent / AI vs AI | W16 (optional) | Multi-agent training & challenges |
| Unit 8 Part 1: PPO (Theory) | W14 | Matches PPO experiments in RL roadmap |
| Unit 8 Part 2: PPO with Doom | W15 (optional) | PPO in VizDoom environment |
| Bonus Unit 3: Advanced Topics in RL | W16 (optional) | Advanced theory extensions |
| Bonus Unit 5: Imitation Learning with Godot | W16 (optional) | Imitation learning tasks in Godot |
OSF Integration Points
- Week 1: Project setup (README, License, roadmap).
- Week 9: Pre-registration (methods, hypotheses, metrics).
- Weeks 10–12: Upload code, Docker, configs, results CSVs.
- Week 16: Publish final paper, mint DOI, link GitHub repo + blog.
Outcomes at Week 16
- OSF Project: Code, data, plots, Docker images, paper (PDF).
- GitHub Repo: v1.0 (Capstone) + v1.1 (Deep RL) tags, CI/CD pipelines.
- Blog Post: Summary, lessons learned, links to OSF & GitHub.
- Portfolio: Publicly citable, reproducible RL study.
Resources