Is this a game... or is it real?

Reinforcement Learning Roadmap


My journey into understanding machine learning began with a 12-week learning syllabus that ChatGPT prepared for me. That led me to the Coursera Deep Learning Specialization, which I just completed. The quality of the specialization was excellent, and I now have a good understanding of the foundations of deep learning. What began as a 12-week initiative to pick up a bit of machine learning has led me down a deep rabbit hole.

Somewhere along the line, while exploring the various branches of machine learning, I stumbled across reinforcement learning. The functionality in today's LLMs is truly amazing, and you can do great things with these tools, but they're not the holy grail of artificial general intelligence (AGI). Reinforcement learning appeals to me because it is somewhat modeled on how humans learn, it can be run with far less compute, and it aims squarely at AGI. Some of the world-leading research comes out of Richard Sutton's group at the University of Alberta. The Alberta Plan lays out a path to advance the field over the next 5–10 years, which lines up well with my intention to embark on a doctoral degree. To that end, I'm going to dive deeper into reinforcement learning as a potential research topic.

In a happy coincidence, the University of Alberta offers a Coursera specialization on reinforcement learning that I can take. My goal over the next four months is to complete the Hugging Face Deep Reinforcement Learning Course and the University of Alberta Reinforcement Learning Specialization, and to publish my work publicly on this blog and on the Open Science Framework.

4-Month RL + Hugging Face + OSF Roadmap (~10 hrs/week)

Cadence: ~10 hrs/week × 16 weeks
Stack: Python, NumPy, PyTorch, Gymnasium, Stable-Baselines3, Hugging Face Hub, Weights & Biases, GitLab CI/CD, Docker, OSF.io


Phase 1: Foundations + OSF Setup (Weeks 1–4)

| Week | UAlberta / S&B Tasks | HF Units | OSF Deliverables | Hours |
|------|----------------------|----------|------------------|-------|
| 1 | C1 M1–M3: Bandits, MDPs; S&B Ch. 1–3 | U0: Intro, setup | OSF Project + README.md + License | 7+3 |
| 2 | C1 M4–M5: DP (Policy/Value Iteration); S&B Ch. 4 | U1: Q-Learning basics | OSF: Initial notebooks + figures | 7+3 |
| 3 | C2 M1–M3: MC + TD(0); S&B Ch. 5–6 | U2: Q-Learning labs | OSF: Bandits vs MDPs report | 7+3 |
| 4 | n-step TD; S&B Ch. 7; Random Walk experiments | (Optional HF review) | OSF: DP & n-step TD plots | 8+2 |
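To make the Week 1 bandit material concrete, here is a minimal sketch of a sample-average ε-greedy agent on the stationary 10-armed Gaussian testbed from S&B Ch. 2. This is my own illustration, not a course assignment; the function name and defaults are mine.

```python
import numpy as np

def run_bandit(n_arms=10, steps=1000, epsilon=0.1, seed=0):
    """Sample-average epsilon-greedy agent on a stationary Gaussian bandit."""
    rng = np.random.default_rng(seed)
    true_values = rng.normal(0.0, 1.0, n_arms)   # hidden q*(a) for each arm
    q_est = np.zeros(n_arms)                     # running value estimates
    counts = np.zeros(n_arms, dtype=int)
    rewards = np.empty(steps)
    for t in range(steps):
        if rng.random() < epsilon:               # explore: random arm
            a = int(rng.integers(n_arms))
        else:                                    # exploit: current best estimate
            a = int(np.argmax(q_est))
        r = rng.normal(true_values[a], 1.0)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]   # incremental sample average
        rewards[t] = r

    return q_est, true_values, rewards

q_est, true_values, rewards = run_bandit()
```

The incremental update avoids storing reward histories, which is the same pattern the later TD methods build on.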

Phase 2: Control & Approximation (Weeks 5–8)

| Week | UAlberta / S&B Tasks | HF Units | OSF Deliverables | Hours |
|------|----------------------|----------|------------------|-------|
| 5 | C2 M4–M5: SARSA, Q-Learning, Dyna-Q; Ch. 6, 8 | U1 labs: CartPole Q-Learn | OSF: Baselines table + ε-decay plots | 6+4 |
| 6 | C3 M1–M3: Func. Approx. I: tile coding, semi-grad TD; Ch. 9 | U2: DQN intro, envs | OSF: Features notebook, sweep plots | 6+4 |
| 7 | C3 M4: Func. Approx. II: control; Ch. 10 | U3: DQN Atari hands-on | OSF: Semi-grad SARSA ablation plots | 6+4 |
| 8 | Off-policy + eligibility traces; Ch. 11–12 | (Optional HF review) | OSF: λ vs perf. report + configs | 7+3 |
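The Week 5 Q-Learning and ε-decay work comes down to a single update rule. Here's a hedged sketch on a toy deterministic chain (an illustrative stand-in for FrozenLake or CartPole; the environment, function name, and schedule are my own, not from either course):

```python
import numpy as np

def q_learning_chain(n_states=5, episodes=300, alpha=0.5, gamma=0.9,
                     eps_start=1.0, eps_end=0.05, seed=0):
    """Tabular Q-learning with linear epsilon decay on a deterministic chain:
    states 0..n-1, actions {0: left, 1: right}, reward 1 at the right end."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))
    for ep in range(episodes):
        # linear epsilon decay across episodes
        eps = eps_start + (eps_end - eps_start) * ep / (episodes - 1)
        s, done = 0, False
        while not done:
            a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next = max(s - 1, 0) if a == 0 else s + 1
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])   # TD-error update
            s = s_next
    return Q

Q = q_learning_chain()
```

After training, the greedy policy (argmax over each row of Q) should point right in every non-terminal state, and the ε schedule is exactly what the Week 5 ε-decay plots would visualize.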

Phase 3: Capstone + Pre-Registration (Weeks 9–12)

| Week | UAlberta / S&B Tasks | HF Units | OSF Deliverables | Hours |
|------|----------------------|----------|------------------|-------|
| 9 | C4 M1–M4: Capstone design (env, metrics, methods) | U3: Advanced DQN | OSF: Pre-registration (methods, metrics) | 6+4 |
| 10 | C4 M5: Capstone build: training runs, ≥3 seeds, dashboards | U4: Policy Gradients | OSF: Code + Dockerfile upload | 6+4 |
| 11 | C4 M6: Capstone analysis: HP sweeps, ablations, error bars | U4: PPO intro | OSF: Methodology report + plots | 6+4 |
| 12 | Capstone final: results, slides, v1.0 repo tag | U5: PPO/LunarLander | OSF: v1.0 reproducibility package | 6+4 |
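For the ≥3-seeds and error-bars requirements in Weeks 10–11, the aggregation pattern is simple. A sketch with a dummy `train_once` standing in for a real training run (the function and its synthetic learning curve are purely illustrative):

```python
import numpy as np

def train_once(seed, steps=200):
    """Stand-in for one training run: returns a noisy learning curve.
    In the real capstone this would be a full agent training loop."""
    rng = np.random.default_rng(seed)
    progress = 1.0 - np.exp(-np.arange(steps) / 50.0)  # idealized improvement
    return progress + rng.normal(0.0, 0.05, steps)     # per-seed noise

# Run with at least three seeds and aggregate across the seed axis.
curves = np.stack([train_once(s) for s in (0, 1, 2)])
mean = curves.mean(axis=0)
stderr = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
# Plot `mean` with a `mean ± stderr` shaded band for the OSF report.
```

Saving `curves` as a CSV per run keeps the raw data available for the OSF upload, so the error bars can be recomputed by anyone.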

Phase 4: Deep RL + Paper Publication (Weeks 13–16)

| Week | UAlberta / S&B Tasks | HF Units | OSF Deliverables | Hours |
|------|----------------------|----------|------------------|-------|
| 13 | DQN + Atari variants (Double, Dueling) | U2/U3 review | OSF: Baseline table (DQN vs variants) | 5+5 |
| 14 | C3 M5: Policy Gradients → PPO experiments | U4/U8 PPO hands-on | OSF: PPO vs PG plots + sample efficiency | 5+5 |
| 15 | SAC / TD3 on continuous control tasks | U5/U6: Unity + A2C | OSF: Off-policy DRL results, configs | 5+5 |
| 16 | Paper writing, final upload, OSF DOI release | U7/U8: Multi-agent | OSF: Final paper PDF + blog post + DOI | 5+5 |
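Before reaching PPO in Week 14, the core policy-gradient idea (REINFORCE with a baseline) can be shown on a two-armed bandit. A minimal NumPy sketch of my own, not tied to any specific course lab; the arm probabilities and learning rates are arbitrary:

```python
import numpy as np

def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    """Minimal REINFORCE on a two-armed Bernoulli bandit: softmax policy
    over logits, score-function gradient, running-mean reward baseline."""
    rng = np.random.default_rng(seed)
    p_win = np.array([0.2, 0.8])          # arm 1 pays off more often
    logits = np.zeros(2)
    baseline = 0.0
    for t in range(steps):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()              # softmax policy
        a = rng.choice(2, p=probs)
        r = float(rng.random() < p_win[a])
        baseline += 0.01 * (r - baseline) # running-mean baseline
        grad = -probs                     # d log pi(a) / d logits ...
        grad[a] += 1.0                    # ... = onehot(a) - probs
        logits += lr * (r - baseline) * grad
    return logits

logits = reinforce_bandit()
```

The better arm ends up with the higher logit; PPO keeps this score-function gradient but adds a clipped surrogate objective so each policy update stays close to the old policy.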

Coverage Maps

Coursera Module Coverage (100%)

| Course / Module | Scheduled Week(s) |
|-----------------|-------------------|
| C1. Fundamentals of RL | |
| M1 Welcome | W1 (quick pass) |
| M2 Intro to Sequential Decision-Making (bandits) | W1 |
| M3 Markov Decision Processes | W1 |
| M4 Value Functions & Bellman Equations | W1–W2 |
| M5 Dynamic Programming | W2 |
| C2. Sample-based Learning Methods | |
| M1 Welcome | W3 (quick pass) |
| M2 Monte Carlo (pred. & control) | W3 (Blackjack MC) |
| M3 TD for Prediction | W3 (TD(0) on FrozenLake) |
| M4 TD for Control (SARSA, Q-Learning) | W5 (CartPole SARSA/QL) |
| M5 Planning, Learning & Acting (Dyna) | W5 (Dyna-Q maze task) |
| C3. Prediction & Control w/ Function Approximation | |
| M1 Welcome | W6 (quick pass) |
| M2 On-policy Prediction w/ Approx. | W6 (semi-grad TD) |
| M3 Constructing Features | W6–W7 (tile coding / NN features) |
| M4 Control w/ Approx. | W7 (semi-grad SARSA control) |
| M5 Policy Gradient | W14 (REINFORCE → PPO) |
| C4. Capstone | |
| M1 Welcome | W9 (kickoff) |
| M2 Formalize Word Problem as MDP | W9 (Capstone design doc) |
| M3 Choosing the Right Algorithm | W9 (algo selection) |
| M4 Identify Key Performance Parameters | W9–W10 (HParam grid design) |
| M5 Implement Your Agent | W10 (Capstone build & test) |
| M6 Submit Your Parameter Study! | W11–W12 (runs + plots + analysis) |

Hugging Face Course Coverage (100%)

| Hugging Face Unit | Week(s) Mapped | Notes |
|-------------------|----------------|-------|
| Unit 0: Welcome to the Course | W1 (quick pass) | Course intro + environment setup |
| Unit 1: Introduction to Deep RL | W2 | LunarLander hands-on, model upload |
| Bonus Unit 1: RL with Huggy the Doggo | W2 (optional) | Fun supplementary content |
| Unit 2: Introduction to Q-Learning | W3 | FrozenLake, Taxi implementations |
| Unit 3: Deep Q-Learning with Atari Games | W6–W7 | Space Invaders DQN via SB3 Zoo |
| Bonus Unit 2: Hyperparameter Tuning with Optuna | W7 (optional) | Hyperparameter tuning add-on |
| Unit 4: Policy Gradient with PyTorch | W14 | REINFORCE on CartPole / PixelCopter |
| Unit 5: Introduction to Unity ML-Agents | W15 | Unity envs (Snowball, Pyramid) |
| Unit 6: Actor-Critic Methods with Robotics Envs | W15 (optional) | Robotics or PyBullet experiments |
| Unit 7: Multi-Agent / AI vs AI | W16 (optional) | Multi-agent training & challenges |
| Unit 8 Part 1: PPO (Theory) | W14 | Matches PPO experiments in RL roadmap |
| Unit 8 Part 2: PPO with Doom | W15 (optional) | PPO in VizDoom environment |
| Bonus Unit 3: Advanced Topics in RL | W16 (optional) | Advanced theory extensions |
| Bonus Unit 5: Imitation Learning with Godot | W16 (optional) | Imitation learning tasks in Godot |

OSF Integration Points

  • Week 1: Project setup (README, License, roadmap).
  • Week 9: Pre-registration (methods, hypotheses, metrics).
  • Weeks 10–12: Upload code, Docker, configs, results CSVs.
  • Week 16: Publish final paper, mint DOI, link GitHub repo + blog.

Outcomes at Week 16

  • OSF Project: Code, data, plots, Docker images, paper (PDF).
  • GitHub Repo: v1.0 (Capstone) + v1.1 (Deep RL) tags, CI/CD pipelines.
  • Blog Post: Summary, lessons learned, links to OSF & GitHub.
  • Portfolio: Publicly citable, reproducible RL study.

Resources