Is this a game... or is it real?

D.Eng. Preparation - 2 Month Plan


I've completed the University of Alberta Reinforcement Learning Specialization, which involved reading most of Reinforcement Learning: An Introduction. In my original and updated RL learning roadmaps I was aiming to actually start implementing something, and I did get a start on some PyTorch implementations of basic algorithms, but I ended up focusing on the Coursera courses and the textbook. Having completed those, it's pretty apparent that I need a deeper refresh of the math. It's been 22 years since I graduated from comp sci and I haven't been practicing math at all. It makes a lot more sense to focus on that weakness and start a literature review than to dive deeper into implementations right now.

Rather than launching into a capstone project, the next 8 weeks will focus on refreshing my mathematical foundations and beginning my literature review with Sutton's reading list for his RL approach to AI. This should position me well for the D.Eng. program starting in January 2026.

Overview

Duration: 8 weeks (November 2025 - December 2025)
Total Time Commitment: ~14 hours/week (~11-12 hours of math coursework plus 2-3 hours of paper reading)
Math Track: DeepLearning.AI Mathematics for Machine Learning and Data Science (3-course specialization, ~93 hours)
Literature Track: Sutton's 10 foundational papers spanning 35 years of RL research


8-Week Curriculum

Week-by-Week Breakdown

| Week | Math Course | Focus Area | Sutton's Readings |
|------|-------------|------------|-------------------|
| 1 | Linear Algebra M1-M2 (10 hrs) | Foundations: vectors, matrices, value functions | [1] S&B textbook (review) + [2] TD learning (Sutton 1988) |
| 2 | Linear Algebra M3-M4 (11 hrs) | Linear transformations, temporal abstraction | [3] IDBD/gradient descent (Sutton 1992) + [4] Options/temporal abstraction (Sutton et al 1999) |
| 3 | Linear Algebra M5 (11 hrs) | Eigenvalues, dimensionality reduction, GVFs | [5] Horde/GVF architecture (Sutton et al 2011) + [6] Alberta Plan (Sutton et al 2023) |
| 4 | Calculus M1-M2 (13 hrs) | Derivatives, gradients, optimization fundamentals | [10] Dyna-style planning (Sutton et al 2008) |
| 5 | Calculus M3 (13 hrs) | Gradient descent in neural networks, modern perspective | [8] Gaps in planning video (Sutton 2021) + [9] Model-based RL (Sutton 2020) |
| 6 | Probability & Statistics M1-M2 (11 hrs) | Distributions, hypothesis testing, reward estimation | [7] Reward-respecting subtasks (Sutton et al 2023) |
| 7-8 | Probability & Statistics M3-M5 (22 hrs) | Confidence intervals, Bayesian methods, comprehensive synthesis | Synthesis: integrate all 10 readings into a unified research framework |

Total: ~93 hours math courses + 20-25 hours paper reading = 113-118 hours over 8 weeks


Sutton's 10 Foundational Readings

This reading list comes directly from Sutton's recommendations for understanding his RL approach to AI (Oct 2023). The list below follows his recommended order and includes his commentary on why each reading matters.

[1] Reinforcement Learning: An Introduction

  • Sutton & Barto (2018), 2nd Edition
  • Status: I've already read most of this; will review chapters as needed
  • Why it matters (Sutton's perspective): "This is the best starting point and reference. It is also the best reference for deeply understanding the reasons behind the core algorithms."
  • Key concepts: Discrete states, tabular agents, function approximation, basic planning (Chapter 8)
  • Review during: Weeks 1-8 as needed

[2] Learning to Predict by the Methods of Temporal Differences

  • Sutton (1988)
  • Why it matters: Introduces TD(λ), "the one algorithm that I feel we all must understand completely"
  • Sutton's note: "Of course TD(λ) is also presented in [1], but it is spread across Chapters 6 and 12, and besides, sometimes going back in time is a fun way to learn"
  • Related resource: Sutton provides a video on TD learning (included in that document)
  • Read during: Week 1; establish foundational understanding of temporal difference learning
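
To make the TD idea concrete before (re)reading the paper, here's a minimal tabular sketch of the TD(λ) update with accumulating eligibility traces. This is my own NumPy illustration, not code from the paper; the paper itself develops TD(λ) for linear function approximation.

import numpy as np

def td_lambda_update(V, z, s, r, s_next, alpha=0.1, gamma=0.99, lam=0.9):
    """One tabular TD(lambda) update with accumulating eligibility traces."""
    delta = r + gamma * V[s_next] - V[s]   # TD error for the observed transition
    z *= gamma * lam                        # decay all eligibility traces
    z[s] += 1.0                             # mark the current state as eligible
    V += alpha * delta * z                  # push credit back to recently visited states
    return V, z

# lam=0 recovers one-step TD(0); lam=1 approaches a Monte Carlo-style update.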

[3] Adapting Bias by Gradient Descent: An Incremental Version of Delta-Bar-Delta

  • Sutton (1992)
  • Why it matters: "The key IDBD paper on step-size optimization" - essential content not in the RL textbook
  • Significance: Precursor to modern adaptive learning rate methods
  • Read during: Week 2; pair with understanding of gradient descent
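
As a memory aid, here is my rough NumPy sketch of the IDBD update for a linear predictor, reconstructed from how I understand the paper (per-weight step sizes alpha_i = exp(beta_i), a meta step size theta, and a trace h of recent weight updates). Treat it as an approximation to check against the paper, not a faithful implementation.

import numpy as np

def idbd_update(w, beta, h, x, y, theta=0.01):
    """One IDBD step for a linear predictor y_hat = w @ x (illustrative sketch)."""
    delta = y - w @ x                                  # prediction error
    beta += theta * delta * x * h                      # meta-learn each log step size
    alpha = np.exp(beta)                               # per-weight step sizes
    w += alpha * delta * x                             # LMS-style weight update
    h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x
    return w, beta, h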

[4] Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning

  • Sutton, Precup, & Singh (1999)
  • Why it matters: "The options paper on temporal abstraction" - essential content not in the RL textbook
  • Foundation for: Hierarchical RL, multi-timescale decision-making, GVFs
  • Read during: Week 2; connects to linear transformations and eigenvalue concepts
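
The framework's core object is simple to state: an option is a triple of an initiation set, an internal policy, and a termination condition. A tiny illustrative encoding (my own hypothetical gridworld example, not from the paper):

from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation: Set[int]                  # states where the option may be invoked
    policy: Callable[[int], int]          # state -> primitive action while the option runs
    termination: Callable[[int], float]   # beta(s): probability of stopping in state s

# Hypothetical "walk right to the doorway" option in a small gridworld
doorway = Option(
    initiation={0, 1, 2},
    policy=lambda s: 1,                               # action 1 = move right (placeholder)
    termination=lambda s: 1.0 if s == 3 else 0.0,     # stop once the doorway is reached
)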

[5] Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction

  • Sutton, Modayil, Delp, Degris, Pilarski, White, & Precup (2011)
  • Why it matters: "The best first paper on GVFs" - essential content not in the RL textbook
  • Significance: Demonstrates GVF architecture applied to real robotics
  • Key concept: Generalized Value Functions as the core architecture
  • Read during: Week 3; after learning about eigenvalues and dimensionality reduction
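
The core idea is that one stream of sensorimotor experience can train many predictions in parallel, each a general value function with its own cumulant and termination. A simplified on-policy sketch of that idea (Horde itself uses off-policy GTD(λ) learners, and the sensor names here are made up):

import numpy as np

class GVF:
    """One general value function: predicts a discounted sum of a cumulant signal."""
    def __init__(self, n_features, cumulant, gamma):
        self.w = np.zeros(n_features)
        self.cumulant = cumulant          # function of the latest observation
        self.gamma = gamma                # per-prediction discount/termination

    def update(self, x, x_next, obs_next, alpha=0.1):
        c = self.cumulant(obs_next)
        delta = c + self.gamma * self.w @ x_next - self.w @ x
        self.w += alpha * delta * x

# A "horde" is just many such predictors fed the same experience stream.
horde = [
    GVF(8, cumulant=lambda obs: obs["light"], gamma=0.9),        # "how much light soon?"
    GVF(8, cumulant=lambda obs: float(obs["bump"]), gamma=0.0),  # "will I bump next step?"
]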

[6] The Alberta Plan for AI Research

  • Sutton, Bowling, & Pilarski (2023)
  • Why it matters: "After [1-5] this plan should make sense to you"
  • Scope: 50-year research roadmap unifying all previous work toward intelligent agents
  • Relevance to my plan: Step 12 (Intelligence Amplification) directly addresses my capstone direction
  • Sutton's context: "The one missing bit from [1-5] is advanced planning"
  • Read during: Week 3 (initial read) and Weeks 7-8 (deep re-read with full math foundation)

[7] Reward-Respecting Subtasks for Model-Based Reinforcement Learning

  • Sutton, Machado, Holland, Timbers, Tanner, & White (2023)
  • Why it matters: Part of Sutton's discussion of planning; "comes close to being a complete tabular-ish prototype AI"
  • Focus: Constructing task hierarchies that respect reward structure
  • Read during: Week 6; pairs with probability/statistics understanding of reward functions

[8] Gaps in the Foundations of Planning with Approximation (video)

  • Sutton (2021)
  • Why it matters: "On planning I recommend the video [8]"
  • Content: Identifies open problems and foundational issues in planning with function approximation
  • Format: Video (easier engagement than dense papers)
  • Watch during: Week 5; contextualizes current state of the field

[9] Toward a New Approach to Model-based Reinforcement Learning

  • Sutton (2020)
  • Why it matters: Sutton's unpublished perspective; "if you want more, consider the unpublished paper [9]"
  • Focus: Vision for next-generation model-based RL beyond current approaches
  • Read during: Week 5; complements video [8] on planning foundations

[10] Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping

  • Sutton, Szepesvári, Geramifard, & Bowling (2008)
  • Why it matters: "The Linear Dyna paper [10]" - practical integration of planning and learning
  • Focus: How to combine model-based planning with function approximation
  • Read during: Week 4; pairs with calculus understanding of optimization and gradient descent
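
To give the flavor: the agent learns a linear model (F for expected next features, b for expected reward) and then runs TD-like planning updates on hypothetical feature vectors. A simplified sketch of one planning step as I understand it, omitting the prioritized sweeping machinery:

import numpy as np

def linear_dyna_planning_step(theta, F, b, x, alpha=0.1, gamma=0.99):
    """One model-based planning update on a hypothetical feature vector x (sketch).

    Model: expected next features ~= F @ x, expected reward ~= b @ x.
    """
    delta = b @ x + gamma * theta @ (F @ x) - theta @ x   # model-generated TD error
    return theta + alpha * delta * x                      # update the value weights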

Course Details

Course 1: Linear Algebra for Machine Learning and Data Science

Duration: 34 hours (Weeks 1-3, ~11 hrs/week)

Key Topics:
  • Vectors and matrices: representing data, properties (rank, singularity, linear independence)
  • Matrix operations: dot product, inverse, determinants
  • Linear transformations and their interpretation
  • Eigenvalues and eigenvectors (essential for dimensionality reduction and GVFs)
  • Principal Component Analysis (PCA)
  • Applications to machine learning problems

Why this matters for Sutton papers:
  • Value functions are vectors in state space
  • Policy iteration uses matrix operations on transition dynamics
  • Eigenvalues relate to stationary distributions and convergence
  • GVFs and Horde architecture use projections (linear transformations)
  • PCA provides intuition for dimensionality reduction in feature learning
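
For example, the stationary distribution of a Markov chain lives in the eigenstructure of its transition matrix, which is exactly the kind of connection this course makes concrete for RL. A quick NumPy check on a toy 3-state chain (numbers made up):

import numpy as np

# Transition matrix for a 3-state Markov chain (rows sum to 1)
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

eigvals, eigvecs = np.linalg.eig(P.T)    # left eigenvectors of P
i = np.argmin(np.abs(eigvals - 1.0))     # eigenvalue closest to 1
stationary = np.real(eigvecs[:, i])
stationary /= stationary.sum()           # normalize to a probability distribution
print(stationary)                        # long-run state occupancy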

Course 2: Calculus for Machine Learning and Data Science

Duration: 26 hours (Weeks 4-5, ~13 hrs/week)

Key Topics:
  • Derivatives and gradients of functions
  • Analytic and approximate optimization
  • Chain rule and backpropagation
  • Gradient descent and variants
  • Loss functions and neural network training

Why this matters for Sutton papers:
  • TD learning uses gradient descent on value function approximation
  • Semi-gradient methods (C3 course review + Dyna paper)
  • Optimization of planning computations
  • Neural network function approximation (modern RL systems)
  • Understanding convergence properties
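
The same machinery shows up everywhere from semi-gradient TD to neural network training. A toy example (my own illustration) of minimizing a mean squared error with a hand-computed gradient:

import numpy as np

# Toy linear regression by gradient descent: the same chain-rule machinery,
# just on a simpler loss than a value function or a neural network.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
alpha = 0.05
for _ in range(500):
    error = X @ w - y
    grad = X.T @ error / len(y)   # gradient of 0.5 * mean squared error
    w -= alpha * grad             # descend the gradient
print(w)                          # should land close to true_w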

Course 3: Probability & Statistics for Machine Learning & Data Science

Duration: 33 hours (Weeks 6-8, ~11 hrs/week)

Key Topics:
  • Probability distributions and their properties
  • Hypothesis testing and confidence intervals
  • Bayesian inference and Bayesian statistics
  • Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP)
  • A/B testing and statistical comparison methods
  • Exploratory Data Analysis

Why this matters for Sutton papers:
  • Reward estimation as probabilistic inference
  • Uncertainty in value function approximation
  • Bayesian methods for exploration-exploitation
  • Statistical validation of RL agent improvements
  • Healthcare/security domain requires rigorous statistical methodology
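
As a small example of the statistical footing this gives: estimating a policy's mean reward with a normal-approximation 95% confidence interval (simulated data, numbers made up):

import numpy as np

rng = np.random.default_rng(1)
rewards = rng.normal(loc=0.7, scale=1.0, size=200)   # simulated returns from one policy

mean = rewards.mean()
sem = rewards.std(ddof=1) / np.sqrt(len(rewards))    # standard error of the mean
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"mean reward {mean:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")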


Reading Strategy for Sutton Papers

I have a small git-based workflow set up to manage the papers and build my literature review. I'm hoping to do a separate blog post on that workflow and publish the repo. The workflow goes like this:

Workflow Overview

  1. Add papers to Zotero - Store all 10 Sutton papers in "D.Eng Literature Review" collection
  2. Create structured notes - Use make note TITLE="..." (CLI interface) to scaffold notes from the template
  3. Fill note template - Problem, Method, Findings, Limitations, Relevance, Replication
  4. Update syntheses - Biweekly, synthesize findings into thematic documents
  5. Build documents - Monthly, generate PDF/DOCX with full citations

Setup (One-time)

In Zotero:

  1. Create a collection: "D.Eng Literature Review"
  2. Install Better BibTeX
  3. Right-click collection → Export → Better BibLaTeX
  4. Check ☑ "Keep updated"
  5. Export to: /Users/keith/code/litreview/bib/library.bib

In litreview repo:

cd litreview
make install        # Install dependencies
make sync-zotero    # Verify Better BibTeX setup

Weekly: Read Papers & Create Notes

For each paper:

cd litreview
make note TITLE="Sutton & Barto (2018) - Reinforcement Learning: An Introduction"
# Creates: notes/sutton-barto-2018-reinforcement-learning-an-introduction.md

Fill in the structured template:

  • Problem: What research question does this paper address?
  • Method: What approach/algorithms are used?
  • Findings: Key results and conclusions
  • Limitations: Acknowledged weaknesses or open questions
  • Relevance to D.Eng: How does this connect to research direction? (Intelligence Amplification, healthcare IAM)
  • Replication: Could I implement this? Any code references?

Example note structure:

# Sutton & Barto (2018) - Reinforcement Learning: An Introduction

## Problem
Comprehensive foundation for RL: how do agents learn from interaction?

## Method
- Value functions and Bellman equations
- Dynamic programming approaches
- Temporal difference learning
- Function approximation

## Findings
...

## Limitations
...

## Relevance to D.Eng
- Core reference for entire capstone
- Establishes foundational concepts for Alberta Plan
- TD learning basis for GVF architecture

## Replication
- Implementation in learnrl/ package
- Focus on understanding value function geometry

Biweekly: Synthesis & Integration

Update synthesis documents that integrate papers thematically:

# Edit main synthesis
vim litreview/syntheses/literature_review.md

# Or create theme-specific syntheses
vim litreview/syntheses/sutton-research-arc.md

Key themes to synthesize:

  1. Foundations (1988-1992): TD learning → gradient descent → incremental learning
  2. Temporal Abstraction (1999-2008): Options → Dyna → hierarchical RL
  3. GVF Architecture (2011): Horde as realization of abstract ideas
  4. Modern Vision (2020-2023): Gaps, model-based RL, reward-respecting subtasks
  5. Alberta Plan (2023): Integration and 50-year roadmap

Link to notes in syntheses:

Sutton's foundational [TD learning paper](../notes/sutton-1988.md) established
the core learning mechanism used throughout all subsequent work.

Monthly: Build & Commit

Generate formatted documents with citations:

cd litreview
make build
# Creates:
# - docs/literature_review.pdf
# - docs/literature_review.docx

Commit work:

git add notes/ syntheses/ bib/
git commit -m "Add notes for Sutton papers weeks 1-3"

Output Structure

Notes and syntheses will be organized in the litreview repo:

/Users/keith/code/litreview/
├── notes/                        # Individual paper notes
│   ├── sutton-barto-2018.md
│   ├── sutton-1988-td-learning.md
│   ├── sutton-1992-gradient-descent.md
│   ├── sutton-precup-singh-1999-options.md
│   ├── sutton-modayil-2011-horde.md
│   ├── sutton-szepesvari-2008-dyna.md
│   ├── sutton-2020-model-based-rl.md
│   ├── sutton-2021-gaps-video.md
│   ├── sutton-machado-2023-reward-respecting.md
│   └── sutton-bowling-pilarski-2023-alberta-plan.md
├── syntheses/
│   ├── literature_review.md      # Master document
│   └── sutton-research-arc.md    # Thematic integration (Weeks 7-8)
├── bib/
│   └── library.bib               # Auto-exported from Zotero
└── docs/
    ├── literature_review.pdf     # Generated monthly
    └── literature_review.docx    # Generated monthly

Synthesis & Outcomes

Weeks 7-8: Integration Document

By the end of Week 6 I should have detailed notes on all 10 papers. Weeks 7-8 focus on creating a comprehensive synthesis document that shows:

  1. Historical arc: How each paper built on previous work
     • 1988-1992: TD learning foundations
     • 1999-2011: Temporal abstraction (Options → GVFs)
     • 2008: Reconciling planning with function approximation
     • 2020-2023: Modern perspective and open problems
     • 2023: Alberta Plan as unifying vision

  2. Key concepts across papers:
     • Value functions and their approximation
     • Temporal abstraction and hierarchical learning
     • Generalized Value Functions (GVFs) as core architecture
     • Learning from unsupervised interaction
     • Planning and model-based learning reconciliation

  3. Mathematical foundations:
     • Bellman equations and their operator properties
     • Gradient descent in non-stationary environments
     • Eigenstructure of transition matrices
     • Bayesian interpretation of learning

  4. Path to Intelligence Amplification (Alberta Plan Step 12):
     • How do GVFs provide explainability?
     • How does reward-respecting structure support safe RL?
     • How does continual learning connect to non-stationary healthcare IAM?

Outcomes at Week 8

Mathematical Mastery:
  • Fluency with linear algebra (vectors, matrices, transformations, eigenvalues)
  • Strong calculus foundation (gradients, optimization, chain rule)
  • Rigorous statistics and probability for ML applications
  • Hands-on Python implementations in all three courses

Research Foundation:
  • Deep understanding of Sutton's 40-year research program
  • Clear narrative: TD learning → temporal abstraction → GVFs → Alberta Plan
  • Recognition of how classical RL concepts scale to modern systems
  • Appreciation for open problems in planning + approximation

D.Eng. Readiness:
  • Mathematical sophistication for doctoral coursework
  • Comprehensive literature foundation in core RL concepts
  • Research vision grounded in Alberta Plan framework
  • Prepared to define dissertation research direction in Intelligence Amplification

Artifacts:
  • Three Coursera specialization certificates
  • Detailed notes on 10 seminal papers
  • Integration document showing Sutton's research program
  • Clear understanding of how mathematics supports RL theory


Next Steps (After Week 8)

  • January 2026: D.Eng. program begins
  • Spring 2026: Capstone project?
  • Year 1 D.Eng.: Determine research topic; prepare dissertation proposal
  • Years 2-3 D.Eng.: Dissertation research

Resources

Courses

The math track is the DeepLearning.AI Mathematics for Machine Learning and Data Science specialization on Coursera:

  • Course 1: Linear Algebra for Machine Learning and Data Science
  • Course 2: Calculus for Machine Learning and Data Science
  • Course 3: Probability & Statistics for Machine Learning & Data Science

Sutton's 10 Foundational Readings (Full Citations)

From Sutton's Reading List for His RL Approach to AI (Oct 2023)

  • [1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
  • Note: Sutton recommends the linked online version, which is slightly more up to date than the printed version

  • [2] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9-44.

  • Note: Sutton also recommends a video on TD learning; available in his reading list document

  • [3] Sutton, R. S. (1992). Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92).

  • [4] Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2), 181-211.

  • [5] Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., & Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2011).

  • [6] Sutton, R. S., Bowling, M., & Pilarski, P. M. (2023). The Alberta plan for AI research. arXiv preprint arXiv:2208.11173.

  • [7] Sutton, R. S., Machado, M. C., Holland, G. Z., Timbers, F., Tanner, B., & White, A. (2023). Reward-respecting subtasks for model-based reinforcement learning. In Learning for Dynamics and Control Conference.

  • [8] Sutton, R. S. (2021). Gaps in the foundations of planning with approximation. Video lecture.

  • [9] Sutton, R. S. (2020). Toward a new approach to model-based reinforcement learning. Unpublished manuscript. DeepMind Alberta.

  • [10] Sutton, R. S., Szepesvári, C., Geramifard, A., & Bowling, M. (2008). Dyna-style planning with linear function approximation and prioritized sweeping. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI 2008).