Deep Delta / Residual Geometry / Grokking

1. Deep Delta Learning
2. The Delta Rule (Background)
3. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
4. Why Neural Networks Suddenly Start Generalizing

Nested / Multi-Timescale / Meta Learning

5. Introducing Nested Learning – Google Research Blog
6. Learning to Learn by Gradient Descent by Gradient Descent
7. Meta-Learning in Neural Networks: A Survey

Speculative Decoding / Draft Models

8. Accelerating Large Language Model Decoding with Speculative Sampling
9. Fast Inference from Transformers via Speculative Decoding (Original Paper)
10. SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
11. Dynamic Depth Decoding for Efficient LLM Inference

EAGLE / Advanced Speculative Heads

12. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
13. From Research to Production: Accelerate OSS LLMs with EAGLE-3 on Vertex AI

Long Context / Memory / RLM-Style Ideas

14. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
15. LongNet: Scaling Transformers to 1,000,000,000 Tokens

N-grams / DeepSeek / Classical Foundations

16. A Tutorial on N-gram Language Models
17. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Optimizer Discovery / RL for Training Rules

18. Learning to Optimize
19. Discovering Optimization Algorithms via Reinforcement Learning

Data / Datasets / Web Corpora

20. FineWeb: Decanting the Web for the Finest Text Data at Scale
21. The Common Crawl Dataset
22. The Pile: An 800GB Dataset of Diverse Text for Language Modeling
23. NVIDIA NeMo Data Curation Overview

Flash / Systems / Acceleration

24. dFlash: Fast and Accurate LLM Decoding

Earlier Readings

Attention Is All You Need
Seminal paper introducing the Transformer architecture
The Curse of Dimensionality
Pending since my university days