Deep Delta / Residual Geometry / Grokking
1. Deep Delta Learning
2. The Delta Rule (Background)
4. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
4. Why Neural Networks Suddenly Start Generalizing
Nested / Multi-Timescale / Meta Learning
5. Introducing Nested Learning – Google Research Blog
6. Learning to Learn by Gradient Descent by Gradient Descent
7. Meta-Learning in Neural Networks: A Survey
Speculative Decoding / Draft Models
8. Accelerating Large Language Model Decoding with Speculative Sampling
9. Fast Inference from Transformers via Speculative Decoding (Original Speculative Decoding Paper)
10. SpecExtend: Scaling Speculative Decoding to Long Contexts
11. Dynamic Depth Decoding for Efficient LLM Inference
EAGLE / Advanced Speculative Heads
12. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
13. From Research to Production: Accelerate OSS LLMs with EAGLE-3 on Vertex AI
Long Context / Memory / RLM-Style Ideas
14. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
15. LongNet: Scaling Transformers to 1,000,000,000 Tokens
N-grams / DeepSeek / Classical Foundations
16. A Tutorial on N-gram Language Models
17. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Optimizer Discovery / RL for Training Rules
18. Learning to Optimize
19. Discovering Optimization Algorithms via Reinforcement Learning
Data / Datasets / Web Corpora
20. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
21. The Common Crawl Dataset
22. The Pile: An 800GB Dataset of Diverse Text for Language Modeling
23. NVIDIA NeMo Data Curation Overview
Flash / Systems / Acceleration
24. dFlash: Fast and Accurate LLM Decoding