A brief introduction to the seminal Transformer paper.
Reading the GPT-1 paper.
A groundbreaking and efficient training approach from DeepSeek.
How reasoning is evoked in large language models.
A neural net optimizer.
Paper "Attention Is All You Need"
Paper "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context"
Paper "RoFormer: Enhanced Transformer with Rotary Position Embedding"
Paper "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation"
Paper "Improving language models by retrieving from trillions of tokens"
Paper "Compressive Transformers for Long-Range Sequence Modelling"
Paper "Language Models are Unsupervised Multitask Learners"
Paper "GLU Variants Improve Transformer"
Paper "Generalization through Memorization: Nearest Neighbor Language Models"
Paper "Addressing Some Limitations of Transformers with Feedback Memory"