What is Attention?
A mechanism that lets a model focus on the most relevant parts of its input when producing an output, weighting what matters most in context.
Attention explained in plain English
Attention is a mechanism that lets a model focus on the most relevant parts of its input when producing an output. Instead of treating every word or pixel equally, it weighs what matters most in context.
It eased a major bottleneck in processing long sequences, where earlier recurrent models had to squeeze the whole input into a single fixed-size summary, and it enabled the Transformer architecture.
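The weighting idea can be sketched in a few lines of plain Python. This is a minimal illustration of one common variant (scaled dot-product attention for a single query), not how any particular model implements it. The function names `softmax` and `attend` and the toy vectors are assumptions made up for this example; in a real model the queries, keys, and values are learned.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (illustrative sketch)."""
    scale = math.sqrt(len(query))
    # Score each input position by how well its key matches the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy, hand-picked vectors (hypothetical, not learned): three input positions.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
print(weights)  # positions whose keys align with the query get larger weights
print(output)
```

Here the first and third positions match the query better than the second, so they receive more weight and contribute more to the output.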
Analogy
Attention is like reading a sentence and instinctively emphasising the words that carry the meaning, while barely registering filler words. Your focus shifts depending on what the sentence is actually about.
Example
When translating "The bank by the river" versus "The bank approved the loan," an attention-based model weights different context words ("river" in the first phrase, "approved" and "loan" in the second) to resolve which sense of "bank" is meant.
How is Attention used?
Attention is the foundation of Transformers — the architecture behind GPT, Claude, and most modern language models. It also helps translation systems align words across languages.
Common misconceptions about Attention
Attention is not human-like focus or consciousness — it is a mathematical weighting scheme over inputs.
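For readers who want to see the weighting scheme written out, the form used in the Transformer (scaled dot-product attention) is usually given as below. The symbols are the conventional ones and do not appear elsewhere in this entry: Q, K, and V are matrices of queries, keys, and values, and d_k is the dimension of each key.

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

The softmax turns the match scores into weights that sum to 1, so the result is simply a weighted average of the values.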
People also read
- auto-regressive model
A model that infers a prediction based on its own previous predictions.
- autoencoder
A system that learns to extract the most important information from the input.
- depth
The sum of the following in a neural network: the number of hidden layers, the number of output layers (typically 1), and the number of any embedding layers. For example, a neural network with five hidden layers and one output layer has a depth of 6.
- embedding layer
A special hidden layer that trains on a high-dimensional categorical feature to gradually learn a lower dimension embedding vector.
- embedding vector
Broadly speaking, an array of floating-point numbers taken from any hidden layer that describe the inputs to that hidden layer.
- generative AI
An emerging transformative field with no formal definition.
- Long Short-Term Memory
A type of cell in a recurrent neural network used to process sequences of data in applications such as handwriting recognition, machine translation, and image captioning.
- mixture of experts
A scheme to increase neural network efficiency by using only a subset of its parameters (known as an expert) to process a given input token or example.
- Neural Architecture Search
A technique for automatically designing the architecture of a neural network.
- pooling
Reducing a matrix (or matrices) created by an earlier convolutional layer to a smaller matrix.