AI Basics Deep Learning Large Language Models Intermediate

Transformer

Pronunciation: /trænsˈfɔːmə/

The neural network architecture that revolutionised AI by enabling models to process entire sequences at once.

Plain English Explanation

The Transformer is a type of neural network architecture introduced in 2017 that processes data using a mechanism called "attention." Instead of reading text word by word in order (like earlier models), Transformers can look at all words in a sentence simultaneously and determine which ones are most relevant to each other.

This parallel processing makes Transformers faster to train and more effective at capturing long-range relationships in text, which is why virtually all modern language models are built on this architecture.
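The attention idea described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the full architecture: the word vectors are made-up values, and the learned query, key, and value projections and the multiple attention heads of a real Transformer are left out. The names `softmax` and `self_attention` are hypothetical helpers chosen for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with no learned weights.

    Assumption for illustration: queries, keys and values are the input
    vectors themselves; a real Transformer applies learned projections first.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word in the sentence at once.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Blend all word vectors, weighted by how relevant each one is.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up 2-dimensional embeddings for a three-word input.
sentence = {"cat": [1.0, 0.0], "sat": [0.0, 1.0], "kitten": [0.9, 0.1]}
outputs, weights = self_attention(list(sentence.values()))

for word, row in zip(sentence, weights):
    print(word, [round(w, 2) for w in row])
```

Each row of `weights` shows how much one word attends to every word in the input. With these example vectors, "cat" gives more weight to "kitten" than to "sat" because their vectors point in similar directions. No word's computation depends on another word's result, which is what allows the whole sequence to be processed in parallel.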

Analogy

Imagine reading a sentence where you can instantly see connections between any two words, no matter how far apart they are — like having X-ray vision for language structure. That is what attention allows.

How is it used?

Transformers are the foundation of GPT, BERT, Claude, Gemini, and virtually every major language model. They are also used in image generation (DALL-E), protein folding (AlphaFold), and speech recognition.

Real-world Example

When Google Translate produces a natural-sounding translation, or when a chatbot maintains context across a long conversation, Transformer architecture is doing the heavy lifting behind the scenes.

Common Misconceptions

Transformers are not limited to language: the same architecture is also applied to images, speech, and protein structures. The name refers to the architecture itself, not to a general ability to transform one thing into another.

History

The Transformer was introduced by Google researchers in the paper "Attention Is All You Need" (Vaswani et al., 2017). It has since replaced recurrent neural networks as the dominant approach for sequence processing.

Related Terms

LLM, GPT

References

Attention Is All You Need