If you have used Claude, ChatGPT, or Gemini, you have used a large language model. If you are about to deploy one in your organisation, you should understand a little about how it actually works — not the deep mathematics, but enough mental model to spot when something is going wrong before it costs you money.
This article is the explainer we use with corporate teams who are about to commission their first AI workflow. It is the minimum viable understanding required to ask sensible questions in vendor meetings and not get sold something that does not work.
1. Tokens — the unit of everything
An LLM does not read characters or words. It reads tokens. A token is roughly four characters of English text — sometimes a whole word, sometimes a piece of one ("artific" + "ial"), sometimes a punctuation mark. The Malay word "kerajaan" might be one token or three depending on the model's tokeniser.
Three practical implications. First, you pay providers per token, not per word. Second, the model's output is generated one token at a time, which is why long responses take longer than short ones. Third, the model's "context window" — how much it can read at once — is measured in tokens. Claude's 200K context is roughly 150,000 English words, or one moderately long novel.
2. The transformer — what is happening inside
Modern LLMs are built on the transformer architecture, introduced in the 2017 paper "Attention Is All You Need" (co-authored by Aidan Gomez, now CEO of Cohere). The core mechanic is attention: at each step, the model decides which earlier tokens are most relevant to predicting the next token.
You do not need to understand the linear algebra to use LLMs effectively. You do need to understand one consequence: the model is, fundamentally, a very good next-token predictor. It has no consciousness, no goals, no plans beyond the immediate next token. The appearance of reasoning, planning, and agency emerges from very large numbers of these next-token predictions strung together — sometimes spectacularly well, sometimes badly. We will return to this.
3. Training — where the "intelligence" comes from
How an LLM is trained (2026)
An LLM is trained in two main phases.
Pre-training is the expensive phase. The model is shown trillions of tokens of text and learns to predict the next token. This is what gives it broad knowledge — physics, history, code, languages, the structure of legal documents. Pre-training a frontier model in 2025–2026 costs tens to hundreds of millions of dollars and requires specialised compute infrastructure (GPU clusters, custom networking, extensive safety review).
Post-training is the cheaper but vital phase that turns a raw text-completion engine into a useful assistant. Includes supervised fine-tuning (showing the model pairs of question and good answer), RLHF or RLAIF (reinforcement learning from human or AI feedback), and increasingly reasoning training (where the model is taught to think step-by-step before answering). The differences between Claude, GPT, and Gemini in feel and quality come mostly from post-training choices, not from the underlying architecture.
4. Inference — what happens when you call the API
From prompt to response: the inference flow
When you send a prompt to a frontier model, the model runs inference: it processes your prompt as input tokens, then generates output tokens one at a time, sampling from the probability distribution it computes for each next token. The randomness can be controlled with the temperature setting — lower temperature for deterministic tasks, higher for creative ones.
Three practical things to know about inference:
- Latency matters. Long prompts take longer to process. Long outputs take longer to generate. For real-time use cases, choose a smaller and faster model where possible.
- Cost scales with both input and output tokens. Output tokens are typically more expensive than input. Anthropic, OpenAI, and Google publish per-million-token pricing — model your monthly cost before deploying anything at scale.
- Determinism is not the default. Even at temperature 0, identical prompts can produce different outputs. Build for variance.
5. Where they break — the failure modes you must know
The 5 LLM failure modes every team must know
The five categories of LLM failure that hit production deployments most often:
Hallucinations. The model produces a confident, plausible-sounding answer that is factually wrong. The single biggest risk in deploying LLMs without guardrails. The model does not know what it does not know. Mitigations include retrieval-augmented generation (RAG), explicit citation requirements, and human review of consequential outputs.
Outdated knowledge. The model's pre-training has a cutoff date. Everything after that date is unknown to the model unless you provide it via context. Asking Claude about events last week without context will produce confident-sounding fiction.
Prompt injection. A security failure mode where adversarial input (in a document the model reads, or a user message) overrides the original instructions. The defining attack pattern for AI agents in 2026.
Drift on long contexts. Even with 200K-token context windows, model performance often degrades when relevant information is buried in the middle of a long input. Known as the "lost in the middle" problem.
Sycophancy and over-confidence. Models are trained on RLHF, which optimises for helpfulness — and can drift toward telling users what they want to hear. Critical evaluation requires explicit prompting ("evaluate critically and disagree where appropriate").
6. Choosing a model in 2026
The major frontier model families in mid-2026:
- Claude (Anthropic) — Sonnet for most workloads, Opus for hardest reasoning. Strong tool use and long-context performance.
- GPT (OpenAI) — GPT-5 family. Strong general performance, broad ecosystem, native reasoning models (o3 family).
- Gemini (Google) — strong multimodal and search-grounded tasks; especially competitive on long context and Google Workspace integration.
- Open-weight models — Llama (Meta), Qwen (Alibaba), DeepSeek, Mistral. Necessary when data residency or cost requires self-hosting.
Pick based on evaluation against your specific task — not on benchmark headlines. Vellum's LLM Leaderboard is one useful starting point but should not substitute for your own evals.
7. Where to go next
Two natural next steps once the fundamentals are in place. Embeddings and vector databases covers how models represent meaning, which is foundational for retrieval-augmented systems. Production RAG patterns covers how to build systems that ground LLM outputs in your own data — the single most common pattern in enterprise AI in 2026.
For Malaysian teams ready to apply this in practice, our AI Engineering programme covers production LLM engineering end-to-end, HRDC SBL-KHAS claimable for eligible employers.