LLM Fundamentals: How They Actually Work (and Where They Break)
AI Fundamentals

LLM Fundamentals: How They Actually Work (and Where They Break)

A no-jargon explainer of how large language models work in 2026 — tokens, transformers, training, inference — and the failure modes every practitioner should expect.

By AITraining2U Editorial Team 2025-10-28 10 min read
LLM fundamentals 2026 — tokens, transformers, training, inference, failure modes

If you have used Claude, ChatGPT, or Gemini, you have used a large language model. If you are about to deploy one in your organisation, you should understand a little about how it actually works — not the deep mathematics, but enough mental model to spot when something is going wrong before it costs you money.

This article is the explainer we use with corporate teams who are about to commission their first AI workflow. It is the minimum viable understanding required to ask sensible questions in vendor meetings and not get sold something that does not work.

1. Tokens — the unit of everything

An LLM does not read characters or words. It reads tokens. A token is roughly four characters of English text — sometimes a whole word, sometimes a piece of one ("artific" + "ial"), sometimes a punctuation mark. The Malay word "kerajaan" might be one token or three depending on the model's tokeniser.

Three practical implications. First, you pay providers per token, not per word. Second, the model's output is generated one token at a time, which is why long responses take longer than short ones. Third, the model's "context window" — how much it can read at once — is measured in tokens. Claude's 200K context is roughly 150,000 English words, or one moderately long novel.

2. The transformer — what is happening inside

Modern LLMs are built on the transformer architecture, introduced in the 2017 paper "Attention Is All You Need" (co-authored by Aidan Gomez, now CEO of Cohere). The core mechanic is attention: at each step, the model decides which earlier tokens are most relevant to predicting the next token.

You do not need to understand the linear algebra to use LLMs effectively. You do need to understand one consequence: the model is, fundamentally, a very good next-token predictor. It has no consciousness, no goals, no plans beyond the immediate next token. The appearance of reasoning, planning, and agency emerges from very large numbers of these next-token predictions strung together — sometimes spectacularly well, sometimes badly. We will return to this.

3. Training — where the "intelligence" comes from

How an LLM is trained (2026)

How an LLM is trained (2026) L3Reasoning training (optional)Reinforcement learning on reasoning traces. Behind o3, R1, Claude extended thinking. Adds the "think before answering" capability for hardest tasks.L2Post-trainingSupervised fine-tuning + RLHF/RLAIF. Turns a raw text-completion engine into a useful assistant. Most quality differences between Claude/GPT/Gemini come from this layer.L1Pre-trainingTrillions of tokens of text. The model learns next-token prediction broadly. Costs tens to hundreds of millions of dollars; produces broad knowledge across domains.

An LLM is trained in two main phases.

Pre-training is the expensive phase. The model is shown trillions of tokens of text and learns to predict the next token. This is what gives it broad knowledge — physics, history, code, languages, the structure of legal documents. Pre-training a frontier model in 2025–2026 costs tens to hundreds of millions of dollars and requires specialised compute infrastructure (GPU clusters, custom networking, extensive safety review).

Post-training is the cheaper but vital phase that turns a raw text-completion engine into a useful assistant. Includes supervised fine-tuning (showing the model pairs of question and good answer), RLHF or RLAIF (reinforcement learning from human or AI feedback), and increasingly reasoning training (where the model is taught to think step-by-step before answering). The differences between Claude, GPT, and Gemini in feel and quality come mostly from post-training choices, not from the underlying architecture.

4. Inference — what happens when you call the API

From prompt to response: the inference flow

From prompt to response: the inference flow 1TokensInput encoding
The model breaks your prompt into tokens (~4 characters each). "artificial intelligence" might become 3 tokens. You pay per token, both in and out.
2AttentionTransformer pass
The transformer processes all tokens in parallel, computing "attention" over which earlier tokens matter most for predicting the next one.
3ProbabilitiesDistribution over vocab
The model outputs a probability for every token in its vocabulary as the next likely token. Tens of thousands of options, ranked.
4SamplePick one token
Temperature controls randomness. T=0 picks the most likely; higher T samples from the distribution. The chosen token is appended.
5RepeatUntil done
The new token becomes part of the input. Repeat until the model emits a stop token or hits the max-tokens limit. This is why long outputs take longer.

When you send a prompt to a frontier model, the model runs inference: it processes your prompt as input tokens, then generates output tokens one at a time, sampling from the probability distribution it computes for each next token. The randomness can be controlled with the temperature setting — lower temperature for deterministic tasks, higher for creative ones.

Three practical things to know about inference:

  • Latency matters. Long prompts take longer to process. Long outputs take longer to generate. For real-time use cases, choose a smaller and faster model where possible.
  • Cost scales with both input and output tokens. Output tokens are typically more expensive than input. Anthropic, OpenAI, and Google publish per-million-token pricing — model your monthly cost before deploying anything at scale.
  • Determinism is not the default. Even at temperature 0, identical prompts can produce different outputs. Build for variance.

5. Where they break — the failure modes you must know

The 5 LLM failure modes every team must know

The 5 LLM failure modes every team must know 1Hallucination
Confident but wrong. The model is a next-token predictor, not a fact retriever. Mitigations: RAG, citation requirements, explicit uncertainty prompting.
2Outdated knowledge
Pre-training has a cutoff. Anything after is unknown unless you provide it via context. Asking about recent events without context produces fiction.
3Prompt injection
Adversarial input in a document or message overrides original instructions. Defining attack pattern for AI agents in 2026.
4Lost in the middle
Even with 200K-token context, performance often degrades when relevant info is buried in the middle of a long input.
5Sycophancy
RLHF optimises for helpfulness — and can drift toward telling users what they want to hear. Critical evaluation requires explicit prompting.

The five categories of LLM failure that hit production deployments most often:

Hallucinations. The model produces a confident, plausible-sounding answer that is factually wrong. The single biggest risk in deploying LLMs without guardrails. The model does not know what it does not know. Mitigations include retrieval-augmented generation (RAG), explicit citation requirements, and human review of consequential outputs.

Outdated knowledge. The model's pre-training has a cutoff date. Everything after that date is unknown to the model unless you provide it via context. Asking Claude about events last week without context will produce confident-sounding fiction.

Prompt injection. A security failure mode where adversarial input (in a document the model reads, or a user message) overrides the original instructions. The defining attack pattern for AI agents in 2026.

Drift on long contexts. Even with 200K-token context windows, model performance often degrades when relevant information is buried in the middle of a long input. Known as the "lost in the middle" problem.

Sycophancy and over-confidence. Models are trained on RLHF, which optimises for helpfulness — and can drift toward telling users what they want to hear. Critical evaluation requires explicit prompting ("evaluate critically and disagree where appropriate").

6. Choosing a model in 2026

The major frontier model families in mid-2026:

  • Claude (Anthropic) — Sonnet for most workloads, Opus for hardest reasoning. Strong tool use and long-context performance.
  • GPT (OpenAI) — GPT-5 family. Strong general performance, broad ecosystem, native reasoning models (o3 family).
  • Gemini (Google) — strong multimodal and search-grounded tasks; especially competitive on long context and Google Workspace integration.
  • Open-weight models — Llama (Meta), Qwen (Alibaba), DeepSeek, Mistral. Necessary when data residency or cost requires self-hosting.

Pick based on evaluation against your specific task — not on benchmark headlines. Vellum's LLM Leaderboard is one useful starting point but should not substitute for your own evals.

7. Where to go next

Two natural next steps once the fundamentals are in place. Embeddings and vector databases covers how models represent meaning, which is foundational for retrieval-augmented systems. Production RAG patterns covers how to build systems that ground LLM outputs in your own data — the single most common pattern in enterprise AI in 2026.

For Malaysian teams ready to apply this in practice, our AI Engineering programme covers production LLM engineering end-to-end, HRDC SBL-KHAS claimable for eligible employers.

About the author

AITraining2U Editorial Team →

HRDC-Certified · Practitioner-Led · Malaysia & SEA

The AITraining2U Editorial Team is a working group of practitioners — instructors, working consultants, and HRDC-certified trainers — who collectively deliver AI training to Malaysian organisations across financial services, technology, professional services, and the public sector. Articles attributed to the Editorial Team draw on consolidated learnings from live programmes, corporate engagements, and regional industry research.

Sources & References

All references checked at time of publication. AITraining2U is not affiliated with the cited sources.

Frequently Asked Questions

AI is the umbrella term for any system that performs tasks normally requiring human reasoning. LLMs are one specific type of AI — large language models trained primarily on text. Other AI types include image classifiers, recommendation engines, and reinforcement learning agents. Most enterprise AI deployments in 2026 use LLMs at the core but combine them with other components.

Because they are next-token predictors trained on patterns in their training data. They produce statistically plausible continuations of text, not retrieved facts. When asked something they were not trained on (or have forgotten), they generate something that sounds right rather than admitting uncertainty. Mitigations include retrieval-augmented generation, citation requirements, and explicit uncertainty prompting.

Roughly four characters of English text — sometimes a whole word like 'cat', sometimes a piece of a word like 'tion', sometimes punctuation. A short blog post is about 1,000 tokens. Claude's 200K context window holds about 150,000 English words. You pay providers per token, not per word.

Frontier models (Claude, GPT, Gemini) for most use cases — they are higher quality, faster to iterate, and cheaper than self-hosting equivalent capability. Self-hosted open-weight models (Llama, Qwen, DeepSeek) when data residency requirements forbid sending data to US providers, when you have specialised compliance needs, or when you have predictable high-volume workloads where the per-token economics favour your own infrastructure.

Yes. AITraining2U's AI Engineering and AI Agentic Automation programmes — covering the full LLM stack from fundamentals through production deployment — are HRDC SBL-KHAS claimable for eligible Malaysian employers.

Want to apply this in your organisation?

AITraining2U runs HRDC-claimable corporate AI training for Malaysian organisations — from leadership awareness to hands-on builder workshops. Talk to us about a programme tailored to your team.