What is the difference between an LLM and AI?

AI is the umbrella term for any system that performs tasks normally requiring human reasoning. LLMs are one specific type of AI — large language models trained primarily on text. Other AI types include image classifiers, recommendation engines, and reinforcement learning agents. Most enterprise AI deployments in 2026 use LLMs at the core but combine them with other components.

Why do LLMs hallucinate?

Because they are next-token predictors trained on patterns in their training data. They produce statistically plausible continuations of text, not retrieved facts. When asked something they were not trained on (or have forgotten), they generate something that sounds right rather than admitting uncertainty. Mitigations include retrieval-augmented generation, citation requirements, and explicit uncertainty prompting.

How big is a token, in plain English?

Roughly four characters of English text — sometimes a whole word like 'cat', sometimes a piece of a word like 'tion', sometimes punctuation. A short blog post is about 1,000 tokens. Claude's 200K context window holds about 150,000 English words. You pay providers per token, not per word.

Should I use a frontier model or self-host an open-weight model?

Frontier models (Claude, GPT, Gemini) for most use cases — they are higher quality, faster to iterate, and cheaper than self-hosting equivalent capability. Self-hosted open-weight models (Llama, Qwen, DeepSeek) when data residency requirements forbid sending data to US providers, when you have specialised compliance needs, or when you have predictable high-volume workloads where the per-token economics favour your own infrastructure.

Is LLM training HRDC claimable?

Yes. AITraining2U's AI Engineering and AI Agentic Automation programmes — covering the full LLM stack from fundamentals through production deployment — are HRDC SBL-KHAS claimable for eligible Malaysian employers.

LLM Fundamentals: How They Actually Work (and Where They Break)

If you have used Claude, ChatGPT, or Gemini, you have used a large language model. If you are about to deploy one in your organisation, you should understand a little about how it actually works — not the deep mathematics, but enough mental model to spot when something is going wrong before it costs you money.

This article is the explainer we use with corporate teams who are about to commission their first AI workflow. It is the minimum viable understanding required to ask sensible questions in vendor meetings and not get sold something that does not work.

1. Tokens — the unit of everything

An LLM does not read characters or words. It reads tokens. A token is roughly four characters of English text — sometimes a whole word, sometimes a piece of one ("artific" + "ial"), sometimes a punctuation mark. The Malay word "kerajaan" might be one token or three depending on the model's tokeniser.

Three practical implications. First, you pay providers per token, not per word. Second, the model's output is generated one token at a time, which is why long responses take longer than short ones. Third, the model's "context window" — how much it can read at once — is measured in tokens. Claude's 200K context is roughly 150,000 English words, or one moderately long novel.

2. The transformer — what is happening inside

Modern LLMs are built on the transformer architecture, introduced in the 2017 paper "Attention Is All You Need" (co-authored by Aidan Gomez, now CEO of Cohere). The core mechanic is attention: at each step, the model decides which earlier tokens are most relevant to predicting the next token.

You do not need to understand the linear algebra to use LLMs effectively. You do need to understand one consequence: the model is, fundamentally, a very good next-token predictor. It has no consciousness, no goals, no plans beyond the immediate next token. The appearance of reasoning, planning, and agency emerges from very large numbers of these next-token predictions strung together — sometimes spectacularly well, sometimes badly. We will return to this.

3. Training — where the "intelligence" comes from

How an LLM is trained (2026)

An LLM is trained in two main phases.

Pre-training is the expensive phase. The model is shown trillions of tokens of text and learns to predict the next token. This is what gives it broad knowledge — physics, history, code, languages, the structure of legal documents. Pre-training a frontier model in 2025–2026 costs tens to hundreds of millions of dollars and requires specialised compute infrastructure (GPU clusters, custom networking, extensive safety review).

Post-training is the cheaper but vital phase that turns a raw text-completion engine into a useful assistant. Includes supervised fine-tuning (showing the model pairs of question and good answer), RLHF or RLAIF (reinforcement learning from human or AI feedback), and increasingly reasoning training (where the model is taught to think step-by-step before answering). The differences between Claude, GPT, and Gemini in feel and quality come mostly from post-training choices, not from the underlying architecture.

4. Inference — what happens when you call the API

From prompt to response: the inference flow

When you send a prompt to a frontier model, the model runs inference: it processes your prompt as input tokens, then generates output tokens one at a time, sampling from the probability distribution it computes for each next token. The randomness can be controlled with the temperature setting — lower temperature for deterministic tasks, higher for creative ones.

Three practical things to know about inference:

Latency matters. Long prompts take longer to process. Long outputs take longer to generate. For real-time use cases, choose a smaller and faster model where possible.
Cost scales with both input and output tokens. Output tokens are typically more expensive than input. Anthropic, OpenAI, and Google publish per-million-token pricing — model your monthly cost before deploying anything at scale.
Determinism is not the default. Even at temperature 0, identical prompts can produce different outputs. Build for variance.

5. Where they break — the failure modes you must know

The 5 LLM failure modes every team must know

The five categories of LLM failure that hit production deployments most often:

Hallucinations. The model produces a confident, plausible-sounding answer that is factually wrong. The single biggest risk in deploying LLMs without guardrails. The model does not know what it does not know. Mitigations include retrieval-augmented generation (RAG), explicit citation requirements, and human review of consequential outputs.

Outdated knowledge. The model's pre-training has a cutoff date. Everything after that date is unknown to the model unless you provide it via context. Asking Claude about events last week without context will produce confident-sounding fiction.

Prompt injection. A security failure mode where adversarial input (in a document the model reads, or a user message) overrides the original instructions. The defining attack pattern for AI agents in 2026.

Drift on long contexts. Even with 200K-token context windows, model performance often degrades when relevant information is buried in the middle of a long input. Known as the "lost in the middle" problem.

Sycophancy and over-confidence. Models are trained on RLHF, which optimises for helpfulness — and can drift toward telling users what they want to hear. Critical evaluation requires explicit prompting ("evaluate critically and disagree where appropriate").

6. Choosing a model in 2026

The major frontier model families in mid-2026:

Claude (Anthropic) — Sonnet for most workloads, Opus for hardest reasoning. Strong tool use and long-context performance.
GPT (OpenAI) — GPT-5 family. Strong general performance, broad ecosystem, native reasoning models (o3 family).
Gemini (Google) — strong multimodal and search-grounded tasks; especially competitive on long context and Google Workspace integration.
Open-weight models — Llama (Meta), Qwen (Alibaba), DeepSeek, Mistral. Necessary when data residency or cost requires self-hosting.

Pick based on evaluation against your specific task — not on benchmark headlines. Vellum's LLM Leaderboard is one useful starting point but should not substitute for your own evals.

7. Where to go next

Two natural next steps once the fundamentals are in place. Embeddings and vector databases covers how models represent meaning, which is foundational for retrieval-augmented systems. Production RAG patterns covers how to build systems that ground LLM outputs in your own data — the single most common pattern in enterprise AI in 2026.

For Malaysian teams ready to apply this in practice, our AI Engineering programme covers production LLM engineering end-to-end, HRDC SBL-KHAS claimable for eligible employers.

LLM Fundamentals: How They Actually Work (and Where They Break)

1. Tokens — the unit of everything

2. The transformer — what is happening inside

3. Training — where the "intelligence" comes from

How an LLM is trained (2026)

4. Inference — what happens when you call the API

From prompt to response: the inference flow

5. Where they break — the failure modes you must know

The 5 LLM failure modes every team must know

6. Choosing a model in 2026

7. Where to go next

About the author

Sources & References

More in AI Fundamentals

Reasoning Models in 2026: o3, DeepSeek R1, and Claude Extended Thinking

Model Context Protocol (MCP) Explained: The 2026 Standard for AI Tools

Production RAG: Patterns, Pitfalls, and What Actually Works

Frequently Asked Questions

Want to apply this in your organisation?