How do LLMs even work?

Large Language Models (LLMs) are probabilistic models, typically based on the transformer architecture, trained via gradient-based machine learning to predict the next token in a sequence.

They don’t “think” or maintain persistent memory. During inference, a pre-trained model processes an input sequence and generates output token-by-token. LLMs are stateless across requests, but fully condition on all tokens within the current context window.

Key parameters and concepts:

context window size: the maximum number of tokens (input + output) the model can process in a single request. At the time of writing, frontier models reach roughly 1M tokens, and 100k–250k is typical for strong models.
temperature: controls randomness in token sampling. Lower values bias toward high-probability tokens (more deterministic); higher values increase diversity by allowing lower-probability tokens to be selected.
weights and fine-tuning: the model consists of learned parameters (“weights”) arranged in matrices across layers. These encode statistical relationships between tokens. Fine-tuning adjusts these weights to specialise behaviour on specific data or tasks.

Tokenisation

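LLMs operate on tokens, not raw text. Tokens are usually subword units (e.g. “un”, “like”, “ly”). The exact split and the ids assigned depend on the tokeniser, so the values in the diagram are illustrative.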

raw text:      "unlikely"
                   │
            ┌──────┴─────┐
tokens:    [un]  [like] [ly]
            │      │      │
token ids: [348] [2193] [306]

Implications:

cost scales with token count, not characters
prompt design must consider token efficiency
edge cases (code, JSON, whitespace) matter
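The split in the diagram can be reproduced with a toy tokeniser. This is a minimal sketch, assuming a hypothetical three-entry vocabulary and greedy longest-match segmentation. Production tokenisers (byte-pair encoding and similar) use learned merge rules and much larger vocabularies, and `toy_vocab` and `tokenize` are names invented for this example.

```rust
use std::collections::HashMap;

/// Hypothetical toy vocabulary; the ids mirror the diagram and are illustrative only.
fn toy_vocab() -> HashMap<&'static str, u32> {
    HashMap::from([("un", 348), ("like", 2193), ("ly", 306)])
}

/// Greedy longest-match segmentation: at each position, take the longest
/// vocabulary entry that is a prefix of the remaining text.
/// Returns None when part of the input cannot be covered by the vocabulary.
fn tokenize(text: &str, vocab: &HashMap<&'static str, u32>) -> Option<Vec<(String, u32)>> {
    let mut out = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Candidate end offsets on character boundaries.
        let ends: Vec<usize> = rest
            .char_indices()
            .map(|(i, c)| i + c.len_utf8())
            .collect();
        let mut matched = false;
        // Try the longest candidate first.
        for &end in ends.iter().rev() {
            if let Some(&id) = vocab.get(&rest[..end]) {
                out.push((rest[..end].to_string(), id));
                rest = &rest[end..];
                matched = true;
                break;
            }
        }
        if !matched {
            return None;
        }
    }
    Some(out)
}
```

With this vocabulary, `tokenize("unlikely", &toy_vocab())` yields the three tokens from the diagram, so an eight-character word costs three tokens. An input such as "unknown" returns `None`, because nothing in the toy vocabulary covers "known".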

Next-token prediction

At its core, an LLM repeatedly does:

Encode input tokens into vectors (embeddings)
Pass them through transformer layers (attention + MLPs)
Produce logits (unnormalised scores over the vocabulary, which softmax turns into probabilities)
Sample/select the next token
Append token and repeat

input: [The] [cat] [sat]
         │     │     │
         ▼     ▼     ▼
┌────────────────────────┐
│      Embedding         │  map token ids → dense vectors
└───────────┬────────────┘
            ▼
┌────────────────────────┐
│  Transformer Layer ×N  │  self-attention + feed-forward
└───────────┬────────────┘
            ▼
┌────────────────────────┐
│   Logits (vocab size)  │  unnormalised scores over all tokens
└───────────┬────────────┘
            ▼
┌────────────────────────┐
│  Sampling / Argmax     │  pick next token
└───────────┬────────────┘
            ▼
           [on]  ← append to sequence, repeat

This loop continues until a stop condition is reached.
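The loop itself is small once the model is treated as a black box. The sketch below assumes a stand-in `toy_model` that maps a token sequence to logits. It is a hypothetical lookup rather than a transformer, and it uses greedy (argmax) selection instead of sampling. The vocabulary and all function names are invented for illustration.

```rust
/// Hypothetical seven-token vocabulary; indices act as token ids.
const VOCAB: [&str; 7] = ["The", "cat", "sat", "on", "the", "mat", "<eos>"];

/// Stand-in for embedding + transformer layers: returns one logit per vocabulary entry.
/// It looks only at the last token and strongly prefers the next entry in VOCAB.
fn toy_model(context: &[usize]) -> Vec<f64> {
    let preferred = match context.last() {
        Some(&t) => (t + 1).min(VOCAB.len() - 1),
        None => 0,
    };
    (0..VOCAB.len())
        .map(|i| if i == preferred { 5.0 } else { 0.0 })
        .collect()
}

/// Index of the largest logit (greedy decoding).
fn argmax(logits: &[f64]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Autoregressive loop: get logits, pick a token, append it, repeat.
/// Stops on the end-of-sequence token or after `max_new_tokens` steps.
fn generate<F>(model: F, prompt: &[usize], eos: usize, max_new_tokens: usize) -> Vec<usize>
where
    F: Fn(&[usize]) -> Vec<f64>,
{
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new_tokens {
        let logits = model(&tokens);
        let next = argmax(&logits);
        tokens.push(next);
        if next == eos {
            break;
        }
    }
    tokens
}
```

Starting from `[0, 1, 2]` ("The cat sat") with `max_new_tokens = 1`, the loop appends token 3 ("on"), as in the diagram. With a larger limit it keeps going until the toy model emits `<eos>`.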

Attention (the core primitive)

Transformers rely on self-attention:

Each token attends to itself and to every earlier token in the sequence (decoder-only LLMs mask out later positions)
Attention weights determine relevance between tokens

Sequence: [The] [cat] [sat] [on] [the] [___]

Attention from [___] to all previous tokens:

[The]  ░░░░░░░░░░░░░░░░░░             low
[cat]  ████████████████████████████   high  ← subject
[sat]  ██████████████████████         med   ← verb
[on]   ████████████████████████████   high  ← preposition
[the]  ███████████████                med   ← article
                                      ────────────►
                                       attention weight

Intuition: Instead of fixed rules, the model dynamically decides:

“Which previous tokens matter for predicting the next one?”

This is why LLMs can:

track long-range dependencies
follow instructions
mimic structure (e.g. code, JSON)
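The attention weights in the diagram come from scaled dot-product attention: the query vector of the current position is compared with the key vector of each visible position, and softmax turns the scores into weights. The sketch below covers a single query and a single head, with plain `Vec<f64>` vectors standing in for learned projections. Function names are invented for this example.

```rust
/// Softmax over a slice of scores, shifted by the maximum for numerical stability.
fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Attention weights for one query over the keys it is allowed to see.
/// Causal masking is expressed by passing only keys at or before the query position.
fn attention_weights(query: &[f64], keys: &[Vec<f64>]) -> Vec<f64> {
    let scale = (query.len() as f64).sqrt();
    let scores: Vec<f64> = keys
        .iter()
        .map(|k| query.iter().zip(k).map(|(q, x)| q * x).sum::<f64>() / scale)
        .collect();
    softmax(&scores)
}

/// Attention output: the weighted sum of the value vectors.
fn attend(weights: &[f64], values: &[Vec<f64>]) -> Vec<f64> {
    let dim = values.first().map_or(0, |v| v.len());
    let mut out = vec![0.0; dim];
    for (w, v) in weights.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += w * x;
        }
    }
    out
}
```

Keys that point in the same direction as the query receive larger weights, and the weights always sum to 1. A real transformer runs many such heads in parallel at every layer.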

Sampling controls (beyond temperature)

Temperature is only one lever. Others include:

top-k: restrict sampling to the k most likely tokens
top-p (nucleus sampling): restrict sampling to the smallest set of tokens whose cumulative probability is ≥ p
frequency / presence penalties: discourage repetition

These settings influence:

determinism
verbosity
hallucination rate (indirectly, through which low-probability tokens are allowed)

Probabilities after softmax (the “prob” column is the distribution at temperature=1.0; dividing the logits by 0.2 is equivalent to raising each probability to the 5th power and renormalising; bars are approximate):

token   prob     temperature=0.2               temperature=1.0
─────   ─────    ──────────────────────────    ──────────────────
"mat"   0.45     ███████████████████  ~0.95    █████████
"rug"   0.25     █                    ~0.05    █████
"bed"   0.15     ░                    <0.01    ███
"hat"   0.10     ░                    <0.01    ██
"sky"   0.05     ░                    <0.01    █
                 ▲ concentrated                ▲ spread out
                 (nearly deterministic)        (more creative)

With top-k=3:    only [mat, rug, bed] are candidates
With top-p=0.85: only [mat, rug, bed] (cumulative 0.85)
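The three controls compose in a fixed order: scale the logits by temperature, apply softmax, filter the candidates, then draw from what remains. The sketch below is one way to write that pipeline with only the standard library. The uniform random number `u` is passed in by the caller, so the example stays deterministic. All function names are invented for illustration.

```rust
/// Softmax with temperature: divide the logits by T before normalising.
fn softmax_with_temperature(logits: &[f64], temperature: f64) -> Vec<f64> {
    let scaled: Vec<f64> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scaled.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Token indices sorted by descending probability (ties keep index order).
fn ranked(probs: &[f64]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    idx
}

/// top-k: keep the k most likely tokens.
fn top_k(probs: &[f64], k: usize) -> Vec<usize> {
    ranked(probs).into_iter().take(k).collect()
}

/// top-p: keep the shortest prefix of the ranking whose cumulative probability reaches p.
fn top_p(probs: &[f64], p: f64) -> Vec<usize> {
    let mut kept = Vec::new();
    let mut cumulative = 0.0;
    for i in ranked(probs) {
        kept.push(i);
        cumulative += probs[i];
        if cumulative >= p {
            break;
        }
    }
    kept
}

/// Draw one token from the candidates after renormalising their probabilities.
/// `u` is a uniform random number in [0, 1) supplied by the caller.
fn sample(probs: &[f64], candidates: &[usize], u: f64) -> usize {
    let total: f64 = candidates.iter().map(|&i| probs[i]).sum();
    let mut acc = 0.0;
    for &i in candidates {
        acc += probs[i] / total;
        if u < acc {
            return i;
        }
    }
    *candidates.last().expect("at least one candidate")
}
```

Feeding in logits equal to the natural log of the “prob” column reproduces that column at temperature 1.0, and puts roughly 95% of the mass on "mat" at temperature 0.2. A top-p threshold that sits exactly on a cumulative sum (such as 0.85 here) is sensitive to floating-point rounding, so real implementations should not rely on the boundary case.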

Why hallucinations happen

LLMs optimise for:

“What token is statistically likely next?”

—not:

“What is true?”

So they can:

confidently generate plausible but incorrect information
fill gaps when context is missing
prefer fluency over factuality

Mitigations:

better prompting
retrieval (RAG)
constrained decoding
fine-tuning
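Of these, constrained decoding is the easiest to show in a few lines: before selecting a token, discard every candidate the output format does not allow. The sketch below assumes the caller already knows which token indices are valid at the current step (for example, from a grammar or JSON schema, which is not modelled here). `constrained_argmax` is a name invented for this example.

```rust
/// Constrained greedy decoding: ignore every token outside `allowed`,
/// then take the highest-scoring remaining token.
/// Returns None if `allowed` contains no index that is in range.
fn constrained_argmax(logits: &[f64], allowed: &[usize]) -> Option<usize> {
    allowed
        .iter()
        .copied()
        .filter(|&i| i < logits.len())
        .max_by(|&a, &b| logits[a].total_cmp(&logits[b]))
}
```

With logits `[2.0, 0.5, 3.0, 1.0]` and only indices 1 and 3 allowed, the function returns index 3, even though index 2 has the highest score overall. This guarantees well-formed output, but it does not make the content true.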

Fine-tuning vs prompting vs RAG

Three different levers:

prompting: steer behaviour at runtime (cheap, flexible)
fine-tuning: modify weights (expensive, persistent)
RAG (retrieval-augmented generation): inject external knowledge at inference

Rule of thumb:

behaviour → prompt
knowledge → RAG
style/consistency → fine-tune
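RAG is the least obvious of the three, so here is a minimal sketch of its two steps: retrieve relevant passages, then inject them into the prompt. Retrieval here scores passages by shared words, which is a crude stand-in for the embedding similarity search a real system would likely use. The function names and the prompt wording are assumptions made for this example.

```rust
use std::collections::HashSet;

/// Lowercased set of alphanumeric words in a text.
fn words(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Return up to `top_n` passages that share at least one word with the query,
/// ordered by number of shared words.
fn retrieve<'a>(query: &str, docs: &[&'a str], top_n: usize) -> Vec<&'a str> {
    let q = words(query);
    let mut scored: Vec<(usize, &'a str)> = docs
        .iter()
        .map(|&d| (words(d).intersection(&q).count(), d))
        .filter(|(score, _)| *score > 0)
        .collect();
    // Stable sort: passages with equal scores keep their original order.
    scored.sort_by(|a, b| b.0.cmp(&a.0));
    scored.into_iter().take(top_n).map(|(_, d)| d).collect()
}

/// Inject the retrieved passages into the prompt ahead of the question.
fn build_prompt(question: &str, passages: &[&str]) -> String {
    let mut prompt = String::from("Answer using only the context below.\n\nContext:\n");
    for p in passages {
        prompt.push_str("- ");
        prompt.push_str(p);
        prompt.push('\n');
    }
    prompt.push_str("\nQuestion: ");
    prompt.push_str(question);
    prompt
}
```

The model's weights are untouched. The knowledge arrives as ordinary tokens in the context window, which is why RAG suits facts that change often or that the model was never trained on.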

Harnesses

To interface with LLMs in practice, we build applications around them, called harnesses. A harness bounds the LLM’s probabilistic behaviour, extends it with configuration and tools, and steers it towards more predictable outcomes.

A good harness has:

a loop to feed a continuous conversation into the model
configuration options or an interface, for customising model behaviour
observability, to allow users to adjust their inputs based on how the model responds
a toolset, to allow the model to perform tasks

┌─────────────────────────────────────────────────┐
│                   Harness                       │
│                                                 │
│   ┌───────────┐    ┌───────────┐    ┌─────────┐ │
│   │  Config   │    │    LLM    │    │  Tools  │ │
│   │ (prompts, │───▶│  (infer)  │───▶│ (act on │ │
│   │  params)  │    │           │    │  world) │ │
│   └───────────┘    └─────┬─────┘    └────┬────┘ │
│                          │               │      │
│                    ┌─────▼───────────────▼───┐  │
│                    │     Conversation loop   │  │
│                    │  (accumulate + re-send) │  │
│                    └────────────┬────────────┘  │
│                                 │               │
│                    ┌────────────▼────────────┐  │
│                    │     Observability       │  │
│                    │  (logs, metrics, traces)│  │
│                    └─────────────────────────┘  │
└─────────────────────────────────────────────────┘

The diagram above describes a chatbot harness — user input in, text out. An agent harness adds a feedback path: when the model’s output contains tool calls, the harness executes them and appends the results to the conversation before the next inference call.

This feedback path makes the harness a loop. The loop introduces several concerns that a single-turn harness does not have:

streaming: tokens arrive incrementally, but tool calls must be fully assembled before execution
interrupts: users need to be able to abort a loop heading in the wrong direction, and external systems may need to preempt it with urgent events — the loop must support pause, yield, and resume
context growth: each tool call and result adds tokens to the transcript, which will eventually exceed the context window
concurrency: independent tool calls benefit from parallel execution, but the model needs all results before it can continue
safety: the model can request arbitrary actions — the harness must decide which ones to permit
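To make one of these concerns concrete, here is a sketch of the simplest possible answer to context growth: drop the oldest non-system messages until the transcript fits a token budget. It is an illustration only, not how `agentkit-compaction` works. The `Message` type is hypothetical, and tokens are estimated by counting whitespace-separated words rather than by running a tokeniser.

```rust
/// Hypothetical transcript entry for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: &'static str,
    text: String,
}

impl Message {
    fn new(role: &'static str, text: &str) -> Self {
        Message { role, text: text.to_string() }
    }
}

/// Crude token estimate: whitespace-separated words.
fn estimate_tokens(m: &Message) -> usize {
    m.text.split_whitespace().count()
}

/// Drop the oldest non-system messages until the estimate fits the budget.
/// System messages are always kept, so the result can still exceed the budget.
fn compact(transcript: &[Message], budget: usize) -> Vec<Message> {
    let mut kept: Vec<Message> = transcript.to_vec();
    let mut total: usize = kept.iter().map(estimate_tokens).sum();
    while total > budget {
        match kept.iter().position(|m| m.role != "system") {
            Some(i) => {
                total -= estimate_tokens(&kept[i]);
                kept.remove(i);
            }
            None => break, // only system messages remain
        }
    }
    kept
}
```

Dropping messages is lossy: the model forgets whatever was removed. Summarising old turns instead of deleting them trades extra inference calls for better recall.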

Chat harness (open loop):

  User ──▶ Model ──▶ Text ──▶ User


Agent harness (closed loop):

  User ──▶ Model ──┬──▶ Text ──▶ User
                   │
                   ├──▶ Tool call
                   │       │
                   │    Execute
                   │       │
                   │    Result
                   │       │
                   └───────┘  ← feed back, model continues
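The closed loop in the diagram fits in one function. The sketch below is not the agentkit API: `ModelOutput`, `Entry`, `run_agent`, `scripted_model` and `toy_tools` are all hypothetical names, the model is a scripted stand-in, and the tool runner doubles as a one-line permission check. Streaming, interrupts and parallel tool calls are left out.

```rust
/// What one model turn can produce in this sketch.
#[derive(Debug, Clone, PartialEq)]
enum ModelOutput {
    Text(String),
    ToolCall { name: String, input: String },
}

/// One transcript entry.
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    User(String),
    Assistant(String),
    ToolResult { name: String, output: String },
}

/// Closed loop: call the model, execute any tool call, append the result,
/// and repeat until the model answers in text or `max_steps` is reached.
fn run_agent<M, T>(mut model: M, mut run_tool: T, user_input: &str, max_steps: usize) -> Vec<Entry>
where
    M: FnMut(&[Entry]) -> ModelOutput,
    T: FnMut(&str, &str) -> Result<String, String>,
{
    let mut transcript = vec![Entry::User(user_input.to_string())];
    for _ in 0..max_steps {
        match model(transcript.as_slice()) {
            ModelOutput::Text(text) => {
                transcript.push(Entry::Assistant(text));
                break; // plain text goes back to the user
            }
            ModelOutput::ToolCall { name, input } => {
                // Failures, including denied calls, are fed back so the model can react.
                let output = match run_tool(name.as_str(), input.as_str()) {
                    Ok(out) => out,
                    Err(err) => format!("error: {}", err),
                };
                transcript.push(Entry::ToolResult { name, output });
            }
        }
    }
    transcript
}

/// Hypothetical scripted "model": requests `ls` once, then answers from the result.
fn scripted_model(transcript: &[Entry]) -> ModelOutput {
    let result = transcript.iter().find_map(|e| match e {
        Entry::ToolResult { output, .. } => Some(output),
        _ => None,
    });
    match result {
        Some(output) => {
            ModelOutput::Text(format!("I found {} files.", output.split_whitespace().count()))
        }
        None => ModelOutput::ToolCall { name: "ls".to_string(), input: ".".to_string() },
    }
}

/// Hypothetical tool runner with an allowlist: only `ls` is permitted.
fn toy_tools(name: &str, _input: &str) -> Result<String, String> {
    match name {
        "ls" => Ok("a.txt b.txt".to_string()),
        other => Err(format!("tool not permitted: {}", other)),
    }
}
```

Calling `run_agent(scripted_model, toy_tools, "How many files?", 5)` produces a three-entry transcript: the user message, the `ls` result, and the final answer. The `max_steps` limit is what keeps a model that never stops calling tools from looping forever.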

From harness to toolkit

A minimal agent loop is straightforward to implement. Handling all of the above — and composing cleanly into different host applications (a CLI, a web server, a multi-agent system) — requires deliberate decomposition.

agentkit splits the agent harness into independent crates, each responsible for one concern:

Concern	agentkit crate
Transcript data model	`agentkit-core`
Agent loop and driver	`agentkit-loop`
Tool abstraction	`agentkit-tools-core`
Filesystem and shell tools	`agentkit-tool-fs`, `agentkit-tool-shell`
Permission system	`agentkit-capabilities`
Context loading	`agentkit-context`
Transcript compaction	`agentkit-compaction`
MCP integration	`agentkit-mcp`
Task management	`agentkit-task-manager`
Observability	`agentkit-reporting`
Provider adapters	`agentkit-provider-*`

Each crate can be used independently. The core loop is agnostic to the model provider, tool set, and presentation layer. The rest of this book builds up each piece, starting from the loop itself.

Chapter 2: What is an agent loop? →
