Here is how I built a simple coding agent

Walking through spek, a small LLM-powered coding agent that turns a markdown spec into a working, tested Python package — and the six design questions I had to answer to build it.

spek is a side project: a small LLM-powered coding agent that takes a plain-language description of a program and produces a working, tested Python package. You write a markdown file describing what you want, point spek at it, and a few minutes later there’s an installable package on disk that builds and passes its own tests.

This post walks through one example end-to-end, then digs into the design questions I had to answer along the way and what I picked for each.

The full source is at github.com/mkarots/spek.

A worked example: the weather converter

Here is the entire input that produces a working CLI tool. It lives at examples/weather-tool/SPEC.md:

# Name
Weather-converter

this tool converts weather from one metric system to another.
i.e from fahrenheit to celsius and the opposite

# Input
- --from: string, one of (F, C)
- --to: string, one of (F,C)
- value: positional argument, float, the number of degrees

# Output
The program outputs in stdout the converted value to the desired format

# Implementation
Language: python

That is intentionally terse — fewer than twenty lines, with plenty left unsaid (what should the output look like exactly? what happens if --from and --to are the same? what about invalid units?). The agent has to either fill in those gaps itself or ask the user via the epistemic tool.

The CLI surface is deliberately tiny — two subcommands and that’s it:

$ spek --help
usage: spek [-h] {init,build} ...

Agentic spec-to-code builder.

positional arguments:
  {init,build}
    init        Scaffold an empty SPEC.md and .spek/.
    build       Run the agent against a SPEC.md spec.

options:
  -h, --help    show this help message and exit

spek init <dir> writes a SPEC.md skeleton containing just the four required headers (Name, Input, Output, Implementation) plus an empty .spek/ directory in the target folder. It exists so a new user can go from “I have an idea” to “I have a valid spec to fill in” in one command, instead of guessing the section names from the docs.

spek build <spec> --workdir <dir> is the agent itself. It validates the spec, boots a Docker container with the workdir bind-mounted at /work, walks the three-phase loop (clarify → plan → execute), and either succeeds with green build+tests or exits non-zero with a clear reason. It accepts a few flags worth knowing about:

  • --fresh wipes .spek/ before starting. Use this when you’ve changed the spec significantly and want the agent to re-plan from scratch instead of resuming from a stale journal.
  • --confirm-plan pauses after phase 2 and waits for you to approve (or edit) .spek/plan.md before any code is written. Recommended for anything you actually care about.
  • --max-steps and --max-seconds cap how far the agent can run before giving up. The defaults (120 tool calls, 30 minutes) are generous enough for most specs.

So the canonical first-run looks like:

spek init my-project
$EDITOR my-project/SPEC.md            # describe what you want
spek build my-project/SPEC.md --workdir my-project --confirm-plan

For the worked example below, the spec already exists:

spek build examples/weather-tool/SPEC.md --workdir /tmp/weather

A few minutes later, /tmp/weather contains a normal Python project — pyproject.toml, source under src/, tests under tests/ — and:

➜  weather-tool git:(main) ✗ uv run weather-converter --from C --to F 0
32.00
➜  weather-tool git:(main) ✗ uv run weather-converter --from F --to C 32
0.00
➜  weather-tool git:(main) ✗ uv run pytest
============================== 27 passed in 0.10s ==============================

The two-decimal-place output is the agent honouring the answer the user gave during the clarify phase (“2 decimal places”), not a hardcoded assumption.

Here is what the run actually looks like (one real session, lightly elided in the middle for readability):

[   0s] spek: build start — spec=SPEC.md workdir=/.../examples/weather-tool
[   4s] ==> phase clarify
[   4s]   .. [clarify] LLM turn 1
[  12s]    * tool epistemic(What should the output format/precision be ...)

--- spek needs input ---------------------------------------
Question      : What should the output format/precision be for the converted value?
                Should it be rounded to a specific number of decimal places?
Why blocked   : The spec says "outputs the converted value" but doesn't specify
                the numeric format or decimal precision.
How to resolve: A brief description like "2 decimal places" or "print as-is".
> 2 decimal places
[  18s]    = epistemic ok: 2 decimal places
[  18s]   .. [clarify] LLM turn 2
[  22s]    * tool epistemic(What should happen if --from and --to are the same ...)

--- spek needs input ---------------------------------------
Question      : What should happen if --from and --to are the same (e.g., both F)?
Why blocked   : The spec doesn't clarify this edge case.
How to resolve: Either "passthrough (output same value)" or "raise an error".
> passthrough (output same value)
[  29s]    = epistemic ok: passthrough (output same value)
[  35s] <== phase clarify done (3 turns)

[  35s] ==> phase plan
[  38s]    * tool grep('pyproject' in .)
[  46s]    * tool write_file(.spek/plan.md (1340 bytes))
[  48s] <== phase plan done (3 turns)

[  48s] ==> phase execute
[  51s]    * tool read_file(.spek/plan.md)
[  55s]    * tool write_file(pyproject.toml (403 bytes))
[  58s]    * tool bash(cd /work && uv sync 2>&1)             exit_code=1
[  61s]    * tool write_file(weather_converter/__init__.py)
[  64s]    * tool bash(cd /work && uv sync 2>&1)             exit_code=0
... 16 more execute turns: write_file, bash, write_file ...
[ 146s]    * tool bash(cd /work && uv build 2>&1)            exit_code=0
[ 150s]    * tool bash(cd /work && uv run pytest -v 2>&1)    exit_code=0
[ 162s] <== phase execute done (24 turns)
[ 162s] spek: ✓ build+test green (tests passed: 27)

A few things worth pointing out before we get to the design:

  • The two epistemic prompts are the agent recognising real ambiguity in the spec (“how should the output be formatted?” and “what about same-unit conversions?”) and asking instead of guessing.
  • The first uv sync exits 1 — the agent had written pyproject.toml before creating the package directory. It noticed, fixed it, and re-ran. That’s the loop working as intended.
  • The run terminates because the configured build and test commands both exit 0 and 27 tests were collected — not because the model said “I’m done.” More on this in the design section below.

That’s the whole user experience. The interesting stuff is what happens between the spec and the package.

How it works, and why

Before writing any code I had to answer six questions. They look mundane written down, but each one has at least two reasonable answers, and the choice shapes everything else. I’ll list them first so you can think about them yourself, then walk through what I picked and why.

  1. What does “specification” actually mean as an input format? Should it be completely free-form, or have some structure?
  2. Which tools do you give the model, and why?
  3. Where does the agent run? It’s going to execute arbitrary code; that has to happen somewhere.
  4. How do you know when the agent is done? What conditions actually mark completion?
  5. What happens if the agent crashes mid-run? Can we checkpoint and resume, or do we start over?
  6. How do you inspect what the agent did? During a run and after.

1. What does “specification” mean?

The two extremes:

  • Free-form text. The user types whatever they want, the LLM figures it out. Best UX, lowest learning curve. Worst determinism — the same spec on two different runs can produce wildly different programs.
  • Strict schema. JSON or YAML with required fields, validated up front. Maximally reproducible, but now the user has to learn the schema before writing their first program, and you keep discovering fields the schema doesn’t cover.

What I picked: markdown with a small set of required headers. A spec is a markdown file with # Name, # Input, # Output, and # Implementation sections, plus any number of optional # <key> sections for per-parameter detail (you can see all of these in the weather example above).

Why: markdown is the path of least resistance — anyone who has used GitHub already knows it. The required headers give the LLM something predictable to anchor every prompt to (“look at the # Input section and propose a CLI signature”), and let me reject malformed specs at parse time before spending a single token. Extra sections are preserved verbatim, so the format never gets in the user’s way.
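That parse-time rejection is cheap to sketch. Here is a minimal parser in that spirit (my own illustration, not spek’s actual code): it splits the file on # Header lines and refuses the spec if any required section is missing, while preserving any extra sections verbatim.

```python
import re

REQUIRED_HEADERS = ("Name", "Input", "Output", "Implementation")

def parse_spec(text: str) -> dict:
    """Split a markdown spec into {header: body} sections and reject it
    before any LLM call if a required header is missing."""
    sections: dict = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"#\s+(.+?)\s*$", line)
        if m:
            current = m.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    sections = {k: "\n".join(v).strip() for k, v in sections.items()}
    missing = [h for h in REQUIRED_HEADERS if h not in sections]
    if missing:
        raise ValueError(f"spec is missing required sections: {missing}")
    return sections
```

Note that an optional section like # Extra lands in the same dict untouched, which is exactly the “preserved verbatim” behaviour described above.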

If I were starting again I’d add a step where the agent itself can lift a piece of free-form text into this structured form during the clarify phase. That keeps onboarding zero-effort while still giving the rest of the pipeline a reliable shape to work with.

2. Which tools to give the model?

The temptation is to give the model everything. A “Linux command” tool, a Python REPL, an HTTP client, a package installer, a code formatter, an AST editor. The more it has, the more it can do, right?

In practice the opposite is true. Each new tool widens the model’s decision tree without necessarily giving it a new capability — “format the code” and “run black .” are the same action, and offering both just introduces a new way to choose wrong. Too few tools and the model gets stuck; too many and it spends turns picking between near-duplicates.

What I picked: five tools, total.

  • read_file: Read a file inside the working directory.
  • write_file: Create or overwrite a file inside the working directory.
  • grep: Search for a string across the working directory.
  • bash: Run a shell command inside the sandbox.
  • epistemic: Pause the run and ask the human a question; the answer comes back as text.

The first four are obvious. The last one is the one I’d defend hardest: epistemic turns “I don’t know what unit the user wants” from a blocker into a tool call. The model writes the question, the agent prints it to the terminal, the user types an answer, and that answer is fed back into the conversation. It’s a tiny primitive but it shows up constantly — anything genuinely ambiguous in the spec gets resolved this way instead of guessed.

I also use the SDK’s built-in tool-calling support rather than asking the model to write tool calls as text and parsing them with regex. That seems like a small thing in retrospect, but it saves you from owning a fragile parser for free-form model output.
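For a sense of scale, here is roughly what declaring those five tools looks like in the JSON-schema style most LLM SDKs accept for native tool calling. The tool names come from above; the field names and parameter shapes are assumptions for illustration, and vary by provider.

```python
# Hypothetical tool declarations; the exact schema envelope differs per SDK.
def _tool(name, description, properties, required):
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }

TOOLS = [
    _tool("read_file", "Read a file inside the working directory.",
          {"path": {"type": "string"}}, ["path"]),
    _tool("write_file", "Create or overwrite a file inside the working directory.",
          {"path": {"type": "string"}, "content": {"type": "string"}},
          ["path", "content"]),
    _tool("grep", "Search for a string across the working directory.",
          {"pattern": {"type": "string"}}, ["pattern"]),
    _tool("bash", "Run a shell command inside the sandbox.",
          {"command": {"type": "string"}}, ["command"]),
    _tool("epistemic", "Pause the run and ask the human a question.",
          {"question": {"type": "string"},
           "why_blocked": {"type": "string"},
           "how_to_resolve": {"type": "string"}}, ["question"]),
]
```

The whole surface fits on one screen, which is the point: the model never has to choose between near-duplicates.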

3. Where does the agent run?

The choices:

  • Directly on the host. Simplest. Also: the model is going to run bash commands it generated itself. One bad rm and your laptop is having a worse day than you are.
  • In a VM. Strong isolation, slow startup, heavy.
  • In a Docker container. Decent isolation, fast startup, files can be shared with the host via bind mounts.

What I picked: Docker, and it’s required, not optional. Every bash and write_file call goes through a long-lived container started when the agent boots. The host working directory is bind-mounted into the container so files written inside are immediately visible outside (and owned by the right user, not root). If Docker isn’t running, spek build refuses to start and tells you why — it does not silently fall back to running on the host.

I’d rather the tool be unusable without Docker than have a “convenient” mode where a bad command takes out files outside the working directory.
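To make the container setup concrete, here is roughly what booting that sandbox amounts to, sketched as a function that just builds the docker run argv. The image name and flag set are my illustrative assumptions, not spek’s actual configuration; the two load-bearing pieces are the bind mount and the user mapping.

```python
import os

def docker_run_argv(workdir: str, image: str = "python:3.12-slim") -> list[str]:
    """Build the argv for a long-lived sandbox container (a sketch).
    The host workdir is bind-mounted at /work, and --user maps writes
    inside the container to the invoking user, not root."""
    uid, gid = os.getuid(), os.getgid()
    return [
        "docker", "run", "--detach",
        "--volume", f"{os.path.abspath(workdir)}:/work",
        "--workdir", "/work",
        "--user", f"{uid}:{gid}",
        image,
        "sleep", "infinity",   # keep the container alive between tool calls
    ]
```

Every subsequent bash tool call then becomes a docker exec against that one container, so state like installed packages survives between turns.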

4. How do you know when the agent is done?

There are two broad approaches:

  • Trust the model. Let it end its turn when it thinks it’s finished and treat that as the stop signal. Simple, and it works most of the time.
  • Verify externally. Don’t take the model’s word for it; check some deterministic condition about the actual state of the working directory.

What I picked: verify externally. The agent is “done” when:

  1. The configured build command has exited 0 in the recent journal, AND
  2. The configured test command has exited 0 in the recent journal, AND
  3. That test run collected at least one test.

The third condition is the important one. Without it, an agent that deletes the test directory satisfies “build green, tests green” — there just aren’t any tests left to fail. With it, the only way to terminate is to actually have tests that pass.

The “configured” build and test commands aren’t hard-coded — they’re read from a small command_config.json file the agent writes into .spek/ at the start of the run (e.g. "build_command": "uv build", "test_command": "uv run pytest" for the Python profile). That’s what keeps the termination check language-agnostic. More on this file below.

Why not trust the model? Mainly because the cost of a false positive is high — the user runs spek build, the agent reports success, and they only find out the package doesn’t actually work when they try to use it. A check based on real exit codes from real shell commands is cheap to write, deterministic, and easy to test. It also keeps the stop condition in one place that I can change without re-prompting anything.
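The three conditions reduce to a small pure function over the journal. Here is a sketch over an assumed event shape (field names like exit_code and tests_collected are my illustration; the real journal entries may differ):

```python
def build_is_done(events, config) -> bool:
    """Deterministic stop check: the most recent build and test results
    must both be green, and the test run must have collected >= 1 test."""
    build_green = test_green = False
    for ev in events:
        if ev.get("kind") != "tool_result":
            continue
        if config["build_command"] in ev.get("command", ""):
            build_green = ev.get("exit_code") == 0
        if config["test_command"] in ev.get("command", ""):
            # Condition 3: a green test run that collected zero tests
            # (e.g. the test directory was deleted) must NOT count as done.
            test_green = (ev.get("exit_code") == 0
                          and ev.get("tests_collected", 0) >= 1)
    return build_green and test_green
```

Because later journal entries overwrite earlier ones, a build that went red after previously being green correctly un-finishes the run.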

5. What happens if the agent crashes mid-run?

The naive answer is “start over.” That’s fine for a 30-second run. For a multi-minute run that has already asked the user three clarifying questions and produced a plan, it’s painful — and worse, the user has to re-answer questions they already answered, which means the agent might make different decisions the second time.

What I picked: journal everything, replay on resume.

Every meaningful event — every assistant turn, every tool call, every tool result, every phase transition — is appended as a line to a single JSONL file (.spek/journal.jsonl) inside the working directory, with an fsync after each write, so a hard crash at any point loses at most the last entry.

On the next run, spek reads the journal, reconstructs the conversation exactly as it was, figures out which phase it was in from the most recent phase marker, and continues from there. If a tool call was in flight when the crash happened (an assistant turn with no matching result), the call is just replayed. The user’s clarifying answers are in the journal verbatim, so the resumed run sees exactly the same context.
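The append and replay halves are each a few lines. A minimal sketch, assuming one JSON object per line and a phase-marker entry shaped like {"kind": "phase", "name": ...} (the real field names may differ):

```python
import json, os

def journal_append(path: str, event: dict) -> None:
    """Append one event as a JSON line, then fsync so a hard crash
    loses at most the entry being written."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())

def journal_replay(path: str):
    """Reconstruct (events, current_phase) from the journal; the phase
    is whatever the most recent phase-marker entry says."""
    events, phase = [], None
    with open(path, encoding="utf-8") as f:
        for line in f:
            ev = json.loads(line)
            events.append(ev)
            if ev.get("kind") == "phase":
                phase = ev.get("name")
    return events, phase
```

Resume is then just: replay, rebuild the conversation from the events, and continue from the recovered phase.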

This paid off the first time the agent crashed: an early bug took it out right after clarification. I fixed the bug, re-ran the same command, and it picked up at the start of planning instead of asking every question again.

6. How do you inspect what the agent did?

This is the same problem as resume, viewed from a different angle. If the agent’s full history is durably written somewhere readable, both problems are solved at once.

The journal is plain JSONL — cat it, jq over it, open it in any editor. Each line is one event with a kind (user, assistant, tool_result, phase) and the full content. You can see exactly what the model said, what it tried to do, what came back, and where it went next.

I added one structural thing on top: the run is split into three named phases (clarify → plan → execute), and each phase boundary is its own journal entry. That makes the log easy to skim — “show me what happened during the plan phase” is a one-liner — and it gave me a place to hang extra constraints on each phase, like a smaller tool whitelist or a stricter system prompt. The plan phase, for instance, can only write to one specific file, which makes the resulting plan a stable artifact the user can read and even edit before execution starts.
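That per-phase query really is tiny. A sketch, again assuming phase boundaries are entries like {"kind": "phase", "name": "plan"}:

```python
import json

def events_in_phase(journal_path: str, phase: str):
    """Yield journal events that fall between the given phase marker
    and the next phase marker."""
    inside = False
    with open(journal_path, encoding="utf-8") as f:
        for line in f:
            ev = json.loads(line)
            if ev.get("kind") == "phase":
                inside = (ev.get("name") == phase)
                continue
            if inside:
                yield ev
```

The same filter is a one-liner in jq, which is part of the appeal of keeping the journal as plain JSONL.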

Internals: what’s inside .spek/

Most of the agent’s state lives in a single hidden directory inside the working directory. After a successful run, <workdir>/.spek/ looks like this:

<workdir>/
├── SPEC.md                   # the user's spec (untouched by the agent)
└── .spek/
    ├── journal.jsonl         # full event log
    ├── plan.md               # the plan produced by phase 2
    └── command_config.json   # how to build / test / lint / run the project

There is no database, no global state, no ~/.config/spek — everything is scoped to the working directory. Wipe .spek/ and the next run starts from scratch (spek build --fresh does exactly this). Copy the working directory to another machine and the run is reproducible there.

journal.jsonl — append-only, one JSON object per line, fsync’d after each write. Each entry has a kind (user, assistant, tool_result, or phase) and the full content of that event. This is both the resume log (replay it to reconstruct the conversation) and the audit log (read it to see exactly what the model did and why). It’s the single source of truth for the run.

plan.md — written exactly once, during the plan phase, by the model itself. It’s a numbered checklist of steps the agent intends to execute. With --confirm-plan, the run pauses after this file is written and waits for user approval — and because it lives on disk as plain markdown, the user can edit it before approving. During the execute phase, the model is expected to mark steps [x] as it completes them, so reading plan.md during a long run tells you how far along it is.
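Because the checklist format is so regular, a progress readout is easy to compute. A sketch (my own helper, not part of spek) that counts ticked steps in a plan.md-style checklist:

```python
import re

def plan_progress(plan_text: str) -> tuple[int, int]:
    """Return (done, total) for numbered checklist lines like
    '1. [x] ...' or '3. [ ] ...'."""
    boxes = re.findall(r"^\s*\d+\.\s*\[( |x)\]", plan_text, flags=re.M)
    return sum(1 for b in boxes if b == "x"), len(boxes)
```

Pointing something like this at .spek/plan.md during a long run gives a cheap progress bar without touching the agent at all.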

For the weather-converter run shown earlier, the final plan.md looks like this — every step ticked [x] because the run terminated green:

# Plan

1. [x] Create pyproject.toml with project metadata, dependencies (pytest, ruff, black), and console script entry point `weather-converter`
   - Package name: `weather_converter`
   - Console script: `weather-converter = "weather_converter.cli:main"`
   - Dev dependencies: pytest, ruff, black
2. [x] Create package skeleton with `weather_converter/__init__.py`
3. [x] Implement conversion logic in `weather_converter/converter.py`
   - F→C: `(value - 32) * 5 / 9`
   - C→F: `value * 9 / 5 + 32`
   - Same→Same: passthrough
   - Output rounded to 2 decimal places
4. [x] Implement CLI entry point in `weather_converter/cli.py` using argparse
   - `--from` (dest: `from_unit`): required, choices `F`, `C`
   - `--to`: required, choices `F`, `C`
   - Positional `value`: float
   - Prints converted value to stdout with 2 decimal places
5. [x] Write unit tests for conversion logic in `tests/test_converter.py`
   - F→C (e.g., 32→0.00, 212→100.00)
   - C→F (e.g., 0→32.00, 100→212.00)
   - Passthrough (F→F, C→C)
   - Negative values, fractional values
6. [x] Write CLI integration tests in `tests/test_cli.py`
   - Verify stdout output for known conversions
   - Verify `--help` works
   - Verify invalid arguments produce errors
7. [x] Add `tests/__init__.py` and `conftest.py` if needed
8. [x] Run build and tests and confirm green

Two details worth noticing here:

  • The two clarifications the user gave during the epistemic exchanges — “2 decimal places” and same-unit “passthrough” — show up verbatim in steps 3 and 4. The plan isn’t generic boilerplate; it’s shaped by the conversation that produced it.
  • The bullets under each step are commitments the model is making to itself before any code is written (exact module names, exact CLI flag names, the conversion formulae). That gives the execute phase something concrete to refer back to and makes drift across turns much less likely.
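Those commitments are concrete enough that the core conversion function nearly writes itself. Here is a sketch consistent with the plan above (my illustration, not the code the agent actually produced):

```python
def convert(value: float, from_unit: str, to_unit: str) -> str:
    """Convert between F and C using the plan's formulas, formatted to
    two decimal places per the user's clarify-phase answer."""
    if from_unit == to_unit:
        result = value                        # passthrough, per clarification
    elif (from_unit, to_unit) == ("F", "C"):
        result = (value - 32) * 5 / 9
    elif (from_unit, to_unit) == ("C", "F"):
        result = value * 9 / 5 + 32
    else:
        raise ValueError(f"unsupported conversion {from_unit}->{to_unit}")
    return f"{result:.2f}"
```

Both clarify-phase answers (two decimal places, same-unit passthrough) are visible directly in the function, which is what it means for the plan to be shaped by the conversation.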

command_config.json — the language-specific commands the agent should use to operate on the project. For Python it looks like this:

{
  "language": "python",
  "package_name": "weather-converter",
  "build_command": "uv build",
  "test_command": "uv run pytest",
  "lint_command": "uv run ruff check .",
  "format_command": "uv run black --check .",
  "run_command": "uv run weather-converter --help"
}

This file is what makes the termination check work: the configured “build command” and “test command” the journal-walker is looking for are read from here, not hard-coded. It’s also the single place a future language profile (Node, Go, Rust) would have to populate, so adding support for a new language is mostly a matter of producing the right config rather than touching the agent loop.

The pattern across all three files is the same: state is plain text on disk, scoped to the working directory, readable with standard tools. That makes the whole agent debuggable with cat, jq, and git diff — which is what you want at 11pm when something has gone sideways.

Summary

To recap the questions and the answers:

  • Spec format? Markdown with a small set of required headers. Reject malformed input before any LLM call.
  • Which tools? Five: read_file, write_file, grep, bash, epistemic. Small surface area beats a sprawling toolbox.
  • Where does it run? Inside a Docker container, mandatory. No host fallback.
  • When is it done? Build exits 0, tests exit 0, and at least one test was actually collected.
  • What if it crashes? Every event is journaled to JSONL with fsync; resume just replays it.
  • How do you inspect a run? Read the same journal. Plus three explicit phases for skimmability.

The thing that surprised me most building this is how much of “the agent” is just plumbing: a JSON-serialisable journal, a Docker container with a bind mount, a deterministic exit check, a small tool list. The LLM is the easy part to integrate. The hard part is giving it a small, predictable world to act in, and being honest about when it’s done.

Source and a runnable example are in the repo: github.com/mkarots/spek.