docs: rewrite README for the packaged trainer

2026-06-16 23:49:01 +02:00 · 2026-06-16 23:49:01 +02:00 · a0b1fbc0c1
commit a0b1fbc0c1
parent 9cdb2bdc97
1 changed files with 63 additions and 73 deletions
--- a/README.md
+++ b/README.md
@ -1,89 +1,79 @@
 # sekft

-Synthetic-trajectory generation for fine-tuning a model to operate a shell
-as a self-directed citizen: land with **no imperative**, discover where
-directives live, learn the provider from its own self-documentation, retrieve
-the directives, execute them, and terminate (`exit` on success, `panic` when
-genuinely blocked).
+Fine-tune small open models to operate a POSIX shell as a self-directed citizen:
+land with **no imperative**, discover where directives live, learn the provider
+from its own self-documentation, do the work, and terminate (`exit` on success,
+`panic` when genuinely blocked).

-The dataset teaches a **mechanism, not a program**. Every axis of a scenario
-is varied; only the four-step routine is held invariant:
+sekft is the **training half**. The dataset and the synthetic-data factory live
+in [`posix-sdc`](../posix-sdc) (`tiararodney.posix-sdc`), which this package
+depends on. Here live the trainer, the behavioural evaluator, and the
+resident-base harness.

-1. **expect an announcement** of where directives are (motd / banner / env / file)
-2. **understand the provider** via its self-documentation (`--help` / `man` / usage)
-3. **retrieve** the directives
-4. **execute**, then terminate
+## Components

-Bind the *convention* (there is an announcement at entry; tools are
-self-documenting), free everything else. The model that learns this tolerates
-an unstable userland because it re-learns the interface every session.
+- **`sekft.sft`** (`sekft-train`) — supervised fine-tuner. Renders trajectories
+  with the tokenizer's own chat template and trains an **assistant-only** loss
+  mask (the commands plus the terminal token; environment turns masked to -100)
+  into a QLoRA adapter. Getting the mask wrong is the classic way to ruin a
+  shell-operator SFT, so it is the part tested hardest.
+- **`sekft.eval`** (`sekft-eval`) — behavioural eval. Train loss says nothing
+  about whether the model operates the shell and leaves. This drops base +
+  adapter into held-out scenarios with no scaffold and reports the rates that
+  count: reach command-mode, terminate, checker passes.
+- **`sekft.resident`** (`sekft-resident`) — resident-base harness. Loads the
+  14 GB base once and keeps it hot, training and evaluating adapters without
+  reloading it (over OcuLink/PCIe the base transfer otherwise dominates every
+  run).

-## Pipeline
+## The render contract

-```
-A. author      generate.py     model writes scenario bundles from the taxonomy
-   + ref-gate   dashdocker.py   run the bundle's own reference solution; admit only if its checker passes
-B. rollout     rollout.py       scaffolded operator model acts in a fresh dash-in-docker container
-C. verify      rollout.py       run the checker against container STATE (effect, not transcript)
-D. record      rollout.py       strip the operator scaffold; save env<->action turns in deploy format
-E. pairs       [seam]           rejects from B/C become DPO negatives against keepers from the same scenario
-```
+The render the model trains on MUST equal what it is served with. The serving
+harness (ccpty) sends structured `{role, content}` messages over the OpenAI
+chat-completions protocol, so the endpoint applies the **model's own chat
+template**. sekft therefore renders with `apply_chat_template`, after
+`normalize_for_template` canonicalises each session: a leading `system` turn is
+folded into the first `user` turn and consecutive same-role turns are merged,
+because instruct templates such as Mistral's have no system role and require
+strict user/assistant alternation. The same canonicalisation must run
+serve-side, or train and serve diverge.

-This repo implements **A-D** plus the execution backend (`dashdocker.py`).
-Stage E (preference-pair assembly from the kept/rejected trajectories) is the
-remaining seam; the rejects are already labelled by `outcome`/`keep`.
+## Install

-## Files
-
- `taxonomy.py` - the axes of variation (task / provider / announcement /
-  doc-depth / difficulty) as pure data. No model, no container.
- `schema.py` - the `Scenario` bundle dataclasses + JSON (de)serialisation.
- `generate.py` - sample a combo, prompt a teacher model to author the bundle,
-  gate on the reference solution, write validated bundles to disk.
- `dashdocker.py` - the dash-in-Docker backend. `run(fixtures, script)` for the
-  one-shot reference gate; `session(fixtures)` for stateful rollouts, with
-  `Session.exec` (state-replayed), `.cwd()` (prompt building), `.check()` (Stage
-  C). Each command runs as its own `docker exec` (no tty buffering); cwd +
-  exported env are replayed between commands; `exit`/`panic` are intercepted as
-  terminals.
- `rollout.py` - Stage D. Rolls an operator model through a scenario in a fresh
-  container with only the disposable `SCAFFOLD`, records the turns
-  imperative-free (orientation + login + prompt/command/output, ending in the
-  terminal), verifies against final state, and classifies the outcome into a
-  `keep` decision. Multiple `--samples` per scenario for rejection sampling.
- `Dockerfile` - `sekft-dash`: alpine + dash, `/bin/sh` -> dash.
-
-## Run
+The training paths only run on a CUDA host, so the GPU stack is an extra:

 ```sh
-docker build -t sekft-dash .              # the execution sandbox (once)
-
-SEKFT_MODEL=qwen2.5:32b \                  # strong teacher via the litellm proxy
-SEKFT_URL=http://localhost:4000/v1 \
-SEKFT_KEY=sk-litellm-dev \
-  python generate.py --n 50 --out ./scenarios
-
-SEKFT_OP_MODEL=qwen2.5:32b \              # operator (teacher in round 1, student in STaR)
-  python rollout.py --scenarios ./scenarios --out ./trajectories --samples 3
+pipenv install              # editable sekft + the local editable posix-sdc
+pipenv install -e '.[gpu]'  # torch / transformers / peft / datasets, on the box
 ```

-`rollout.py` writes one JSON per (scenario, sample) with the recorded turns and
-a `keep` flag. The keepers are the SFT set; the rejects (labelled by `outcome`)
-are Stage E's DPO negatives. Both stages run the model through the litellm
-proxy; the rollout's container work is CPU/disk only.
+`pyproject.toml` declares `tiararodney.posix-sdc` abstractly; the `Pipfile`
+overrides it with the local editable `../posix-sdc` for side-by-side development.

-When the `sekft-dash` image is present, `generate.py` runs each bundle's
-reference solution in a fresh container and admits it only if its checker then
-passes (real solvability gate). Without the image it falls back to a
-**structural** dry-run that proves consistency, not solvability (`--no-docker`
-forces this). The backend is verified end-to-end: `python dashdocker.py` runs a
-self-test (fixtures, cwd/env replay, terminals).
+## Use (on the GPU box)

-## Non-negotiables (or the data rots)
+```sh
+# fine-tune an adapter on the posix-sdc trajectories
+sekft-train --data ./trajectories --base mistralai/Mistral-7B-Instruct-v0.2 \
+            --out ./ckpt --load-4bit

- **Reference-solution gate is mandatory** once the runner exists: never admit
-  a scenario whose own checker its reference solution cannot pass.
- **Verify effect, not claim**: the checker inspects container state.
- **Strip teacher prose** from recorded assistant turns (Stage D).
- **Balance terminals**: enough `empty-queue` and `blocked -> panic` scenarios
-  or the student learns "always exit success".
+# inspect the assistant-only loss mask without training (runs anywhere)
+sekft-train --data ./trajectories --base <dir> --inspect
+
+# behavioural eval on held-out scenario bundles (worlds, not trajectories)
+sekft-eval --base <dir> --adapter ./ckpt --scenarios ./holdout --n 16
+
+# resident loop: load the base once, cycle adapters without reloading it
+sekft-resident --base <dir> --load-4bit
+```
+
+The eval consumes held-out **scenario bundles** from posix-sdc (it stands up and
+verifies each in a fresh container), not trajectories.
+
+## Result
+
+Fine-tuning `mistralai/Mistral-7B-Instruct-v0.2` on the posix-sdc data lifted
+clean termination on archetype-level held-out scenarios from **0/16 (base) to
+9/16 (tuned)**: the operate-and-terminate mechanism generalised to unseen task
+types, while task competence stayed archetype-local. See the experiment
+[*From seed to weights*](https://blog.tiararodney.com/projects/2026/semantic-execution-kernel/experiments/from-seed-to-weights/).