diff --git a/README.md b/README.md index f5cfbc5..22e2e67 100644 --- a/README.md +++ b/README.md @@ -1,89 +1,79 @@ # sekft -Synthetic-trajectory generation for fine-tuning a model to operate a shell -as a self-directed citizen: land with **no imperative**, discover where -directives live, learn the provider from its own self-documentation, retrieve -the directives, execute them, and terminate (`exit` on success, `panic` when -genuinely blocked). +Fine-tune small open models to operate a POSIX shell as a self-directed citizen: +land with **no imperative**, discover where directives live, learn the provider +from its own self-documentation, do the work, and terminate (`exit` on success, +`panic` when genuinely blocked). -The dataset teaches a **mechanism, not a program**. Every axis of a scenario -is varied; only the four-step routine is held invariant: +sekft is the **training half**. The dataset and the synthetic-data factory live +in [`posix-sdc`](../posix-sdc) (`tiararodney.posix-sdc`), which this package +depends on. Here live the trainer, the behavioural evaluator, and the +resident-base harness. -1. **expect an announcement** of where directives are (motd / banner / env / file) -2. **understand the provider** via its self-documentation (`--help` / `man` / usage) -3. **retrieve** the directives -4. **execute**, then terminate +## Components -Bind the *convention* (there is an announcement at entry; tools are -self-documenting), free everything else. The model that learns this tolerates -an unstable userland because it re-learns the interface every session. +- **`sekft.sft`** (`sekft-train`) — supervised fine-tuner. Renders trajectories + with the tokenizer's own chat template and trains an **assistant-only** loss + mask (the commands plus the terminal token; environment turns masked to -100) + into a QLoRA adapter. Getting the mask wrong is the classic way to ruin a + shell-operator SFT, so it is the part tested hardest. +- **`sekft.eval`** (`sekft-eval`) — behavioural eval. Train loss says nothing + about whether the model operates the shell and leaves. This drops base + + adapter into held-out scenarios with no scaffold and reports the rates that + count: reach command-mode, terminate, checker passes. +- **`sekft.resident`** (`sekft-resident`) — resident-base harness. Loads the + 14 GB base once and keeps it hot, training and evaluating adapters without + reloading it (over OcuLink/PCIe the base transfer otherwise dominates every + run). -## Pipeline +## The render contract -``` -A. author generate.py model writes scenario bundles from the taxonomy - + ref-gate dashdocker.py run the bundle's own reference solution; admit only if its checker passes -B. rollout rollout.py scaffolded operator model acts in a fresh dash-in-docker container -C. verify rollout.py run the checker against container STATE (effect, not transcript) -D. record rollout.py strip the operator scaffold; save env<->action turns in deploy format -E. pairs [seam] rejects from B/C become DPO negatives against keepers from the same scenario -``` +The render the model trains on MUST equal what it is served with. The serving +harness (ccpty) sends structured `{role, content}` messages over the OpenAI +chat-completions protocol, so the endpoint applies the **model's own chat +template**. sekft therefore renders with `apply_chat_template`, after +`normalize_for_template` canonicalises each session: a leading `system` turn is +folded into the first `user` turn and consecutive same-role turns are merged, +because instruct templates such as Mistral's have no system role and require +strict user/assistant alternation. The same canonicalisation must run +serve-side, or train and serve diverge. -This repo implements **A-D** plus the execution backend (`dashdocker.py`). -Stage E (preference-pair assembly from the kept/rejected trajectories) is the -remaining seam; the rejects are already labelled by `outcome`/`keep`. +## Install -## Files - -- `taxonomy.py` - the axes of variation (task / provider / announcement / - doc-depth / difficulty) as pure data. No model, no container. -- `schema.py` - the `Scenario` bundle dataclasses + JSON (de)serialisation. -- `generate.py` - sample a combo, prompt a teacher model to author the bundle, - gate on the reference solution, write validated bundles to disk. -- `dashdocker.py` - the dash-in-Docker backend. `run(fixtures, script)` for the - one-shot reference gate; `session(fixtures)` for stateful rollouts, with - `Session.exec` (state-replayed), `.cwd()` (prompt building), `.check()` (Stage - C). Each command runs as its own `docker exec` (no tty buffering); cwd + - exported env are replayed between commands; `exit`/`panic` are intercepted as - terminals. -- `rollout.py` - Stage D. Rolls an operator model through a scenario in a fresh - container with only the disposable `SCAFFOLD`, records the turns - imperative-free (orientation + login + prompt/command/output, ending in the - terminal), verifies against final state, and classifies the outcome into a - `keep` decision. Multiple `--samples` per scenario for rejection sampling. -- `Dockerfile` - `sekft-dash`: alpine + dash, `/bin/sh` -> dash. - -## Run +The training paths only run on a CUDA host, so the GPU stack is an extra: ```sh -docker build -t sekft-dash . # the execution sandbox (once) - -SEKFT_MODEL=qwen2.5:32b \ # strong teacher via the litellm proxy -SEKFT_URL=http://localhost:4000/v1 \ -SEKFT_KEY=sk-litellm-dev \ - python generate.py --n 50 --out ./scenarios - -SEKFT_OP_MODEL=qwen2.5:32b \ # operator (teacher in round 1, student in STaR) - python rollout.py --scenarios ./scenarios --out ./trajectories --samples 3 +pipenv install # editable sekft + the local editable posix-sdc +pipenv install -e '.[gpu]' # torch / transformers / peft / datasets, on the box ``` -`rollout.py` writes one JSON per (scenario, sample) with the recorded turns and -a `keep` flag. The keepers are the SFT set; the rejects (labelled by `outcome`) -are Stage E's DPO negatives. Both stages run the model through the litellm -proxy; the rollout's container work is CPU/disk only. +`pyproject.toml` declares `tiararodney.posix-sdc` abstractly; the `Pipfile` +overrides it with the local editable `../posix-sdc` for side-by-side development. -When the `sekft-dash` image is present, `generate.py` runs each bundle's -reference solution in a fresh container and admits it only if its checker then -passes (real solvability gate). Without the image it falls back to a -**structural** dry-run that proves consistency, not solvability (`--no-docker` -forces this). The backend is verified end-to-end: `python dashdocker.py` runs a -self-test (fixtures, cwd/env replay, terminals). +## Use (on the GPU box) -## Non-negotiables (or the data rots) +```sh +# fine-tune an adapter on the posix-sdc trajectories +sekft-train --data ./trajectories --base mistralai/Mistral-7B-Instruct-v0.2 \ + --out ./ckpt --load-4bit -- **Reference-solution gate is mandatory** once the runner exists: never admit - a scenario whose own checker its reference solution cannot pass. -- **Verify effect, not claim**: the checker inspects container state. -- **Strip teacher prose** from recorded assistant turns (Stage D). -- **Balance terminals**: enough `empty-queue` and `blocked -> panic` scenarios - or the student learns "always exit success". +# inspect the assistant-only loss mask without training (runs anywhere) +sekft-train --data ./trajectories --base --inspect + +# behavioural eval on held-out scenario bundles (worlds, not trajectories) +sekft-eval --base --adapter ./ckpt --scenarios ./holdout --n 16 + +# resident loop: load the base once, cycle adapters without reloading it +sekft-resident --base --load-4bit +``` + +The eval consumes held-out **scenario bundles** from posix-sdc (it stands up and +verifies each in a fresh container), not trajectories. + +## Result + +Fine-tuning `mistralai/Mistral-7B-Instruct-v0.2` on the posix-sdc data lifted +clean termination on archetype-level held-out scenarios from **0/16 (base) to +9/16 (tuned)**: the operate-and-terminate mechanism generalised to unseen task +types, while task competence stayed archetype-local. See the experiment +[*From seed to weights*](https://blog.tiararodney.com/projects/2026/semantic-execution-kernel/experiments/from-seed-to-weights/). diff --git a/TODO b/TODO index f6f4c17..304309f 100644 --- a/TODO +++ b/TODO @@ -124,7 +124,7 @@ Content-Type: application/issue ID: 8 Type: feature Title: Refresh docs for the packaged trainer -Status: in-progress +Status: done Priority: medium Created: 2026-06-16 Module: sekft diff --git a/src/tiararodney/sekft/eval.py b/src/tiararodney/sekft/eval.py index b381385..4438134 100644 --- a/src/tiararodney/sekft/eval.py +++ b/src/tiararodney/sekft/eval.py @@ -6,14 +6,15 @@ scenarios with NO scaffold (the trained behaviour must stand on its own), and reports the rates that count: does it reach command-mode, does it terminate, does the checker pass. - python eval.py --base --adapter ./ckpt-mistral-r16 \ - --scenarios ./holdout-scenarios --n 10 + sekft-eval --base --adapter ./ckpt-mistral-r16 \ + --scenarios ./holdout-scenarios --n 10 -Reuses the rollout loop with a *local* operator: the model formats and -generates in the same role-delimited render it was trained on (train == eval == -deploy, or the prompts go out of distribution). Prerequisites on the box: torch -+ transformers + peft, the ``sekft-dash`` image, and held-out SCENARIO bundles -(from ``generate.py`` -- not trajectories; the eval stands up and verifies each). +Reuses the posix-sdc rollout loop with a *local* operator: the model renders and +generates with the same chat template it was trained on (train == eval == serve, +via ``apply_chat_template`` + ``normalize_for_template``, or the prompts go out +of distribution). Prerequisites on the box: torch + transformers + peft, the +``sekft-dash`` image, and held-out SCENARIO bundles from the posix-sdc factory +(not trajectories; the eval stands up and verifies each). """ from __future__ import annotations diff --git a/src/tiararodney/sekft/resident.py b/src/tiararodney/sekft/resident.py index c56979b..c295d48 100644 --- a/src/tiararodney/sekft/resident.py +++ b/src/tiararodney/sekft/resident.py @@ -8,14 +8,14 @@ fresh LoRA adapter on the resident base and ``unload``s it back to clean; each Interactive (IPython on the GPU box) is the intended use: - from resident import Resident + from tiararodney.sekft.resident import Resident r = Resident("~/llm-models/mistral-7b-instruct-v0.2", load_4bit=True) r.fit("~/sekft/trajectories", "~/sekft/ckpt-a", lora_r=16, lr=2e-4, epochs=3) r.evaluate("~/sekft/ckpt-a", "~/sekft/holdout", n=10) r.fit("~/sekft/trajectories", "~/sekft/ckpt-b", lora_r=32) # NO base reload -Or `python resident.py --base --selftest-data ` to prove the -base loads once and two adapters train against it. +Or `sekft-resident --base --selftest-data ` to prove the base +loads once and two adapters train against it. """ from __future__ import annotations diff --git a/src/tiararodney/sekft/sft.py b/src/tiararodney/sekft/sft.py index db5df3b..5ac8633 100644 --- a/src/tiararodney/sekft/sft.py +++ b/src/tiararodney/sekft/sft.py @@ -17,8 +17,8 @@ system role and require strict user/assistant alternation. That same canonicalisation must run on the serving side. Everything else is standard causal-LM SFT with an assistant-only loss mask. - python sft.py --data ./trajectories --base --out ./ckpt - python sft.py --data ./trajectories --base --inspect # mask stats, no training + sekft-train --data ./trajectories --base --out ./ckpt + sekft-train --data ./trajectories --base --inspect # mask stats, no training Training needs torch + transformers + peft (a GPU box). ``--inspect`` and the normalize/mask helpers run anywhere a tokenizer with a chat template is