diff --git a/TODO b/TODO
index f4392bf..981dfa7 100644
--- a/TODO
+++ b/TODO
@@ -191,3 +191,26 @@ Description: operate_rate computes sum(t.steps > 0 and t.meta.get('clean') for t
  and resident.py:157. Wrap the predicate in bool() so it counts
  trajectories that operated and are clean, fixing both the type error
  and the latent crash.
+
+--ISSUE
+Content-Type: application/issue
+ID: 12
+Type: feature
+Title: load training data from a raw dir, a curated jsonl, or the Hub
+Status: open
+Priority: medium
+Created: 2026-06-17
+Module: sekft
+Relationships:
+Description: iter_keepers reads only raw per-trajectory .json - one of three
+ input shapes the trainer should accept. Add load_turns(data, hub,
+ revision) that yields assistant-bearing turns from: a directory of
+ raw rollout .json (keep-filtered, today's iter_keepers); a curated
+ .jsonl corpus file (already keep-filtered, yield turns per line);
+ or the published corpus via posix-sdc's load_trajectories (local
+ data/ in a checkout, else the Hub). sekft-train gains --hub and
+ --revision; --data dispatches by dir-vs-.jsonl. Raw-rollout reading
+ stays sekft-local; curated+Hub reuse posix-sdc's loader (imported
+ lazily so the trainer needs neither posix-sdc nor huggingface_hub
+ for the raw/jsonl paths). Unit tests for the raw-dir and jsonl
+ dispatch.
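
The dispatch described in issue 12 could look roughly like the sketch below. It is an illustration under stated assumptions, not the sekft implementation: the record schema (a "keep" flag and a "turns" list whose items carry a "role") is hypothetical, as are the `posix_sdc` module path and the keyword arguments passed to load_trajectories. Only the names load_turns, load_trajectories, and the dir-vs-.jsonl rule come from the issue text.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, Optional

Turn = Dict[str, Any]


def _assistant_turns(record: Dict[str, Any]) -> Iterator[Turn]:
    # Assumed schema: each record has a "turns" list of dicts with a "role" key.
    for turn in record.get("turns", []):
        if turn.get("role") == "assistant":
            yield turn


def load_turns(
    data: Optional[str] = None,
    hub: Optional[str] = None,
    revision: Optional[str] = None,
) -> Iterator[Turn]:
    """Yield assistant turns from a raw dir, a curated .jsonl, or the Hub."""
    if data is None:
        # Lazy import: the raw and jsonl paths below never need posix-sdc or
        # huggingface_hub. Module path and call signature are assumptions.
        from posix_sdc import load_trajectories

        for record in load_trajectories(hub, revision=revision):
            yield from _assistant_turns(record)
        return

    path = Path(data)
    if path.is_dir():
        # Raw rollouts: one .json per trajectory, filtered on an assumed
        # "keep" flag (standing in for what iter_keepers does today).
        for file in sorted(path.glob("*.json")):
            record = json.loads(file.read_text(encoding="utf-8"))
            if record.get("keep"):
                yield from _assistant_turns(record)
    elif path.suffix == ".jsonl":
        # Curated corpus: already keep-filtered, one trajectory per line.
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield from _assistant_turns(json.loads(line))
    else:
        # Because this is a generator, the error surfaces on first iteration.
        raise ValueError(f"--data must be a directory or a .jsonl file: {data}")
```

Because load_turns is a generator, nothing is read until the caller iterates, so a trainer can stream turns without holding the whole corpus in memory. The same laziness means a bad --data value is reported only when iteration starts, which a CLI wrapper may want to check up front.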