_render_ids extracts input_ids from a BatchEncoding (5.x) or passes a list through (4.x); regression test asserts the BatchEncoding path yields the same mask; 10 tests pass; mypy strict clean. End-to-end box verification of the correct mask against Mistral done before this release. No submodule changes.
273 lines
9.6 KiB
Text
273 lines
9.6 KiB
Text
--ISSUE
|
|
Content-Type: application/sprints
|
|
Sprints:
|
|
|
|
--ISSUE
|
|
Content-Type: application/modules
|
|
Modules:
|
|
- Name: sekft
|
|
Path: .
|
|
|
|
--ISSUE
|
|
Content-Type: application/bugzilla
|
|
URL: https://bugs.code.tiararodney.com/rest
|
|
Mappings:
|
|
- Module: sekft
|
|
Product: sek
|
|
Component: sekft
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 1
|
|
Type: feature
|
|
Title: Package sekft as an installable namespace package
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-16
|
|
Module: sekft
|
|
Relationships:
|
|
Description: Turn the flat trainer scripts into an installable tiararodney.sekft
|
|
namespace package: src layout, pyproject with the abstract
|
|
posix-sdc dependency and an optional gpu extra, console scripts, a
|
|
Pipfile pinning posix-sdc as a local editable override, and tox
|
|
environments.
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 2
|
|
Type: feature
|
|
Title: SFT trainer with chat-template render and assistant-only mask
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-16
|
|
Module: sekft
|
|
Relationships:
|
|
Description: Add the supervised fine-tuner: render trajectories through the
|
|
tokenizer's own chat template (matching serving), canonicalise
|
|
turns (fold system, merge consecutive), derive an assistant-only
|
|
loss mask by token-prefix differencing, and train a QLoRA adapter.
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 3
|
|
Type: feature
|
|
Title: Behavioural evaluator
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-16
|
|
Module: sekft
|
|
Relationships:
|
|
Description: Add the behavioural eval: load base plus LoRA adapter, drop it into
|
|
held-out scenarios with no scaffold, drive them through a local
|
|
operator that renders with the model's chat template, and report
|
|
reach/terminate/checker rates.
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 4
|
|
Type: feature
|
|
Title: Resident-base train/eval harness
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-16
|
|
Module: sekft
|
|
Relationships:
|
|
Description: Add the resident harness that loads the 14GB base once and keeps it
|
|
hot, training fresh LoRA adapters and evaluating them without
|
|
reloading the base, for the slow-OcuLink iterate loop.
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 5
|
|
Type: feature
|
|
Title: Pipeline overview README
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-16
|
|
Module: sekft
|
|
Relationships:
|
|
Description: Document the sekft pipeline: the trainer, evaluator, and resident
|
|
harness; how they consume the posix-sdc dataset; the render
|
|
contract; and how to run on the GPU box.
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 6
|
|
Type: feature
|
|
Title: Test suite: unit and smoke
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-16
|
|
Module: sekft
|
|
Relationships:
|
|
Description: Add a pytest suite: torch-free unit tests for the render
|
|
canonicalisation and assistant-only mask (fake tokenizer), and
|
|
smoke tests that the console entry points respond to --help without
|
|
the GPU stack.
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 7
|
|
Type: feature
|
|
Title: Add GPL-2.0 license and drop the relocated Dockerfile
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-16
|
|
Module: sekft
|
|
Relationships:
|
|
Description: License sekft under GPL-2.0 (canonical text plus pyproject
|
|
metadata) and remove the dash Dockerfile, which now lives in
|
|
posix-sdc under docker/alpine-dash.
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 8
|
|
Type: feature
|
|
Title: Refresh docs for the packaged trainer
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-16
|
|
Module: sekft
|
|
Relationships:
|
|
Description: The README still describes sekft as the data factory
|
|
(generate/rollout/dashdocker/taxonomy/schema), which all moved to
|
|
posix-sdc. Rewrite it as the trainer (sft/eval/resident) that
|
|
consumes posix-sdc, and update the module docstrings to
|
|
console-script invocations and the chat-template render contract.
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 9
|
|
Type: feature
|
|
Title: Type-check the package under mypy strict
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-17
|
|
Module: sekft
|
|
Relationships:
|
|
Description: Make the lint env honestly pass: add mypy as a dev dependency,
|
|
ignore_missing_imports for the ML libs, fully annotate
|
|
eval/resident/sft (including the inner operator callables), and
|
|
ship a py.typed marker so the Typing::Typed claim is real.
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 10
|
|
Type: feature
|
|
Title: structured logging for the trainer (sft)
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-17
|
|
Module: sekft
|
|
Relationships:
|
|
Description: The trainer is nearly silent: outside an example count and a save
|
|
line it prints nothing through tokenizer load, the ~14GB base-model
|
|
load, example building, and the whole training loop, and
|
|
trajectories dropped for exceeding --max-len or having an empty
|
|
loss mask vanish without a trace. Add a small shared logging setup
|
|
(_log.py, stderr so stdout stays clean for results) and a module
|
|
logger; give sekft-train -v/--verbose and -q/--quiet. Log the run
|
|
config and each phase, report dataset accounting (keepers ->
|
|
usable, with counts dropped for length / empty-mask and a warning
|
|
when any are dropped), and raise transformers' verbosity during
|
|
training so the per-step curve shows. Apply to train() and
|
|
inspect().
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 11
|
|
Type: bugfix
|
|
Title: operate_rate can sum a None (eval + resident)
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-17
|
|
Module: sekft
|
|
Relationships:
|
|
Description: operate_rate computes sum(t.steps > 0 and t.meta.get('clean') for t
|
|
in rows). The 'and' yields the right operand when steps>0, so if
|
|
meta lacks the 'clean' key it yields None and sum() raises
|
|
TypeError at runtime; mypy (now that posix-sdc ships py.typed and
|
|
Trajectory is typed) flags the generator item type in eval.py:83
|
|
and resident.py:157. Wrap the predicate in bool() so it counts
|
|
trajectories that operated and are clean, fixing both the type
|
|
error and the latent crash.
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 12
|
|
Type: feature
|
|
Title: load training data from a raw dir, a curated jsonl, or the Hub
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-17
|
|
Module: sekft
|
|
Relationships:
|
|
Description: iter_keepers reads only raw per-trajectory .json - one of three
|
|
input shapes the trainer should accept. Add load_turns(data, hub,
|
|
revision) that yields assistant-bearing turns from: a directory of
|
|
raw rollout .json (keep-filtered, today's iter_keepers); a curated
|
|
.jsonl corpus file (already keep-filtered, yield turns per line);
|
|
or the published corpus via posix-sdc's load_trajectories (local
|
|
data/ in a checkout, else the Hub). sekft-train gains --hub and
|
|
--revision; --data dispatches by dir-vs-.jsonl. Raw-rollout reading
|
|
stays sekft-local; curated+Hub reuse posix-sdc's loader (imported
|
|
lazily so the trainer needs neither posix-sdc nor huggingface_hub
|
|
for the raw/jsonl paths). Unit tests for the raw-dir and jsonl
|
|
dispatch.
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 13
|
|
Type: feature
|
|
Title: reference posix-sdc three ways for seamless multi-machine dev
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-17
|
|
Module: sekft
|
|
Relationships:
|
|
Description: Wire the posix-sdc dependency as a triplet: the abstract
|
|
posix-sdc[hub] in pyproject (so the trainer's --hub path can reach
|
|
the Hub via huggingface_hub); the published wheel from the private
|
|
index in Pipfile [packages]; the git develop branch in Pipfile
|
|
[dev-packages] for develop-time. Commit Pipfile.lock so the
|
|
dependency surface and lock land together.
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 14
|
|
Type: bugfix
|
|
Title: refresh Pipfile.lock against published posix-sdc 1.2.2
|
|
Status: done
|
|
Priority: medium
|
|
Created: 2026-06-17
|
|
Module: sekft
|
|
Relationships:
|
|
Description: The lock committed with the triplet (#13) predated the published
|
|
posix-sdc 1.2.2 wheel, so it could not pin the real [hub] closure.
|
|
Now that 1.2.2 is on the private index, re-lock: posix-sdc resolves
|
|
to ==1.2.2 from the index and the [hub] extra pulls huggingface_hub
|
|
and its transitive deps into the lock. Commit the refreshed
|
|
Pipfile.lock so the next machine installs the published wheel with
|
|
the Hub path available.
|
|
|
|
--ISSUE
|
|
Content-Type: application/issue
|
|
ID: 15
|
|
Type: bugfix
|
|
Title: apply_chat_template returns BatchEncoding on transformers 5.x
|
|
Status: done
|
|
Priority: high
|
|
Created: 2026-06-18
|
|
Module: sekft
|
|
Relationships:
|
|
Description: build_masked_example assumed apply_chat_template returns a flat
|
|
list[int] (transformers 4.x). On transformers 5.x it returns a
|
|
BatchEncoding ({input_ids: [...]}), so ids was a dict, len(ids) was
|
|
the key count, and the prefix-differencing spuriously raised 'chat
|
|
template is not additive' on every real model (verified against
|
|
mistralai/Mistral-7B-Instruct-v0.2). The masking logic is sound and
|
|
the Mistral template is additive; only the return type needs
|
|
normalising. Add a _render_ids helper that extracts input_ids when
|
|
the result is dict-like, and use it for both renders. The
|
|
fake-tokenizer test returned a bare list and missed this, so add a
|
|
BatchEncoding-returning fake and assert the mask matches.
|