twin - turn your scattered agent sessions into a portable operating-profile
· 7 min read
ive been experimenting different approaches to bring together, the context across all my harnesses, which can be portable. the problem came from - i use hermes as my personal orchestrator (tried openclaw for a few months but took a shift - had to manually move the learnings to hermes), which uses claude code, codex as coding agents to do any task. mainly, 90% of my time spent at work is on claude code.
im sure most of the power users has a very similar setup.
ive realised, i have 3k+ sessions of these agents across the workspaces (both personal and work) and was wondering, whats the best way to gather all the learnings, memory, preferences, operating model, memory, etc.. i had, in these sessions into a single, maybe a SOUL.md - so, whenever i want to move to a new harness, i dont have to train them from scratch about me and how i want them to operate.
its not a data collector from all the sessions, its context engineering. the “pull” is engineering scattered lived data into context an agent can actually use.
north star - can i use this context artifact which has many facts on a project they’ve never touched
one more thing thats core to the design now - twin has zero runtime code. its not a tool the harness calls, its a playbook the harness executes. the harness already is an LLM with file tools sitting on top of all your data, so you dont ship it a parser, you ship it instructions: markdown skills + reference files. no api key, no model to pull, nothing runs from the repo.
the engine (the core pieces) #
harvest - no parsers anymore. a registry (
harnesses.md) tells the harness per harness: the sessions glob, which field holds the role, which field holds the text, plus a real sample line. that covers claude code, codex, hermes, gemini cli, aider, cursor, opencode, copilot - and the user-authored memory files of each (CLAUDE.md, GEMINI.md, .cursor/rules, windsurf memories..), which are the strongest signal there is when a lived session confirms them. for harnesses where the schema shifts across versions, theres a probe-first protocol: open one line, identify the fields yourself, or skip loudly. never guess silently.the gate: most of my sessions are workflows that run without my intervention, so a session is only included if it has more than 7 user messages - the user turns are where the signal lives, assistant verbosity says nothing about the person. secrets get scrubbed at read, and any fact that would carry one is dropped.
distill - pull out “how you operate” facts, from reading the whole trace of these sessions with each cited to verbatim quote from the transcript. it governs the climb from behaviour to principle, the three layers (mental model, operating habit, env) and eliminates the nouns from the facts (otherwise leads to context noise of the projects, which are not required). if the harness allows, the distill fans out as multiple parallel sub agents in batches.
each claim now carries a stable id, a condition for when it applies, the why behind it, and the cited evidence - appended to
claims.jsonl, the regenerable source of truth. and a yield rule i had to learn: a substantive session gives 1-5 facts, zero is valid. youre panning for gold, not summarizing.synthesize - cluster equivalent principles across all sessions, merge the duplicates. then a link pass draws edges between claims:
refines,depends_on, andconflicts_with- and a conflict is never averaged, it has to resolve into a conditioned pair (“skip tests when its a throwaway spike” / “require green CI on an existing system”). people arent inconsistent, theyre conditional.the categories the profile is grouped under are derived per person, not a fixed list - an earlier version forced engineering buckets and it means nothing for a pm or a cx agent or a salesperson. output is two files:
AGENTS.md(the portable operating model, 8-12 lines, installs global) andenvironment-ledger.md(your dated repos/tools, installs per-project, quarantined out of the model). the AGENTS.md also opens with a smallagent:preamble - so its an instruction to the agent reading it, not just a description of me.self-eval - the gate before it ships the context document.
i. portability % = fraction of operating model lines that pass the north star test (transferable, no proper nouns, etc..). computes it and reports the number.
ii. horoscope check - for every line: would the opposite be implausible for anyone? if yes its barnum filler (“values clear communication”) and gets sharpened or cut. day one should never be a horoscope.
deliver and install - do you want to deploy this operating model to global settings (.claude/CLAUDE.md for cc, and .codex/AGENTS.md for codex). its wrapped in markers so a re-run replaces the block instead of duplicating it. portable to any other harness you want to take it to.
the suite #
a profile built once decays into a horoscope, so twin is now four playbooks, not one:
profile- the full build aboveupdate- incremental refresh, distills only sessions newer than the last run and shows you the diff (never re-ingests its own output - that inbreeds the profile)audit- adversarial grading: re-opens the sources and verdicts every line (supported / overreach / unfounded), runs the barnum screen, finds unconditioned contradictions, gives a trust gradequery- “how do i usually handle x?” answered from the claims with citations and conditions, or it says plainly it doesnt know. other agents can use this to pull my operating context mid-task
shipped the whole package as plugin, which you install directly in your harness
steps:
# for claude code
/plugin marketplace add Hk669/twin
/plugin install twin@twin
/twin:profile # later: /twin:update, /twin:audit, /twin:query
# for codex
codex plugin marketplace add Hk669/twin
codex plugin add twin@twin
next steps: #
- compose the conditions into a decision tree an agent can walk, instead of reading every claim
- a
SessionEndhook that runs the update playbook on just the session that ended, and aquery_profileMCP tool as the always-on version of the query skill - build an eval system to validate and benchmark - three levels: extraction faithfulness, profile quality (barnum rate, portability), and downstream lift (does loading the profile measurably improve an agent on real tasks)
- more harness adapters (cline, roo, zed..) - each one is a few lines in the registry, no code
learnings: #
started off with actually building a server which can store all of these context pieces in an object store, which can be exposed as an MCP tool to any harness but realised for enterprise users, they wont be able to bring the context outside the workspace. so it has to independent for now, before i figure out the security around this.
thought i would introduce a system which would do the whole job, and you give the model provider key for tokens but then realised the harness they work on are already a killer for this, and spending extra money for them on this wouldnt encourage adoption.
v1 had a python harvester with a parser per harness. deleted all of it. the harness is already an LLM with file tools - shipping it a playbook beats shipping it a parser, and as the harnesses get more intelligent, instructions age better than code. the registry entry for a new harness is ~15 lines of markdown.
fixed categories were a trap. forcing a category per fact exploded into 50+ ad-hoc buckets, and an engineering taxonomy is meaningless for most people. derive a small set over the whole corpus, per person, and make every category hold more than one principle.
github repo - https://github.com/Hk669/twin, please feel free to contribute.
thanks!