What it is

An open contribution drive to publish permissioned coding-agent session traces (target: 20T tokens) via local redaction and public dataset uploads to Hugging Face.

Gabriel’s notes

Sybil Solutions’ 20T Session Data Drive is a public effort to aggregate permissioned coding-agent “sessions” (full work traces, not just code snippets) with a stated target of 20 trillion tokens. The site’s workflow centers on exporting local sessions (from tools like Pi, Codex, Claude Code, OpenCode, and Cursor), running local redaction, then publishing public dataset shards (e.g., on Hugging Face) with the right tags and dataset card details.

Quick take: This is the anti-benchmark benchmark. If you want coding agents that behave like they’ve actually shipped software (and not just passed toy tests), collecting real session traces is the obvious move—assuming you can do it without leaking secrets or your clients’ business.

The pitch, cleaned up: assemble 20T tokens of public, permissioned coding-agent sessions. That means no scraped private repos and no synthetic “tasks” pretending to be work, only real dev sessions showing tasks, mistakes, tool calls, edits, reviews, dead ends, fixes, and (crucially) the human intent around all of it.

What I like here is that it’s explicitly optimizing for the parts agents usually fail at in production: messy trajectories, wrong turns, tool friction, and recovery loops. Code shows what happened; sessions show how it happened and why.

I saved this under Research because the output isn’t a “better prompt” or a shiny new IDE—it’s training-grade evidence of how software actually gets built (warts and all).

Good fit if you want to:

Contribute open-source (or otherwise shareable) agent work traces that include tool use, tests, and iteration loops.
Build or evaluate coding agents on real workflows, not single-turn code completion.
Create datasets that preserve failures and recoveries (the expensive part of learning).
Standardize how you export, redact, format, and publish traces across multiple agent tools.
Pressure-test your redaction process before you ever hit “public upload.”

Pricing snapshot (auto-enriched):

No pricing is listed for the drive itself (it reads like an open contribution campaign). The recommended exporter (@0xsero/pi-brain) is MIT-licensed on GitHub. Unknown / not confirmed: whether there are incentives, bounties, or paid tiers tied to contribution volume.

Work-use / compliance snapshot (auto-enriched):

This is the part to take seriously. The rules are straightforward: only publish sessions you have the rights to share; run local redaction; then do a human review before any public upload. The “do not share” list includes secrets (keys/tokens/cookies/credentials), private customer work, proprietary repos, personal chats, and any third-party code/logs you can’t license. Net: treat this like open-source publication, not “AI telemetry.” If you wouldn’t paste it into a public GitHub repo, don’t upload it as a dataset shard.
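
To make the “redact locally, then review by hand” rule concrete, here is a minimal sketch of a pre-upload scan that flags lines worth a second look. Everything in it is an assumption for illustration: the pattern set and the names `SECRET_PATTERNS` and `find_possible_secrets` are mine, not the drive's redaction tooling or pi-brain's behavior. Regexes only catch the obvious shapes (keys, tokens, credentials); they will not spot private customer work or third-party code you can't license, so this supplements the human review and does not replace it.

```python
import re

# Illustrative patterns only (assumption): not the drive's actual redaction rules.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "bearer_token": re.compile(r"\bbearer\s+[A-Za-z0-9._~+/-]{20,}=*", re.IGNORECASE),
    "key_assignment": re.compile(
        r"\b(api[_-]?key|secret|password|token)\b\s*[:=]\s*['\"]?[^\s'\"]{8,}",
        re.IGNORECASE,
    ),
}


def find_possible_secrets(text):
    """Return (line_number, pattern_name) pairs for lines that look risky."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings


if __name__ == "__main__":
    # Hypothetical session excerpt; the key below is a made-up placeholder.
    sample = "export AWS_KEY=AKIAABCDEFGHIJKLMNOP\nprint('hello')\n"
    for line_no, name in find_possible_secrets(sample):
        print(f"line {line_no}: possible {name}")
```

An empty result only means nothing matched these particular patterns. It is a reason to keep reading the export, not a green light to upload.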

Alternatives (auto-enriched):

SessionFS: Focuses on capturing and replaying coding-agent sessions as a portable record for team/agent memory; better if you want internal reuse/audit without necessarily publishing a public dataset.
0xSero/ai-data-extraction: A broader “extract everything” toolkit across multiple assistants; useful for data portability, but you’ll still need to design your own redaction + public-sharing policy and pipeline.

Before you adopt it:

Start with a repo you’d happily open-source today—then export a handful of sessions and manually spot-check what “redacted” actually means in practice.
Write a one-page internal policy: what’s allowed, what’s forbidden, who approves publication, and how you handle accidental leakage.
Decide your licensing posture up front (a clear license on the dataset card matters if you want anyone to legally train on the data).
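
For the licensing point, a rough sketch of what “dataset card + license clarity” can look like in practice. Hugging Face dataset cards are a `README.md` with a YAML metadata block at the top, and `license`, `tags`, and `pretty_name` are standard fields there. The helper name `render_dataset_card`, the example tag, and the license choice are placeholders I made up; the tags the drive actually asks for are on its own site and are not reproduced here.

```python
import json


def render_dataset_card(pretty_name, license_id, tags, description):
    """Build README.md text with a YAML metadata header for a dataset repo.

    Assumes license_id and tags are simple identifiers with no YAML special
    characters; pretty_name is JSON-quoted, which is also valid YAML.
    """
    lines = [
        "---",
        f"pretty_name: {json.dumps(pretty_name)}",
        f"license: {license_id}",
        "tags:",
    ]
    lines += [f"- {tag}" for tag in tags]
    lines += ["---", "", f"# {pretty_name}", "", description, ""]
    return "\n".join(lines)


if __name__ == "__main__":
    card = render_dataset_card(
        pretty_name="My Agent Sessions",   # hypothetical dataset name
        license_id="mit",                  # placeholder; pick what you can actually grant
        tags=["example-tag"],              # placeholder, not the drive's required tags
        description="Redacted, human-reviewed coding-agent sessions from a repo I own.",
    )
    print(card)
```

Writing the license into the metadata block (and not only into the prose) is what lets downstream users filter and verify it without reading the whole card.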

Sources

https://training.sybilsolutions.ai/
https://training.sybilsolutions.ai/about.html
https://training.sybilsolutions.ai/rules.html
https://github.com/0xSero/pi-brain
https://sessionfs.dev/

Visit the resource

20T Session Data Drive (Sybil Solutions)