First, the honest framing: we are not on the Terminal-Bench leaderboard yet. We're working toward a verified, official submission — and until we get there, we're running our own controlled comparisons and sharing them transparently. This post is our first one: Atlarix against opencode on Terminal-Bench 2.0 (89 hard, multi-step terminal tasks), holding the model, provider, and infrastructure constant so the harness is the only variable. It is preliminary, single-attempt, and — to be very clear up front — it does not prove Atlarix is ahead of anyone.

Harness	Resolved	Score
Atlarix	42 / 89	47%
opencode	39 / 89	44%

What we're NOT claiming

In our single run, Atlarix resolved 42 of 89 tasks and opencode resolved 39, three more for Atlarix. We are deliberately not calling that a win. The official Terminal-Bench leaderboard requires five attempts per task (k=5); we ran one (k=1). A 3-task difference from a single attempt is well within run-to-run noise, so it establishes nothing about who is "ahead." The most we read into it: on this model, Atlarix is competitive with a leading harness. That's the whole claim.
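To put a rough number on "within noise", here is a minimal sketch of a back-of-the-envelope check. It assumes each task is an independent pass/fail trial and treats the two runs as independent samples; both are simplifications, since the harnesses ran the same 89 tasks. The function name is ours, and this is not a substitute for a k=5 run.

```python
import math

TOTAL_TASKS = 89
ATLARIX_RESOLVED = 42
OPENCODE_RESOLVED = 39


def score_standard_error(resolved: int, total: int) -> float:
    """Binomial standard error of a pass rate (assumes independent tasks)."""
    p = resolved / total
    return math.sqrt(p * (1 - p) / total)


se_atlarix = score_standard_error(ATLARIX_RESOLVED, TOTAL_TASKS)
se_opencode = score_standard_error(OPENCODE_RESOLVED, TOTAL_TASKS)

# Observed gap between the two scores, as a fraction of all tasks.
gap = (ATLARIX_RESOLVED - OPENCODE_RESOLVED) / TOTAL_TASKS

# Standard error of the gap, assuming the two runs are independent samples.
se_gap = math.sqrt(se_atlarix**2 + se_opencode**2)

print(f"gap = {gap:.1%}, standard error of gap = {se_gap:.1%}")
print(f"gap is {gap / se_gap:.2f} standard errors wide")
```

Under these assumptions the standard error is roughly 5 percentage points on each score and roughly 7 on the gap, so a gap of about 3 points is less than half a standard error. A paired analysis on the per-task results would be the more careful version of this check.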

Why even a preliminary run is worth something

Atlarix is a harness — the model supplies the intelligence, so the question that matters is whether the harness lets an open model express its full ability. To test that fairly, the harness has to be the only thing that changes. So we pinned everything else: both ran MiniMax-M3 routed to a single fixed provider at fp8 (identical weights and precision per request), on the same container infrastructure (Modal), with native function-calling forced on each side and matched per-task timeouts. Atlarix ran through its real agent loop — an Electron-free build of the same code that ships in the app, not a benchmark-only reimplementation. Under those identical conditions, the open model performed about as well under Atlarix as under opencode. That's the signal: the harness isn't getting in the model's way.

Full disclosure: the one change we made to participate

In the Atlarix desktop app, the agent asks for your approval before every file write and command, a core safety feature. A benchmark runs unattended, with nobody to approve anything, so to run at all we grant that approval once, up front, via an explicit operator flag (ATLARIX_AUTONOMOUS_DANGER=1). Without it, every task that needs an install, a cleanup, or a privileged command is simply blocked and fails. This is the only deviation from the app's default behavior, and it is not an edge over the other harness: every agent auto-approves to run an automated benchmark, because that is inherent to running unattended. The flag is off by default; the interactive app always asks. We're stating it plainly so the setup is fully transparent.
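For readers who want a mental model of that gate, the sketch below is purely illustrative and is not Atlarix's actual code. Only the flag name comes from this post; the function names, the `ask_user` callback, and the rule that only the exact value "1" enables unattended mode are assumptions.

```python
from typing import Callable, Mapping

AUTONOMOUS_FLAG = "ATLARIX_AUTONOMOUS_DANGER"  # flag name taken from the post


def is_autonomous(env: Mapping[str, str]) -> bool:
    """Assumption: only the exact value "1" turns on unattended mode."""
    return env.get(AUTONOMOUS_FLAG) == "1"


def approve(action: str, env: Mapping[str, str],
            ask_user: Callable[[str], bool]) -> bool:
    """Hypothetical approval gate checked before a file write or command."""
    if is_autonomous(env):
        # Operator granted approval once, up front; nobody is there to ask.
        return True
    # Default path: the interactive app always asks.
    return ask_user(action)


def deny_everything(action: str) -> bool:
    """Stand-in for a user who is not present and so never approves."""
    return False


blocked = approve("apt-get install -y curl", {}, deny_everything)
allowed = approve("apt-get install -y curl", {AUTONOMOUS_FLAG: "1"}, deny_everything)
print(blocked, allowed)
```

With no flag set and nobody to answer the prompt, the action is refused, which is why an unattended benchmark run cannot proceed without the flag.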

Reproduce it — and check the raw files

Nothing here is hand-typed. The full Harbor job results (per-task pass/fail for both harnesses) are published on our benchmark page for download. The Atlarix bundle we ran is public — the same Electron-free headless build, available as a release tarball — and the benchmark runs on the open-source Harbor framework. We used the terminal-bench-2 dataset (all 89 tasks), single attempt, native timeout, on Modal, with the model pinned to one fp8 provider for both harnesses. The exact commands and download links are on the benchmark page.
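Once the per-task results are downloaded, one useful check is to see which tasks only one harness resolved. The sketch below assumes each run has already been loaded into a mapping from task name to pass/fail. That format is our assumption, not the layout of the Harbor job files, and the task names are made-up placeholders rather than real results.

```python
from typing import Dict, List, Mapping


def compare_runs(a: Mapping[str, bool], b: Mapping[str, bool]) -> Dict[str, List[str]]:
    """Bucket the tasks present in both runs by which run resolved them."""
    shared = sorted(a.keys() & b.keys())
    return {
        "both": [t for t in shared if a[t] and b[t]],
        "only_a": [t for t in shared if a[t] and not b[t]],
        "only_b": [t for t in shared if b[t] and not a[t]],
        "neither": [t for t in shared if not a[t] and not b[t]],
    }


# Placeholder data for illustration only; not taken from the published runs.
run_a = {"task-1": True, "task-2": True, "task-3": False, "task-4": False}
run_b = {"task-1": True, "task-2": False, "task-3": True, "task-4": False}

buckets = compare_runs(run_a, run_b)
for name, tasks in buckets.items():
    print(f"{name}: {len(tasks)} {tasks}")
```

The "only_a" and "only_b" buckets are the interesting ones: two harnesses can land on similar totals while resolving noticeably different sets of tasks.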

Where we are, and what's next

We're at the start, not the finish. The goal is a full five-attempt (k=5) run for a verified Terminal-Bench leaderboard submission. In the meantime we're running the same head-to-head on more open models (DeepSeek, Kimi, Qwen) so nothing rests on a single model, and we'll add other benchmarks (SWE-bench and beyond). Every result will be published the same way: same model, same infrastructure, raw files attached, honest framing — including when a result doesn't go our way.

We're not on the leaderboard yet, and a single-attempt run proves nothing about being ahead. What it does suggest is that Atlarix is in the same class as a leading harness on this open model — the harness doesn't hold the model back. We'll keep testing, keep showing our work, and keep pushing toward the official, verified result.

Our First Terminal-Bench Test: Atlarix vs opencode (Preliminary)