The Harness Matters More Than the Model
Extract Data Podcast, Episode 7 — Reflections on models, system prompts, and the infrastructure layer nobody talks about enough
Apple podcasts link - Direct listen
There's a phrase from this week's conversation that I keep coming back to: "flirt with all the models, but marry one harness." It came from Ayan, and it probably captures the single most important practical takeaway from our discussion better than anything else I could say.
In Episode 7, Neha, Ayan, and I went deep on where the AI tooling landscape is actually heading — not the flashy benchmark numbers, but the quieter, more structural question of where value actually accumulates when you're building with these models day to day. We also got into our current model rankings, some honest reflections on context management, and the geopolitics of open-weighted models. A lot to unpack.
What Even Is a Harness?
We spent a good chunk of this episode making sure everyone — including Neha, who asked exactly the right beginner questions — understood what a harness actually is, because it's a term that gets thrown around without much explanation.
The short version: a harness is the environment from which you invoke a model. It's not the model itself. Claude Code is a harness. Codex is a harness. Open Code, Pear AI — these are all harnesses. The model is the engine; the harness is the chassis, the wheels, the dashboard, and the safety systems.
Ayan put it well: think of it as a small personal operating system for working with AI. When you spin up Claude Code, everything it knows about your project — the tools it can call, the MCP servers it has access to, the files it's allowed to read or modify — that's the harness doing its job. The model underneath is almost secondary to that configuration.
This framing matters more than it might seem, because most of the discussion in AI communities focuses almost entirely on the model layer. Which model is smarter? Which scored higher on some benchmark? But as the gap between models closes — and it's closing fast — the harness becomes the real differentiator.
System Prompts: The Invisible Guardrail
The harness conversation naturally led us into system prompts, which are worth understanding even if you never write one yourself.
When you make any API call to a language model, there are effectively two layers of instruction: the user prompt (what you're asking it to do right now) and the system prompt (a set of prior instructions that shape how the model behaves before it even sees your request). Harness providers write and maintain the system prompt on your behalf. When you're using Claude Code, for example, there's a substantial system prompt already in place — one that includes things like "you cannot write to a file without reading it first."
That might sound like a small constraint, but it's actually a meaningful safety net. Ayan made the point that when he's testing a brand-new model with zero track record — say, something that just appeared on Hugging Face — running it inside Claude Code's harness gives him some confidence that it won't do something catastrophic to his file system. The system prompt contains the guardrails that the model would otherwise have no knowledge of.
Compare that to making a raw API call with no system prompt at all. In that case, the model is fully stateless: it has no knowledge of your environment, your constraints, or your preferences. It'll do its best with what you give it, but you've removed all the scaffolding. That's fine in controlled situations — and sometimes it's exactly what you want — but it's worth understanding that the absence of a system prompt is itself a decision.
This came up in the context of Ayan's loop engineering work, where a repair agent is given a broken Scrapy spider and asked to fix it. Without a system prompt explicitly saying "fix the spider, not the HTML," there's a non-trivial chance the model decides the easiest path is to rewrite the HTML so the spider works — which is obviously useless for scraping a real website you don't control. The system prompt closes that ambiguity.
Harness Philosophy: Safety vs. Control
Ayan raised an interesting tension here. Claude Code's system prompt is famously large — people have complained about how much context window it consumes. But that size is the point: it's load-bearing. The guardrails are in there precisely because they need to be there.
Smaller, leaner harnesses like Pear AI trade some of that safety net for efficiency and flexibility. That's a legitimate trade-off depending on your use case, but it's worth going in with eyes open. The analogy I kept coming back to: it's like the difference between running a curated Linux distribution versus compiling everything from scratch. Full control is genuinely available; you just need to know what you're giving up.
There's also a broader architectural debate happening in the harness space right now. One school of thought favours multiple parallel agents running simultaneously. Another — and Ayan's firmly in this camp, with his background in Linux kernel development — favours sub-agents: spawning child processes that have their own independent context windows and report back only what's relevant. Parallel agents share a context window and get noisy; sub-agents stay scoped. It's forks versus daemons, essentially, just applied to AI orchestration.
The Model Rankings (As of Right Now)
With all that said about harnesses, we did still talk about the models themselves — because they're not irrelevant, just less decisive than the infrastructure around them.
Ayan's current ranking, based on his own testing:
- Fable 5 — his favourite, but unavailable to us without an NDA arrangement with Anthropic that isn't happening. A shame.
- GPT 5.5 — a significant leap from GPT 5.4, and accessible via Codex subscription without needing direct API spend. Surprisingly good value.
- Claude Opus 4.8 / GLM 5.2 — roughly equivalent at this level. GLM 5.2 is particularly strong for front-end and design work.
- Kimi 2.7, DeepSeek V4, Composer, Gemini 3.5 Flash — the tier below, each with specific strengths.
My own usage is a bit more prosaic. Day to day I lean on Sonnet 4.6 and Opus 4.8 because we have access to them. Outside of that, I've been using GLM 5.2 a fair amount for building and coding — it's thorough, sometimes almost too thorough, but very capable. Deep Seek V4 Pro is also a regular for me when I want a good price-to-performance ratio. And my Hermes agent runs on Deep Seek Flash, because all it needs to do is process text, respond, and save things — there's no point using a frontier model for that.
Neha's been spending most of her time in Codex lately, having drifted away from Claude Pro. Honestly, at the current price point, Codex is hard to argue with.
One genuinely fun data point: Ayan mentioned a site called inthewids.com that runs your name across multiple models to see which ones have training data about you. Deep Seek apparently knows quite a lot about me as a figure in the web scraping community. Nobody else got a hit on Deep Seek. Proof that lurking in the right corners of the internet does eventually compound.
The Token Cost of Being Thorough
One practical note that came out of both Ayan's and my experience with GLM 5.2: it's verbose. Extremely thorough, often to a degree beyond what's actually needed. There are tools like Caveman (which restricts output style to reduce token count) and Ponytail (which apparently adjusts both input and output for minimal, functional code) that can help with this.
The deeper point is that model cost isn't just a function of the model's base rate — it's also a function of how much output it generates. A cheaper model that produces twice the tokens can end up costing more than a pricier model that's more concise. Worth keeping in mind if you're running long agentic loops, which leads to...
Speed Doesn't Matter the Way You Think It Does
We started the episode by comparing GLM 5.2 to Claude Sonnet, and one of the first things I noticed was that GLM 5.2 felt slower. Ayan quickly pointed out that the slowness was almost certainly an Open Router routing issue — different providers offer wildly different inference speeds for the same model, and Open Router's default routing doesn't always prioritise speed.
But more interestingly: does it even matter? If you're using a model interactively, token-per-second speed is very real and very annoying when it's slow. But most of the work I'm building toward is automated — the model runs, does its thing, I come back to a result. In that context, a model that takes twice as long to run is almost never the bottleneck. This is especially true for the kind of multi-hour agentic workflows that Ayan mentioned are becoming more common, where a benchmark and testing mechanism just loops until a goal is reached.
Context Management, and Knowing When to Walk Away
Near the end we talked about context management hacks. I'll be honest: mine are non-existent. Heavy, inefficient context windows are my natural state. The closest thing I do is occasionally save a plan to a markdown file so I can use it as a clean handoff in a new session.
Ayan's approach is more systematic. He uses /compact and /clear to manage Claude Code sessions, and — importantly — asks the model to write a handoff message before clearing, summarising what was accomplished and what comes next. That message then seeds the next agent session. It's essentially a baton pass: no context is lost, but the window stays clean.
He also leans on sub-agents for research tasks. Rather than loading everything into one expanding context, he has the main agent spin up several Haiku or Sonnet sub-agents, each with their own independent context window, to fetch and process information in parallel. Only the relevant output comes back — not the noise.
Neha's answer was the most honest of all: she's at peak AI fatigue, wants to put down the tools, fill up her context window with actual humans and beaches, and remember what creativity felt like before it got outsourced. Which, honestly, is worth hearing. Even for those of us who are deep in this stuff, the signal-to-noise ratio of AI tooling can degrade your own thinking if you're not careful.
The Open-Weighted Question
There's a broader political dimension to all of this that we touched on: the US government apparently requested a staggered rollout of GPT 5.6, with individual user approvals. Ayan's quietly rooting for open-weighted models to win precisely because he doesn't want a government gatekeeping his access to inference.
The irony is real: for decades, open source was largely a Western project — Linux, Apache, the whole stack. Chinese labs were the closed ones. AI has inverted that completely. The best open-weighted models right now are coming out of Chinese labs (GLM, DeepSeek, Kimi), with permissive licenses and public weights. Meanwhile, frontier US models are getting more restricted, not less.
The gap is closing fast. Ayan's estimate: six months before open-weighted models are genuinely on par with state-of-the-art frontier models for most tasks. At that point, the question of self-hosting becomes much more interesting — even if true self-hosting (running a capable model on your own hardware without renting GPU time) is still a way off for most people.
His prediction for the near future: people will take these open-weighted base models and fine-tune them for specific use cases. GLM 5.2 for web scraping. GLM 5.2 for front-end design. Smaller, purpose-built variants that don't require racks of GPU memory to run. That's happening already on Hugging Face.
One Last Thing
We're going to be at EuroPython in Kraków. If you're there, come find us at the booth and mention the podcast — we'll sort you out with some swag. And the loop engineering meetup recording should be live in the end cards of this episode.
See you next week.
Extract Data is a weekly podcast on web scraping, data engineering, and the AI tools that are changing how we build. Hosted by John Rooney, Ayan Pahwa, and Neha Setia.






_HFpro5d6k3.png&w=256&q=75)
_E4PyVpfAxa.png&w=256&q=75)


-(1).png&w=1920&q=75)
-(1)_VZGHqxCgXV.png&w=1920&q=75)