Why I'm adding GLM-5.2 to my agentic coding arsenal

I have a habit that occasionally annoys the people who pair with me: I am always swapping models in and out of my stack. The setup I described in my agentic coding setup already mixes vendors freely, with Claude doing the heavy planning and execution while a separate orchestration runs DeepSeek as the coder and Qwen as the reviewer, and I keep an OpenRouter key with loaded credits around precisely so I can throw a new model at real work the week it lands without committing to anything. When z.ai open-sourced GLM-5.2 couple days ago, a wave of "this beats the frontier" posts went up across the forums, and rather than believing whether the headlines are true I decided to test it myself, because my workflow will have to outlive any new model and there has been a lot of such launches lately.

How I evaluate a new model

What I trust is putting the model on the work I genuinely do and watching it behave, which for me means web scraping: generating Scrapy spiders and selectors, extracting structured data from messy HTML, tool-calling and planning the shape of a scraping project before any code gets written.

I send the identical prompt to the new model so nothing is biased by wording, I price every response at that model's published per-token rate to get a directional cost picture rather than an accounting-grade one, and I reach the open model through OpenRouter because that is one of the easiest way to trial anything new while the Claude models run on my normal account. Where I can score a result objectively I do, checking whether generated code compiles, whether extracted JSON is valid, and whether it covers the schema I asked for, and then I read every single output myself, because the automated scores lie more often than you would like.

What GLM-5.2 is, sourced

GLM-5.2 is a 753-billion-parameter Mixture-of-Experts model with around 40 billion parameters active per token, a one-million-token context window, and an MIT license, per its model card. On the one independent yardstick available, Artificial Analysis's Intelligence Index, it scores 51, which makes it the top open-weight model and fourth overall behind Claude Fable 5, Opus 4.8, and GPT-5.5, and it posts about 78% on Terminal-Bench 2.1, as Artificial Analysis reported. Its own published benchmarks trail Opus 4.8 on essentially everything, usually by a narrow margin that widens on the hardest long-horizon tasks, and you should note that press coverage widely cites a 744-billion-parameter figure while the official card says 753 billion, a small discrepancy I am flagging rather than papering over.

You will also read that it was trained entirely on Huawei Ascend chips with zero NVIDIA hardware, which I would treat with care, because Z.ai never made that claim for GLM-5.2 specifically, there is no technical report, and the figure of 100,000 Ascend chips is inherited from the broader GLM-5 family and remains unaudited, with skeptics at The Register calling a related claim "sophistry." It is probably directionally true and it would matter a great deal if it were, but I am not going to assert it as an established fact.

In frontend benchmarking it even beat Opus and is just next to Fable5, but since I don’t do much frontend, I didn’t test it in that department.

What it did on my work

On execution, GLM-5.2 reached parity with Sonnet, and that genuinely surprised me. The extraction tasks were a dead heat, with both models hitting 100% schema coverage and correctly pulling every item from a listing page as valid JSON, and the spiders were all idiomatic and all compiled, using Scrapy conventions like response.follow for pagination alongside sensible selectors and regex price cleaning. For everyday scraping code I could not reliably tell which output came from which model without looking at the filename.

On planning, the part where I normally spend the expensive Opus tokens, it came in roughly 8.5 times cheaper, which is where the cost picture gets hard to ignore.

Task	GLM-5.2 cost	Current model	GLM cheaper by
Project planning	$0.013	Opus 4.8 ($0.110)	8.5×
Selector generation	$0.004	Sonnet 4.6 ($0.015)	4.2×
Extraction to JSON	$0.019	Sonnet 4.6 ($0.036)	1.9×
Simple spider	$0.005	Sonnet 4.6 ($0.006)	1.1×

Across the whole suite the totals came to $0.068 for GLM-5.2, $0.122 for Sonnet, and $0.212 for Opus, which tells a clear story as long as you read the next section before

Where it loses, because it does

If I stopped at the cost table I would be selling you something, so here are the two real problems I hit. The first is that GLM-5.2 is verbose, and verbosity is not free, because it spent 1,144 output tokens writing a spider that Sonnet completed in 357, and it used 69 tokens simply to reply with the word "PONG." Its per-token price is low enough that it still wins on cost, but the verbosity erodes the headline advantage and makes the model noticeably slower, taking 24 to 55 seconds against Sonnet's 8 to 20. If you do adopt it, tame this at the source by turning its reasoning effort down and prompting for terse, code-only output, and although the various "talk like a caveman" skills such as caveman package the same terseness trick for Claude Code, it is worth understanding that they shape output style and will not trim the hidden reasoning tokens where much of GLM's bloat actually lives, which is exactly what the native reasoning-effort setting controls.

The second problem is that its knowledge of our own API was staler than Sonnet's. On the task that wired a spider to Zyte API, neither model reached for Zyte's current one-line addon setup and both fell back to the older manual middleware wiring, but Sonnet's configuration would actually run while GLM's was incomplete and would error on startup, because it omitted the required asyncio reactor and request-fingerprinter settings. The caveat that matters here is that I gave neither model web access in this test, so this measures training knowledge rather than live documentation lookup, and defaulting to an outdated pattern is the textbook symptom of training-data lag. Drive either model inside an agentic loop with documentation-fetching tools, which is how you would really run it in anger, and it can pull the current docs, a shift my colleague unpacks in what's becoming of web scraping developers in the age of AI agents. Stripped of tools, though, Sonnet was simply more current.

Running it inside a harness, which is where I actually work

Raw completions are only half the picture, because my real work happens inside Claude Code and OpenCode, where the model has to drive a loop rather than answer once: call a tool, read the result, decide the next step, and know when not to bother. So I ran a second probe aimed squarely at that, and GLM-5.2 handled everything I threw at it, emitting a correct web search call with valid arguments, choosing the file-reading tool over search when the task demanded it, continuing the conversation sensibly after I fed a tool result back, and showing the restraint to answer "what is two plus two" directly instead of reaching for a tool it did not need. Sonnet did the same, and the only visible difference was speed. That parity is the real headline for agentic use, because a model that writes beautiful code but bungles its tool calls is dead weight in a loop.

The cost gap held up inside that loop too, because running the whole tool-calling probe through OpenRouter brought GLM-5.2 in at roughly $0.002 against Sonnet's $0.013 at published rates, close to seven times cheaper, and the raw logs explain why: GLM stayed terse here, spending 20 to 30 output tokens per tool call rather than the rambling it produces on open-ended code, so its verbosity turns out to be task-dependent rather than constant. Two details in those logs are worth dwelling on. The Sonnet rows read $0.00 because I route Anthropic through my own key under OpenRouter's bring-your-own-key (BYOK) mode, which bills my Anthropic account directly rather than OpenRouter credits, so the honest comparison is the published-rate figure rather than the dashboard total. The speed column, meanwhile, is a lesson in provider routing, because GLM served from Z.ai's own infrastructure crawled along at three to five tokens per second while the same model served from a different provider ran at nearly 30, which means the latency I flagged earlier is as much about where you route the request as about the model itself.

OpenRouter request logs comparing Claude Sonnet 4.6 and GLM-5.2 on tool-calling requests, showing input and output token counts, cost, and tokens-per-second speed

Figure : My own OpenRouter logs for the tool-calling run. The Sonnet rows show $0.00 because they are billed through my own Anthropic key under BYOK, while GLM-5.2's genuine per-request cost and the wide speed gap between the Z.ai and third-party providers are both visible.

The practical surprise is that you do not have to leave your harness to use it, because Z.ai ships an Anthropic-compatible endpoint, so pointing Claude Code at GLM-5.2 is mostly a matter of environment variables, after which your MCP servers, skills, and hooks keep working unmodified, as Z.ai documents in its Claude Code guide. OpenCode, where I already run DeepSeek and Qwen, takes the same endpoint or an OpenRouter route through its provider config.

That [1m] suffix points at the model's defining feature in a harness, a one-million-token context window that is five times larger than its predecessor and big enough to hold a sprawling codebase or a long, messy scraping-agent trajectory without the harness summarizing it away. The window changes how you manage context rather than freeing you from managing it: you raise the auto-compact threshold so Claude Code uses the full window before it starts summarizing, and you lengthen the request timeout, because a verbose reasoning model can think for a long time before the first token arrives and the default timeout will kill the call mid-flight. The flip side is that the window is not free, since a model this wordy fills context faster than a terse one and a near-full million-token prompt is real money even at $1.40 per million input tokens, so the right mental model is a capability to wield deliberately rather than an excuse to stop pruning context.

My verdict

Good at: structured extraction from messy HTML, where it matched Sonnet on accuracy and schema coverage, and idiomatic everyday Scrapy code, both produced at a fraction of the cost while comfortably handling enormous inputs thanks to the one-million-token context, and crucially it drives a tool-calling loop as reliably as Sonnet, which is what makes it usable in a harness at all.

Weak at: brevity and speed, since it burns tokens and wall-clock time, and recall of fast-moving library specifics from memory alone, where its training lag showed against a more current Sonnet.

Where I would use it: as a cost-sensitive workhorse for high-volume, routine scraping work, ideally inside an agentic loop that can fetch current documentation to cover the staleness, and self-hosted whenever data residency rules out a hosted API.

What it can drop in for: Sonnet on standard execution tasks today, and Opus on routine planning when you want the dramatic cost saving and can tolerate the slower, wordier output, though I would not yet hand it the gnarliest long-horizon planning where Opus still pulls ahead, and it slots straight into Claude Code or OpenCode through Z.ai's Anthropic-compatible endpoint so adopting it never means abandoning the harness you already use.

Hedging your agentic setup

There is a strategic reason to keep an open-weight model in the arsenal beyond cost, and GLM-5.2 is not even the only candidate, because Kimi K2.6 (Moonshot), DeepSeek V4 (MorphLLM), and Qwen 3.5 (review) each beat or matched a frontier model on something this year, and all of them ship downloadable weights. The value of holding two or three of these wired in became vivid on June 13, 2026, when the US Commerce Department barred Anthropic from serving its top Fable 5 and Mythos class models to any foreign national and Anthropic disabled them globally to comply, as Fortune reported, the first export action aimed at an AI model rather than a GPU or chipset, almost right after the moment I unpacked using it in - where Fable 5 and web scraping fit into loop engineering.

A closed model can be rolled back, repriced, or legislated out from under you with no notice, whereas MIT-licensed weights already sitting on your disk cannot be revoked, and self-hosting them is also the cleanest answer to routing your data from infra you own or trust.

What this leaves me thinking about

I am not ripping Opus and Sonnet out of my workflow, because they are faster and more current and I still want the best planner I can get for the hardest problems, yet the exercise quietly moved a belief I did not realize I was holding. Swapping a frontier model for an open one on a meaningful slice of my work turned out to be a twenty-minute test rather than a migration project, and once that is true, loyalty to any single model is mostly untested habit dressed up as a standard. The durable advantage was never the model, because the leaderboard will reshuffle again before this post is a month old; it is the small, unglamorous harness that lets me find out, in an afternoon, what should run where. So the question I would leave you with is the one this experiment really answered for me: if your default model vanished tomorrow, banned, deprecated, or simply beaten by something open and cheaper, how quickly could you prove what should take its place? If the honest answer is that you do not know, then the most valuable thing you can build this quarter is not another spider, it is the test that answers this question for you.