
A classifier is mostly plumbing

At Flowstate we’ve been chewing on a problem that looks trivial until you try it: route each incoming request to the cheapest model that can actually handle it. To route a prompt you first have to work out what it wants, and prompts turn up as pure human chaos. Typos. A 400-line code dump with the real ask buried on the last line. “hey can u look at this”. You have to pull a clean intent out of that, and you have to do it in well under a millisecond, in-process, on a CPU, for nothing — a semantic judgement in a memory footprint smaller than a JPEG, while every blog post on the subject swears blind you need a rack of H100s.

There’s an industry-standard playbook for exactly this. Curate a golden dataset of 200–500 human-labelled prompts. Validate a teacher LLM against it and don’t proceed below an F1 of 0.90. Distil the teacher into a small student model — SetFit, a BERT variant, something with embeddings. Plot confusion matrices. Micro-benchmark p99 latency. Run a 48-hour shadow deployment before you let it touch a single real request. It’s a lovely playbook — the kind written by researchers with infinite compute and no production pager. It’s also a fantastic way to over-engineer yourself into a corner.

The actual job was narrow: tag each incoming prompt with a task type — parent → child, like engineering → fix_bug or data → spreadsheet_edit — so the proxy routes it somewhere sensible instead of sending everything to the most expensive model on the menu. CPU only, no GPU in the hot path, a lot of requests. What makes training your own classifier sane rather than mad is that Flowstate already sits on a mountain of real requests — they’re just unlabelled. That’s the one good use for a cheap teacher LLM. Not as a doorman taking a cut of every request; you point it at the pile once, let it label everything, and throw it away. Use the expensive thing exactly once.

The tollbooth paradox

The obvious objection, which kept me staring at the ceiling: skip the local classifier entirely. Call a cheap, fast model — Flash, Haiku, whatever’s at the bottom of the pricing table — let it tag the prompt, and be done by lunchtime. It’d be more accurate than anything I could train, too.

But that’s the exact trap the whole architecture exists to escape. The point of tagging a prompt before it hits a model is to spend less — send the easy stuff somewhere cheap, only pay frontier prices when you actually have to. If the router itself makes a paid API call on every single request, you’ve built a tollbooth in front of your tollbooth. You’re paying for a ticket to ride the bus just to ask the driver if it’s the right bus. You’ve burned the savings before you’ve earned them, bolted a network round-trip onto every request, and posted a copy of every prompt to someone else’s logs. A classifier running in-process on CPU you already own is free per request — not cheap, free — forever after it’s trained. So a clumsy 92% model that costs nothing can be worth more than a 99% one that bills you by the token. You want a fast, clumsy bouncer here, not a slow, expensive philosopher.

The catch is the word “trained”. A model like this is hungry — it wants far more labelled examples than I’d ever hand-label for fun. So the teacher earns its keep exactly once: label the pile offline, and the classifier runs for nothing ever after.

Before arguing about any of this, I did what I usually do: built a small rig and measured.

One confession first

I haven’t pointed the teacher at the real pile yet, so for this first pass I generated a stand-in. A few thousand synthetic prompts across eighteen task types, with deliberate ambiguity baked in so the classifier had something to actually get wrong — prompts like “look at the broken endpoint”, which is honestly fix_bug or debug and you can’t tell from the text. Roughly 2,800 to train on, 600 to test.

This matters, so I’ll say it loudly: synthetic data labelled by the thing that generated it is a rigged game. It tells you whether the pipeline works and how the approaches rank against each other. It does not tell you real-world accuracy. And it quietly flatters simple models, because templated text leaves lexical fingerprints that a bag of words hoovers up. Keep that in your back pocket; I’ll come back to it.

With that nailed to the door, on to the numbers.

The model is the boring decision

Five configurations across four approaches, cheapest to fanciest. A plain TF-IDF bag of words into logistic regression. The same with character n-grams and an SVM. A hashing vectoriser into an SGD classifier, run at two hash sizes. And the one the playbook actually wants you to build: sentence embeddings (a quantised BGE transformer) into a linear head.

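| Approach            | Leaf acc | Parent acc | p95 (1 req) | p95 (under load) | Size   |
|---------------------|----------|------------|-------------|------------------|--------|
| TF-IDF + logreg     | 0.923    | 1.000      | 0.18 ms     | 0.17 ms          | 336 KB |
| TF-IDF + char + SVM | 0.926    | 1.000      | 1.4 ms      | 29.6 ms          | 3.2 MB |
| Hashing (2²⁰) + SGD | 0.926    | 1.000      | 20.7 ms     | 42.4 ms          | 147 MB |
| Hashing (2¹⁸) + SGD | 0.924    | 1.000      | 4.1 ms      | 6.8 ms           | 37 MB  |
| Embeddings + logreg | 0.918    | 0.997      | 6.0 ms      | 20.8 ms          | 131 MB |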

The accuracy column is the dull one. Every approach lands within a percentage point of the others, all clustered around 92%. The 131MB transformer, the one you’d write a design doc to justify, came last, and was the only one that couldn’t get the coarse parent label perfectly right. The 336KB bag of words, an idea older than most of the frameworks in your package.json, edged it by half a point and answered in under a fifth of a millisecond.
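For anyone who wants to poke at the cheapest row, here is a minimal sketch of a TF-IDF bag of words feeding logistic regression, assuming scikit-learn. The toy prompts, the "parent/child" label strings, the `research` parent name and the vectoriser settings are all invented for illustration; they are not the settings from the rig.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data: prompts and labels are made up for this sketch.
train_prompts = [
    "fix the crash in the login handler",
    "the total is wrong, fix the bug",
    "add a column to the budget spreadsheet",
    "sort the spreadsheet rows by date",
    "find recent articles about battery prices",
    "search the web for competitor pricing",
]
train_labels = [
    "engineering/fix_bug",
    "engineering/fix_bug",
    "data/spreadsheet_edit",
    "data/spreadsheet_edit",
    "research/web_research",
    "research/web_research",
]


def build_baseline():
    """Word uni- and bigrams, TF-IDF weighted, into a linear classifier."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        LogisticRegression(max_iter=1000),
    )


model = build_baseline().fit(train_prompts, train_labels)

# predict_proba returns one probability per class, in model.classes_ order.
probs = model.predict_proba(["please fix this bug"])[0]
best = probs.argmax()
print(model.classes_[best], round(float(probs[best]), 3))
```

The whole model is a vocabulary plus one weight per (term, class) pair, which is why it can stay in the hundreds of kilobytes.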

Latency is the column that isn’t flat — two orders of magnitude between top and bottom. The only axis these approaches really differ on is the operational one, and there the dumbest option wins outright.

The parent column hides a quieter result, sitting at a flat 1.000 for almost everyone. All the confusion lives within a domain — viz mistaken for data_analysis, market_research for web_research. Nothing ever confuses a spreadsheet for a bug report. If the proxy only needs the coarse domain, the cheapest thing on the list had already solved it and I could have gone home.
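The two accuracy columns can be computed from the same predictions. A small sketch, assuming labels are stored as "parent/child" strings (my assumption about the encoding) and using invented parent names for the viz and web_research examples:

```python
def parent_of(label: str) -> str:
    """Return the coarse domain from a 'parent/child' label."""
    return label.split("/", 1)[0]


def leaf_and_parent_accuracy(y_true, y_pred):
    """Return (leaf accuracy, parent accuracy) for paired label lists."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(y_true, y_pred))
    leaf_hits = sum(t == p for t, p in pairs)
    parent_hits = sum(parent_of(t) == parent_of(p) for t, p in pairs)
    return leaf_hits / len(pairs), parent_hits / len(pairs)


# Hypothetical predictions: both mistakes stay inside their own domain.
y_true = ["data/viz", "data/data_analysis",
          "engineering/fix_bug", "research/web_research"]
y_pred = ["data/data_analysis", "data/data_analysis",
          "engineering/fix_bug", "research/market_research"]

leaf_acc, parent_acc = leaf_and_parent_accuracy(y_true, y_pred)
print(leaf_acc, parent_acc)  # 0.5 1.0
```

Half the leaf labels are wrong in this toy set, yet the parent score is perfect, which is the same shape as the table.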

So if the model barely matters, what does?

The plumbing

Four things ate far more of my attention than the model ever did, and none of them shows up in any guide.

Footgun one: the hashing vectoriser will quietly eat 147 megabytes

The hashing model at 2²⁰ features clocked 20.7ms and 147MB of resident memory, for a task with eighteen classes. That’s a fiftieth of a second of latency and a small video file of RAM to decide whether someone typed “fix this” or “why is this broken”.

The cause is dull and entirely self-inflicted: a 2²⁰ feature space times eighteen classes is a dense coefficient matrix the size of a holiday photo album, and every prediction drags its hand across the whole thing. Drop the hash to 2¹⁸ and it gets five times faster and four times smaller for the same accuracy. The default value is the footgun, and nobody warns you it’s loaded.
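The arithmetic is worth doing once. This sketch assumes the coefficients are stored densely as 8-byte floats, which is my assumption rather than something I measured, and `dense_coef_bytes` is a helper made up for the purpose:

```python
def dense_coef_bytes(n_features: int, n_classes: int,
                     bytes_per_weight: int = 8) -> int:
    """Size of a dense (n_classes x n_features) weight matrix in bytes."""
    return n_features * n_classes * bytes_per_weight


for exponent in (20, 18):
    size = dense_coef_bytes(2 ** exponent, n_classes=18)
    print(f"2^{exponent} features x 18 classes: {size / 1e6:.1f} MB")
# 2^20 features x 18 classes: 151.0 MB
# 2^18 features x 18 classes: 37.7 MB
```

Both figures land in the same ballpark as the 147MB and 37MB in the table, and the ratio between them is exactly four.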

Footgun two: the “works on my machine” mirage

The character-n-gram SVM looked beautiful in isolation: 1.4ms per request, best accuracy on the board by a hair. I almost committed it and went for coffee. Then I pointed eight concurrent workers at it and its p95 ballooned to 29.6 milliseconds. A twenty-fold collapse the instant it had company.

Two reasons, both invisible to a single-request benchmark. To get probabilities out of an SVM you calibrate it, which quietly trains three sub-models and then runs all three on every prediction. And all that extra per-request CPU work piles straight into Python’s global interpreter lock, so requests politely form a queue in the burning building instead of flying past. The bag of words, doing almost no work per request, didn’t notice the load at all: 0.17ms under eight workers, the same as running on its own. Winning the median is nice. Keeping its head when everything’s on fire is the bit you care about at 3am.
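The shape of that test is easy to reproduce. A rough sketch, assuming thread-based workers and a nearest-rank p95; `fake_predict` is a stand-in for whatever model is under test, and none of this is the rig’s actual code:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def p95(samples):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def timed_call(predict, prompt):
    """Latency of one prediction in milliseconds."""
    start = time.perf_counter()
    predict(prompt)
    return (time.perf_counter() - start) * 1000.0


def bench(predict, prompts, workers=1):
    """p95 latency with `workers` threads issuing requests concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda p: timed_call(predict, p), prompts))
    return p95(latencies)


def fake_predict(prompt):
    # Stand-in for a real classifier: a little pure-Python CPU work.
    return sum(ord(ch) for ch in prompt) % 18


prompts = ["why is this broken"] * 400
alone = bench(fake_predict, prompts, workers=1)
loaded = bench(fake_predict, prompts, workers=8)
print(f"p95 alone: {alone:.3f} ms, p95 with 8 workers: {loaded:.3f} ms")
```

Each call is timed inside its worker, so time spent waiting on the interpreter lock mid-prediction shows up in the tail. The thing to watch is the gap between the two numbers for the same model.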

The glue: truncation is the difference between 17% and 83%

This one made me feel like an idiot. I threw a stress test at the winning model: a 400-line Java blob with one crucial instruction tacked onto the last line — “now fix the bug that makes the total wrong”. The real shape of a prompt in a coding tool.

It scored 16.7%. It lost its mind and labelled almost everything refactor — because to a bag of words a wall of code is a refactor, and the one human instruction at the bottom drowned in the noise.

The fix wasn’t a fancier model or an embedding matrix. It was four lines that throw away everything but the last 200 characters and classify those.

Seventeen percent to eighty-three, just by changing what the model was allowed to see rather than which model it was. (Tail-only wins when the instruction’s at the end; head-plus-tail is the safer general bet, in case the ask is up top.) The model was fine all along. The plumbing wasn’t.
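Both variants fit in a few lines. A sketch of the idea rather than the exact code from the rig: the 200-character tail comes from the experiment above, while the head size and the separator are assumptions.

```python
def tail_only(prompt: str, tail_chars: int = 200) -> str:
    """Keep only the last `tail_chars` characters of the prompt."""
    if tail_chars <= 0:
        raise ValueError("tail_chars must be positive")
    return prompt[-tail_chars:]


def head_plus_tail(prompt: str, head_chars: int = 200,
                   tail_chars: int = 200, sep: str = "\n") -> str:
    """Keep the start and the end, dropping the middle of long prompts."""
    if head_chars <= 0 or tail_chars <= 0:
        raise ValueError("head_chars and tail_chars must be positive")
    if len(prompt) <= head_chars + tail_chars:
        return prompt  # short enough to classify whole
    return prompt[:head_chars] + sep + prompt[-tail_chars:]


# A wall of code with the real ask on the last line.
ask = "now fix the bug that makes the total wrong"
blob = "int total = 0;\n" * 400 + ask

print(len(blob), len(tail_only(blob)), len(head_plus_tail(blob)))
print(tail_only(blob).endswith(ask))  # True
```

Either function runs before the vectoriser, so the model itself never changes.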

The other glue: teaching it to say “I don’t know”

A tagger that confidently mislabels is worse than one that abstains. Conveniently, the model already knew when it was guessing: its wrong answers on ambiguous one-word prompts (“debug”, “?”) came with rock-bottom confidence. So I swept a confidence threshold: below it, route to a catch-all instead of guessing.

It’s a clean trade-off. Hold the threshold around 0.7 and you still answer 90% of prompts, accuracy on the answered ones climbs to 95.5%, and you catch 90% of the genuinely out-of-scope junk. Where you set it is a product call — how often you’re willing to shrug versus how often you’re willing to be wrong — and, again, nothing to do with the model.
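The mechanics are small enough to sketch. Everything here is illustrative: the `catch_all` label, the function names, and the toy scores are assumptions, and the numbers are not from the rig.

```python
def route(probs: dict, threshold: float = 0.7,
          fallback: str = "catch_all") -> str:
    """Return the top label, or the fallback when confidence is too low."""
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    return label if confidence >= threshold else fallback


def sweep(scored, thresholds):
    """For each threshold return (threshold, coverage, accuracy on answered).

    `scored` is a list of (top confidence, was the top label correct) pairs.
    """
    rows = []
    for t in thresholds:
        answered = [ok for conf, ok in scored if conf >= t]
        coverage = len(answered) / len(scored)
        accuracy = sum(answered) / len(answered) if answered else None
        rows.append((t, coverage, accuracy))
    return rows


# Invented scores for five test prompts.
scored = [(0.95, True), (0.85, True), (0.75, False),
          (0.55, True), (0.30, False)]
for t, coverage, accuracy in sweep(scored, [0.5, 0.7, 0.9]):
    print(t, coverage, accuracy)

ambiguous = {"engineering/fix_bug": 0.41, "engineering/debug": 0.39,
             "data/viz": 0.20}
print(route(ambiguous))  # catch_all
```

Raising the threshold trades coverage for accuracy on what is left, and the sweep puts both numbers side by side so the product call can be made with the trade-off in view.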

Where this is rigged, and where it isn’t

Back to the confession. A sharp reader is already typing: your data is synthetic and lexically tidy, so of course the bag of words won — embeddings earn their keep on messy real prompts, which you never tested. That reader is right, and I’d take their bet. On real traffic — typos, half-thoughts, three languages in one sentence, the same intent phrased a hundred ways — I fully expect embeddings to pull ahead. The model bake-off, specifically, is the part of this you should trust the least.

The honest test is sitting right there: point the teacher at the Flowstate pile, label it once, and rerun the bake-off on real prompts — that’s when you’d learn whether 92% was the data being kind or the approach being sound. If I get to it, it’s its own post.

The plumbing, though, doesn’t care what data you feed it. The hashing vectoriser eats 147MB regardless. The calibrated SVM collapses under concurrency on any input. Truncation decides whether the model sees the instruction at all. The confidence threshold is a property of the deployment, not the dataset. Those findings survive the caveat intact — and they were most of the work.

What I actually learned

  • Measure before you architect. I could have spent a week arguing about SetFit versus BERT. An afternoon with a script told me the model was the least interesting variable in the system.
  • Single-request benchmarks lie. The SVM looked best alone and worst under load. If your proxy serves concurrent traffic, benchmark concurrent traffic.
  • Defaults are footguns. A 2²⁰ hash space for eighteen classes is 147MB of nothing. Know what the knob does before you leave it where you found it.
  • Truncation is a model. Choosing what to show the classifier moved accuracy more than any architecture choice — 17% to 83% from four lines.
  • Let it abstain. A confidence threshold turns “sometimes confidently wrong” into “usually right, occasionally honest”. That’s a dial worth having.
  • The boring baseline is the thing to beat, not the thing to skip. Start with 336KB and a straight face. Reach for the 131MB transformer when you’ve got real data proving you need it — and not one commit before.

The fancy playbook wasn’t wrong, exactly. It was just optimising the one decision that didn’t matter, and silent on the four that did. The whole industry will sell you an H100 cluster to read a prompt; the thing that actually moved the needle was four lines of string slicing. Most of a classifier is plumbing — and plumbing doesn’t photograph well, which is probably why nobody writes the guide for it.

The whole rig is about 600 lines of Python. It’s work code, so I’m not putting it up — but everything that matters is in this post: the five approaches, the concurrency test, the truncation fix, the threshold sweep. Given the caveat, holes are exactly what I’m after, so aim at the method.