Matching frontier LLMs with diverse small ensembles

Anyone running LLMs in production hits the same wall pretty quickly: top-tier API models like GPT-5 and Claude Sonnet are good and expensive; smaller models like GPT-4o-mini and Claude Haiku are cheap but more error-prone. The usual move is to pay for the bigger one wherever quality matters, even on requests the small one could have handled fine. You end up paying for a safety margin.

There’s an old idea from classical ML worth reaching for here: ensembles. Random forests, model averaging, and mixture-of-experts are all variations on the same principle. Combine several estimators whose errors are largely uncorrelated, and the combination tends to beat any single component. The math is decades old. What’s new is the economics: three small LLM calls in parallel cost a small fraction of one top-tier call (often an order of magnitude less) and finish in the time of the slowest of the three, not the sum. If a small ensemble matches top-tier accuracy on your workload, you’ve removed a multiplier from your bill.

The question is whether it matches — and what makes it match.

I built a small tool called emerge to investigate. It’s an OpenAI-compatible proxy: applications point at it, change nothing else, and each request fans out to several configured models in parallel, gets combined by an aggregator, and returns one answer. On a 150-case mixed-task benchmark (MMLU multiple-choice reasoning, GSM8K grade-school math, TriviaQA factual recall), a 3-model ensemble landed at:

| Configuration | Accuracy | Cost (per 150 evals) | Latency |
|---|---|---|---|
| 3-model ensemble (emerge) | 74% | $0.042 | 1,771 ms |
| Claude Sonnet | 74% | $0.27 | 2,617 ms |
| GPT-5 | 73% | $0.54 | 4,307 ms |

Same accuracy band as Sonnet, slightly above GPT-5, at roughly one-sixth and one-thirteenth of the cost, respectively. Lower latency too.
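The cost ratios follow directly from the table. A quick check, using only the three cost figures reported above (the helper name `cost_ratio` is mine):

```python
# USD per 150 evals, copied from the results table.
COSTS = {"emerge": 0.042, "Claude Sonnet": 0.27, "GPT-5": 0.54}


def cost_ratio(model: str, baseline: str = "emerge") -> float:
    """How many times more the given model costs than the baseline."""
    return COSTS[model] / COSTS[baseline]


for name in ("Claude Sonnet", "GPT-5"):
    print(f"{name}: {cost_ratio(name):.1f}x the ensemble's cost")
# Claude Sonnet: 6.4x the ensemble's cost
# GPT-5: 12.9x the ensemble's cost
```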

[Figure: cost per 150 evals (lower is better). emerge $0.042, Sonnet $0.27, GPT-5 $0.54.]

But the headline number is the easy part of the story. The interesting part is the constraint: this only works when the models in the ensemble are genuinely diverse. Three copies of the same base model give you nothing. Three small models from the same provider give you very little. The win comes from independent error modes — and you only get those by spanning providers and architectures.

The rest of this post is about why that constraint matters, what the proxy actually does, and where ensemble inference does and doesn’t make sense.

What emerge does

Under the hood, the proxy dispatches each prompt to N configured models in parallel, waits for all responses, runs them through an aggregator, and returns one combined answer.

[Figure: client → emerge proxy → OpenAI, Anthropic, Ollama (local) → aggregator. Solid arrows show parallel dispatch; dashed arrows show responses returning for aggregation.]
Each request fans out to all configured providers in parallel. The aggregator combines responses; one answer returns to the client.
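To make the fan-out concrete, here is a minimal sketch of the dispatch-then-aggregate loop. It is not emerge’s actual code: the stub models, the `fan_out` function, and the majority-vote aggregator are all illustrative assumptions, and the sleeps stand in for real provider calls. The point is that total wall time tracks the slowest call, not the sum.

```python
import asyncio
import time
from collections import Counter
from typing import Awaitable, Callable

# A "model" here is any async callable from prompt to answer text (assumption).
Model = Callable[[str], Awaitable[str]]


def make_stub_model(answer: str, delay_s: float) -> Model:
    """Build a stand-in for a provider call that responds after delay_s."""
    async def call(prompt: str) -> str:
        await asyncio.sleep(delay_s)  # simulates network + inference time
        return answer
    return call


def majority_vote(responses: list[str]) -> str:
    """Toy aggregator: most common normalized answer; ties go to the earliest."""
    normalized = [r.strip().lower() for r in responses]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner


async def fan_out(prompt: str, models: list[Model]) -> str:
    """Dispatch to every model concurrently, wait for all, then aggregate."""
    responses = await asyncio.gather(*(m(prompt) for m in models))
    return majority_vote(list(responses))


models = [
    make_stub_model("Paris", 0.05),
    make_stub_model("paris ", 0.10),
    make_stub_model("Lyon", 0.15),
]
start = time.perf_counter()
answer = asyncio.run(fan_out("Capital of France?", models))
elapsed = time.perf_counter() - start
print(answer, f"{elapsed:.2f}s")  # elapsed is close to 0.15s, not 0.30s
```

A production proxy would also need per-call timeouts and a policy for partial failures, which this sketch leaves out.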

The configurable parts are which models to use, which providers to span, and which aggregation strategy to apply. Six aggregators are implemented:

Strategy is configurable too. The default is ensemble: all models run in parallel for every request. A cascade strategy orders models by cost and stops early if the cheapest returns a substantive answer — useful when most queries are easy and only a minority need the expensive option.
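A cascade of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not emerge’s implementation: the `Tier` type, the `is_substantive` heuristic, and its refusal phrases are hypothetical, and the sketch generalizes "stop if the cheapest answers" to "stop at the first tier that answers".

```python
from typing import Callable, NamedTuple


class Tier(NamedTuple):
    name: str
    cost_per_call: float          # USD, illustrative numbers only
    call: Callable[[str], str]    # synchronous stand-in for a provider call


def is_substantive(answer: str) -> bool:
    """Hypothetical heuristic: non-empty and not an obvious hedge or refusal."""
    text = answer.strip().lower()
    hedges = ("i don't know", "i'm not sure", "cannot answer")
    return bool(text) and not any(h in text for h in hedges)


def cascade(prompt: str, tiers: list[Tier]) -> tuple[str, str, float]:
    """Try tiers from cheapest to priciest; stop at the first substantive answer.

    Returns (answer, name of the tier that produced it, total spend).
    """
    answer, used, spent = "", "", 0.0
    for tier in sorted(tiers, key=lambda t: t.cost_per_call):
        answer = tier.call(prompt)
        used = tier.name
        spent += tier.cost_per_call
        if is_substantive(answer):
            break
    return answer, used, spent


tiers = [
    Tier("big", 0.010, lambda p: "42"),
    Tier("small", 0.001, lambda p: "I'm not sure." if "hard" in p else "4"),
]
print(cascade("easy: 2 + 2?", tiers))   # answered by the small tier alone
print(cascade("hard question", tiers))  # escalates to the big tier
```

The trade-off is visible in the return value: an escalated request pays for both calls and runs them sequentially, so the cascade only saves money when most requests stop at the first tier.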

What the 150-case benchmark actually says

The numbers at the top come from a 150-case eval split across MMLU (multi-domain multiple-choice), GSM8K (grade-school math), and TriviaQA (factual recall). The mix matters. The headline ensemble doesn’t dominate everywhere — it wins decisively on MMLU and GSM8K, and like every model tested, struggles on TriviaQA’s obscure factual recall. The best-performing configuration was a confidence-routed 2-model ensemble; 3- and 4-model variants were also tested.

Two findings stand out:

More models is not better. A four-model ensemble underperformed the three-model version. Adding a marginal model adds noise faster than it adds signal once you’ve covered the major axes of variation.

Same-family ensembles do nothing. Three different Qwen variants — different sizes, different fine-tunes, same base — showed no measurable improvement over the single best Qwen. The errors were too correlated. Cross-provider ensembles, by contrast, showed real gains.

This is the part that’s easy to miss. The textbook framing is “combine N independent estimators.” In practice, models from the same training pipeline are anything but independent. They’ve seen the same data, they have the same biases, they fail the same way on the same examples. Ensembling them is averaging noise on top of identical signal.

Cross-provider ensembling — one OpenAI, one Anthropic, one open-weight — produces independent enough errors to actually push accuracy. Cross-architecture ensembling within a single provider works less well. Cross-fine-tune ensembling on the same base barely works at all.
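A toy calculation shows why correlation eats the gain. Assume three models that are each right with probability p, combined by majority vote, and assume (purely for illustration) that with probability `shared` all three succeed or fail together and otherwise err independently. None of these numbers come from the benchmark; the model is deliberately simplistic.

```python
def majority_of_three(p: float) -> float:
    """P(at least 2 of 3 independent voters are right), each right w.p. p."""
    return p**3 + 3 * p**2 * (1 - p)


def ensemble_accuracy(p: float, shared: float) -> float:
    """Toy mixture model of correlated errors.

    With probability `shared` the three models behave as one (accuracy p);
    otherwise their errors are independent and the majority vote decides.
    """
    return shared * p + (1 - shared) * majority_of_three(p)


for shared in (0.0, 0.5, 1.0):
    print(f"shared={shared:.1f} -> {ensemble_accuracy(0.70, shared):.3f}")
# shared=0.0 -> 0.784
# shared=0.5 -> 0.742
# shared=1.0 -> 0.700
```

With fully independent errors, three 70% models vote their way to about 78%. With fully shared errors, the ensemble is exactly as good as one model, which is the same-family result in miniature.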

Different aggregators win different tasks

The other useful observation from the eval is that no single aggregator dominates across task types. Confidence routing won TriviaQA-style retrieval. Embedding consensus won MMLU-style multiple-choice. LLM-as-judge won when the judge model was strong and lost when the judge was a small local one — judge quality is the ceiling.
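One plausible reading of embedding consensus is "return the response that sits closest to all the others". The sketch below implements that reading; it may differ from what emerge does. `toy_embed` is a bag-of-words stand-in so the example runs without an embedding model, and `consensus_pick` is a hypothetical name.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: bag-of-words token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consensus_pick(responses: list[str]) -> str:
    """Return the response with the highest mean similarity to the others."""
    if not responses:
        raise ValueError("need at least one response")
    if len(responses) == 1:
        return responses[0]
    vecs = [toy_embed(r) for r in responses]

    def mean_similarity(i: int) -> float:
        others = [cosine(vecs[i], vecs[j]) for j in range(len(vecs)) if j != i]
        return sum(others) / len(others)

    best = max(range(len(responses)), key=mean_similarity)  # first index wins ties
    return responses[best]


print(consensus_pick(["B", "B", "C"]))  # B
```

Swapping `toy_embed` for a real sentence-embedding call keeps the rest unchanged, since the selection logic only needs pairwise similarities.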

That’s the case for the adaptive strategy: pick the aggregator based on what the responses themselves look like. High agreement across short responses → confidence routing is fine. Low agreement across long responses → embedding consensus or merged. The cost is one inspection pass over the responses before the final aggregation step. In return you stop having to commit to one strategy across heterogeneous workloads.
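The inspection pass might look something like this. It is a sketch, not the tool’s logic: the thresholds, the exact-match agreement measure, and the returned labels are assumptions, and the two cases the text does not cover (high agreement on long responses, low agreement on short ones) fall back to embedding consensus here by my own choice.

```python
from collections import Counter


def agreement(responses: list[str]) -> float:
    """Fraction of responses that match the most common normalized answer."""
    normalized = [" ".join(r.lower().split()) for r in responses]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def choose_aggregator(
    responses: list[str],
    agree_min: float = 0.66,      # hypothetical threshold
    short_max_words: int = 12,    # hypothetical threshold
) -> str:
    """Pick an aggregator label from the shape of the responses themselves."""
    if not responses:
        raise ValueError("need at least one response")
    mean_words = sum(len(r.split()) for r in responses) / len(responses)
    if agreement(responses) >= agree_min and mean_words <= short_max_words:
        return "confidence_routing"
    # Assumption: every other case uses the heavier aggregator.
    return "embedding_consensus"


print(choose_aggregator(["B", "b", "C"]))  # confidence_routing
```

Exact-match agreement is crude for free-form text, where two responses can agree in substance and differ in wording; a similarity-based measure would be the natural upgrade.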

When this earns its keep

Ensemble inference is not a free lunch. It makes sense when:

It’s not worth it when:

Why diversity matters most

What emerge has clarified for me is that diversity is what the whole approach depends on. Not model count. Not model size. Not aggregator sophistication. The ensemble works exactly insofar as the underlying models fail differently on the same inputs.

That has implications beyond the proxy. It says cross-provider redundancy isn’t just a hedge against outages; it’s an accuracy lever. It says that as the major LLM providers converge on similar training data and similar architectures, the ensemble dividend will shrink — which is an argument for keeping at least one differentiated model (a local open-weight one with a different training mix, say) in the mix on principle. It says the LLM stack is starting to look like other systems where diversity-of-substrate is a virtue: compute that spans CPU vendors, storage that spans availability zones, libraries that don’t trust one upstream.

Same shape, recurring. Code is at github.com/jnormore/emerge.