Machine learning community has been stuck on the autoregressive bottleneck for years, but a new paper shows that it’s possible to use diffusion models to work on discrete at scale. The researchers trained two coding focused models named Mercury Coder Mini and Small that completely shatter the current speed and quality tradeoff.
Independent evaluations had the Mini model hitting an absurd throughput of 1109 tokens per second on H100 GPUs while the Small model reaches 737 tokens per second. They are essentially outperforming existing speed optimized frontier models by up to ten times in throughput without sacrificing coding capabilities. On practical benchmarks and human evaluations like Copilot Arena the Mini tied for second place in quality against huge models like GPT-4o while maintaining an average latency of just 25 ms. Their model matched the performance of established speed optimized models like Claude 3.5 Haiku and Gemini 2.0 Flash Lite across multiple programming languages while decoding exponentially faster.
The advantage of diffusion relative to classical autoregressive models stems from its ability to perform parallel generation which greatly improves speed. Standard language models are chained to a sequential decoding process where they must generate an answer exactly one token at a time. Mercury abandons this sequential bottleneck entirely by training a Transformer model to predict multiple tokens in parallel. The model starts with a sequence of pure random noise and applies a denoising process that iteratively refines all tokens simultaneously in a coarse to fine manner until the final text emerges. Because the generation happens in parallel rather than sequentially the algorithm achieves a significantly higher arithmetic intensity that fully saturates modern GPU architectures. The team paired this parallel decoding capability with a custom inference engine featuring dynamic batching and specialized kernels to squeeze out maximum hardware utilization.

This is the official technology community of Lemmy.ml for all news related to creation and use of technology, and to facilitate civil, meaningful discussion around it.
Ask in DM before posting product reviews or ads. All such posts otherwise are subject to removal.
Rules:
1: All Lemmy rules apply
2: Do not post low effort posts
3: NEVER post naziped*gore stuff
4: Always post article URLs or their archived version URLs as sources, NOT screenshots. Help the blind users.
5: personal rants of Big Tech CEOs like Elon Musk are unwelcome (does not include posts about their companies affecting wide range of people)
6: no advertisement posts unless verified as legitimate and non-exploitative/non-consumerist
7: crypto related posts, unless essential, are disallowed
removed by mod
Even if this ends up being a narrow domain speedup, it’s still massive, and coding tasks happen to be one of the big practical applications for LLMs. I can also hybrid approaches going forward, where specialized models end up being invoked based on the task at hand.