Mercury: Ultra-Fast Language Models Based on Diffusion

☆ Yσɠƚԋσʂ ☆ to [email protected]


We present Mercury, a new generation of commercial-scale large language models (LLMs) based on diffusion. These models are parameterized via the Transformer architecture and trained to predict multiple tokens in parallel. In this report, we detail Mercury Coder, our first set of diffusion LLMs designed for coding applications. Currently, Mercury Coder comes in two sizes: Mini and Small. These models set a new state-of-the-art on the speed-quality frontier. Based on independent evaluations conducted by Artificial Analysis, Mercury Coder Mini and Mercury Coder Small achieve state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec, respectively, on NVIDIA H100 GPUs and outperform speed-optimized frontier models by up to 10x on average while maintaining comparable quality. We discuss additional results on a variety of code benchmarks spanning multiple languages and use-cases as well as real-world validation by developers on Copilot Arena, where the model currently ranks second on quality and is the fastest model overall. We also release a public API at https://platform.inceptionlabs.ai/ and free playground at https://chat.inceptionlabs.ai

The machine learning community has been stuck on the autoregressive bottleneck for years, but a new paper shows that diffusion models can work on discrete data at scale. The researchers trained two coding-focused models, Mercury Coder Mini and Mercury Coder Small, that shift the current tradeoff between speed and quality.

Independent evaluations had the Mini model hitting a throughput of 1109 tokens per second on H100 GPUs, while the Small model reaches 737 tokens per second. That puts them ahead of existing speed-optimized frontier models by up to ten times in throughput, with comparable coding quality. On practical benchmarks and human evaluations like Copilot Arena, the Mini tied for second place in quality against huge models like GPT-4o while maintaining an average latency of just 25 ms. The models matched the performance of established speed-optimized models like Claude 3.5 Haiku and Gemini 2.0 Flash Lite across multiple programming languages while decoding far faster.

The advantage of diffusion over classical autoregressive models stems from its ability to generate in parallel, which greatly improves speed. Standard language models are chained to a sequential decoding process in which they must generate an answer exactly one token at a time. Mercury removes this sequential bottleneck by training a Transformer model to predict multiple tokens in parallel. The model starts with a sequence of pure random noise and applies a denoising process that iteratively refines all tokens simultaneously, in a coarse-to-fine manner, until the final text emerges. Because generation happens in parallel rather than sequentially, the algorithm achieves a significantly higher arithmetic intensity and makes much better use of modern GPUs. The team paired this parallel decoding capability with a custom inference engine featuring dynamic batching and specialized kernels to squeeze out maximum hardware utilization.
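To make the difference in control flow concrete, here is a toy sketch in Python. It is not Mercury's algorithm: `toy_denoiser`, `autoregressive_decode`, `parallel_refine`, and the "commit the most confident positions first" schedule are all assumptions made for illustration, and the stand-in denoiser cheats by reading the target text. The only thing it demonstrates is that a sequential decoder needs one model call per token, while a parallel refiner needs one model call per denoising step.

```python
import random


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a trained Transformer denoiser.

    A single call returns a (proposed_token, confidence) pair for EVERY
    position. It cheats by reading the target, so only the control flow
    of the callers is meaningful, not the predictions themselves.
    """
    return [(t, rng.random()) for t in target]


def autoregressive_decode(target):
    """Sequential baseline: one model call per generated token."""
    out, calls = [], 0
    for tok in target:
        calls += 1  # each new token needs its own forward pass
        out.append(tok)
    return "".join(out), calls


def parallel_refine(target, steps, seed=0):
    """Toy coarse-to-fine refinement: one model call per denoising step."""
    rng = random.Random(seed)
    n = len(target)
    vocab = sorted(set(target))
    tokens = [rng.choice(vocab) for _ in range(n)]  # start from random noise
    fixed = [False] * n
    calls = 0
    for step in range(1, steps + 1):
        # One call proposes tokens for all positions at once.
        proposals = toy_denoiser(tokens, target, rng)
        calls += 1
        # Assumed schedule: commit a growing share of positions each step,
        # taking the most confident open positions first.
        quota = (n * step) // steps - sum(fixed)
        open_positions = [i for i in range(n) if not fixed[i]]
        open_positions.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in open_positions[:quota]:
            tokens[i] = proposals[i][0]
            fixed[i] = True
    return "".join(tokens), calls
```

For the 27-character string `"def add(a, b): return a + b"`, `autoregressive_decode` makes 27 model calls, while `parallel_refine(..., steps=4)` makes 4. A real diffusion model's steps are more expensive than this toy suggests, and the number of steps needed for good quality is an empirical question the sketch does not address.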


☆ Yσɠƚԋσʂ ☆

creator

1•19h

Even if this ends up being a narrow domain speedup, it’s still massive, and coding tasks happen to be one of the big practical applications for LLMs. I can also see hybrid approaches going forward, where specialized models end up being invoked based on the task at hand.
