This is Volodymyr, co-founder at Inception---let us know if you have any questions about diffusion, language modeling, and our new Mercury models!
Volodymyr, congrats. This is crazy fast, if not super great at long-context coding tasks. I tagged a few problem responses.
I'm curious about something that has analogues in image diffusion models -- depending on how they are working through their latent space, you can see diffusion models try out a feature in an image and then move on from it as it fits less well with what's around it.
Are there analogues for Mercury? Does it try out a token or set of tokens and then, as other parts of the response fill in, move on from them? Similarly, this architecture seems like it would have real problems inserting a needed token in the middle of a run of relatively high-confidence generated tokens.
Can you give some insight / thoughts from the frontlines on these?
It looks super cool. Any plans to open source something? BTW, looking for an AI solutions/sales engineer :P
Good question! We are not open sourcing the models at launch time, but we have a roadmap of future releases in which we hope to make some of our models accessible to the research community.
Super cool, and I'd love to play around with this if they release an open source version.
Without a full paper, it's a bit hard to understand the full details. Does this essentially replace nucleus sampling with diffusion, or does it change the "core" transformer architecture in a major way?
Yes, we plan to release a tech report soon. We are not open sourcing the models at launch time, but we have a roadmap of future releases in which we hope to make some of our models accessible to the research community.
Probably it's not relevant to you commercially at the moment (or ever?), but would love some intuition on how your models perform on really low end hardware. Does this technique translate into improved CPU-only performance? Also curious about density, does the technique require more/fewer/roughly same parameters as a traditional LLM for the same output quality?
Great question! The model can more efficiently leverage existing GPU hardware---it performs more computation per unit of memory transferred; this means that on older hardware one should be able to get similar inference speeds as one would get on recent hardware with a classical LLM. This is actually interesting commercially, since it opens new ways of reducing AI inference costs.
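For intuition on the "more computation per unit of memory transferred" point, here is a rough back-of-envelope sketch. Every number in it (model size, bandwidth, positions refined per pass, passes to converge) is a hypothetical placeholder rather than a figure from Inception:

```python
# Back-of-envelope: why refining many positions per forward pass helps on
# memory-bandwidth-bound hardware. All numbers are hypothetical placeholders.

PARAMS = 7e9            # model parameters (hypothetical 7B model)
BYTES_PER_PARAM = 2     # fp16/bf16 weights
HBM_BANDWIDTH = 2.0e12  # bytes/s of memory bandwidth (roughly H100-class)

weight_bytes = PARAMS * BYTES_PER_PARAM

# Autoregressive decoding at batch size 1: every generated token re-reads the
# weights once, so the ceiling is roughly bandwidth / weight_bytes tokens/s.
ar_ceiling = HBM_BANDWIDTH / weight_bytes

# Parallel refinement: if one pass updates N positions and the output settles
# after K passes, each weight read is amortized over N / K tokens.
N_POSITIONS = 256       # positions refined per pass (hypothetical)
K_PASSES = 8            # refinement passes until convergence (hypothetical)
diffusion_ceiling = ar_ceiling * N_POSITIONS / K_PASSES

print(f"AR bandwidth ceiling:        ~{ar_ceiling:,.0f} tok/s")
print(f"Parallel-refinement ceiling: ~{diffusion_ceiling:,.0f} tok/s")
```

The same amortization argument is why an older GPU with less bandwidth can close the gap: the bottleneck shifts from moving weights to doing math.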
May I ask whether the training cost of Mercury Coder is higher than that of existing LLMs with similar capabilities?
How does producing tokens in parallel not just result in completely incoherent output?
Assuming the model tracks convergence in one way or another, it would simply continue performing iterations until it has reached an error below an epsilon value.
This means that in the worst case the number of iterations is the same as a classic autoregressive transformer.
So they are mostly taking advantage of the fact that the average response is not, in reality, fully sequential; the model is discovering the exploitable parallelism on its own.
This is not too dissimilar to a branch-and-bound algorithm that has a worse theoretical runtime than a simple brute-force search but in practice solves integer linear programming problems in almost polynomial time, because not everyone is encoding the hardest instances of NP problems as integer linear programs.
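Purely to illustrate that convergence argument (this is not Inception's actual algorithm), a toy "refine the whole draft until it stops changing" loop might look like the following, with `toy_step` standing in for a real denoising model:

```python
import numpy as np

def refine_until_converged(denoise_step, draft, eps=1e-3, max_iters=64):
    """Apply a refinement step to the full draft until it stops changing,
    or until we hit max_iters (the autoregressive-like worst case)."""
    current = draft
    for i in range(1, max_iters + 1):
        updated = denoise_step(current)
        # "Error" here is simply how much the draft moved on this pass.
        if np.abs(updated - current).max() < eps:
            return updated, i
        current = updated
    return current, max_iters

# Toy denoiser: pulls every position halfway toward a fixed target "answer",
# so the whole draft converges in a handful of parallel passes.
target = np.linspace(0.0, 1.0, 128)
def toy_step(x):
    return x + 0.5 * (target - x)

draft = np.random.default_rng(0).normal(size=128)  # random initial draft
result, n_passes = refine_until_converged(toy_step, draft)
print(f"converged after {n_passes} parallel passes for a 128-position draft")
```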
The short answer is that we do more than one parallel pass over multiple tokens: we iteratively refine them over a few passes to fix incoherences. This can be seen as a generalization of diffusion algorithms that underlie systems like Midjourney or Sora.
So if I understand correctly, you remask some tokens that were previously unmasked?
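If it helps to picture that, here is a minimal sketch of confidence-based remasking in the style of masked-diffusion / MaskGIT-like decoders. The "model" is a random stand-in and the remasking schedule is an assumption for illustration, not a description of how Mercury actually works:

```python
import numpy as np

MASK = -1
rng = np.random.default_rng(0)

def stand_in_model(tokens, vocab_size=100):
    """Stand-in for a masked diffusion LM: for every position, return a
    predicted token id and a confidence score in [0, 1]."""
    preds = rng.integers(0, vocab_size, size=len(tokens))
    confs = rng.random(size=len(tokens))
    return preds, confs

def decode(seq_len=32, num_steps=8, vocab_size=100):
    tokens = np.full(seq_len, MASK)
    for step in range(num_steps):
        preds, confs = stand_in_model(tokens, vocab_size)
        # Commit predictions at every position in parallel ...
        tokens = preds.copy()
        # ... then remask the lowest-confidence positions so they can be
        # revised next pass. The remasked fraction shrinks every step, so the
        # final pass leaves everything unmasked.
        n_remask = int((1.0 - (step + 1) / num_steps) * seq_len)
        if n_remask > 0:
            lowest = np.argsort(confs)[:n_remask]
            tokens[lowest] = MASK
    return tokens

print(decode())
```

The key property is the one you are asking about: a position that was filled in on an early pass can be masked again and rewritten later if the surrounding context makes it look unlikely.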
The title says "on commodity GPUs," but the only GPU mentioned (and the only one with benchmarks) is the Nvidia H100 ($30K+ on eBay)?
Do these run on actual commodity GPUs like RTX 3090s and what kind of tokens/sec is expected on those?
Also, there's no paper, no open weights, no code. Just an API?
Companies like Groq and Cerebras already hit these kinds of numbers over a year ago, so I'm not seeing what's HN-worthy here.
That's a good point. In this context, we've been using "commodity GPUs" to refer to standard Nvidia hardware, in contrast to specialized chips like Groq and Cerebras. While these chips also achieve fast speeds, they are not nearly as ubiquitous as Nvidia GPUs. We think that matching their performance on standard Nvidia hardware can make AI much more affordable. We also support any GPU, not just H100s.
We're going to be releasing a tech report soon, stay tuned!
“Commodity” and “consumer” are not the same thing; the H100 is commodity but not consumer, while the RTX 3090 is both consumer and commodity.
An H100 is absolutely not a commodity product. A commodity product is one that is fungible (and that interchangeability is also reflected in second-order effects like price). A 4090 is replaceable with a 7900 XTX; not so with an H100 and an Instinct MI300.
This work is great. May I ask what the maximum token length supported by this model is? During inference, do we process the maximum token length directly? In other words, what is the runtime estimate for different prompt lengths and output lengths?
Just tried out the model in the playground, and it seems pretty fast. If what they claim is true, then this could be concerning for Cerebras.
I don't see why it would be concerning for Cerebras. Intelligence will likely still scale with inference time compute.
Does this mean coding tools making use of MDLMs can more precisely edit code chunks without prompting hacks? It seems like there is absolutely no downside; looking forward to all SOTA models switching to this technique soon!
Awesome stuff. I've read some of the probabilistic ML work from you guys and am stoked to see a competitive alternative to autoregressive decoders. Best of luck!
Volodymyr, can you comment on your organization's approach to safety? It would appear that your model will be a step change in capabilities per FLOP, which could change the scale of applications. Also, since this is a novel architecture, have you studied what new failure modes may exist or how existing alignment techniques may or may not be applicable?
Please make sure aider and llm-cli can use this soon, kthx :-)
Is there a paper / technical report out on this?
https://ml-gsai.github.io/LLaDA-demo/
As far as I understand (based on reading for less than 5 minutes), it is still a transformer model, but instead of producing tokens from left to right, they start predicting tokens at random positions, with the possibility of updating existing tokens.
That isn't too different from, say, good old Stable Diffusion.
That sounds like it would be far less computationally efficient. How can there be an efficiency gain over autoregression?
GPUs compute on two-dimensional arrays, and autoregressive models compute in sequence. Diffusion seems to me to be the more efficient and natural fit for computation on GPUs.
When o3 solved the ARC [1] challenge, Sam Altman said it was recognizing the image pixel by pixel, linearly. I found it pretty hilarious that they start with a chip that computes in two dimensions, use it to compute a line, and that line is itself a two-dimensional image of the ARC challenge.
[1] https://arcprize.org/
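To illustrate the hardware-utilization intuition a couple of comments up (using NumPy on CPU rather than a GPU, with made-up sizes), compare pushing one position per weight pass against pushing many positions per pass through the same weight matrix:

```python
import time
import numpy as np

d_model, n_positions = 4096, 256
W = np.random.randn(d_model, d_model).astype(np.float32)           # one weight matrix
x_one = np.random.randn(d_model, 1).astype(np.float32)             # a single position
x_many = np.random.randn(d_model, n_positions).astype(np.float32)  # many positions

def bench(fn, reps=10):
    fn()  # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

t_seq = n_positions * bench(lambda: W @ x_one)  # 256 separate matrix-vector passes
t_par = bench(lambda: W @ x_many)               # one matrix-matrix pass for all 256
print(f"sequential matrix-vector: {t_seq * 1e3:.1f} ms")
print(f"batched matrix-matrix:    {t_par * 1e3:.1f} ms")
```

The batched version loads the weights once instead of 256 times, which is the same reason refining many positions per pass maps well onto GPUs.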
Not today, but we will be following up with a technical report over the next week or so. In the meantime, you can take a look at some of the research papers that inspired our work:
- https://arxiv.org/abs/2310.16834
- https://arxiv.org/abs/2406.07524
Holy shit this is fast