> DualPipe was created and developed by Jiashi Li and Chengqi Deng and Wenfeng Liang.
A CEO who codes.
When my company was still working closely with CN factories a few years ago (before the bans / clients no longer wanting to work with companies working with China, etc.), the CEOs of the factories we worked with had all been electronics engineers at that company or another before; they could all jump in, debug schematics, solder, and write firmware themselves. And they did. These were places with massive campuses, towering buildings full of robots, and a few (relative to the massive space) employees doing maintenance and prototyping.
It sounds so much more reasonable to have a director who is actually technical, doesn't it? I'm absolutely amazed at how this (in the East) contrasts with the understanding (in the West) that directors need to know finance, strategic planning, and marketing rather than the actual nuance of the work.
To be blunt, this is exactly what is wrong with the “leadership” mindset in the West: decisions are often made without understanding the “nuances”, yet they are confident it will work.
"developed" and "codes" have different meanings.
Yes but in this context, they are very close to each other in meaning.
Besides, Liang does indeed code a significant amount and has contributed to almost all of their published papers.
I attached all 3 algorithms - 1F1B (1 forward, 1 backward), ZB1P (zero-bubble pipeline parallelism) and DualPipe - as a picture here: https://x.com/danielhanchen/status/1894937006352031832 for those interested :)
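Not DeepSeek's actual schedulers, just a rough Python sketch (the stage/microbatch counts are made up for illustration) of a naive GPipe-style schedule - every stage runs all its forward chunks, then all its backward chunks - so you can see where the idle "white squares" (bubbles) come from. 1F1B, ZB1P and DualPipe in the picture reorder and overlap those chunks to shrink that idle fraction.

    # Toy illustration of pipeline "bubbles" (idle slots), not DeepSeek's actual scheduler.
    # Assumes a naive GPipe-style schedule: every stage runs all forward chunks,
    # then all backward chunks, with each chunk taking one time slot.

    def naive_schedule(stages: int, microbatches: int):
        """Return schedule[stage] = list of slots: 'F' (forward), 'B' (backward), '.' (idle)."""
        total_slots = 2 * (microbatches + stages - 1)
        schedule = [["."] * total_slots for _ in range(stages)]
        for s in range(stages):
            for m in range(microbatches):
                # forward of microbatch m reaches stage s after s earlier stages have run it
                schedule[s][s + m] = "F"
                # backwards start once all forwards are done and flow in reverse stage order
                schedule[s][microbatches + (stages - 1) + (stages - 1 - s) + m] = "B"
        return schedule

    def bubble_fraction(schedule):
        slots = sum(len(row) for row in schedule)
        idle = sum(row.count(".") for row in schedule)
        return idle / slots

    if __name__ == "__main__":
        sched = naive_schedule(stages=4, microbatches=6)
        for s, row in enumerate(sched):
            print(f"stage {s}: " + " ".join(row))
        print(f"idle (bubble) fraction: {bubble_fraction(sched):.0%}")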
Maybe add Chimera as well?
https://arxiv.org/pdf/2107.06925
It looks as if Chimera has marginally fewer bubbles than DualPipe?
Oh more nice pictures :)
Off topic, but this is the Rick and Morty episode where Rick creates a perfectly level space.
The symmetry is uuugh.
You'll have to refresh my memory :) Is there like a Youtube clip for it?
https://www.youtube.com/watch?v=-MwCJpEuC44
Sorry, for us utter simpletons, can someone explain what it does?
It makes it so that having more GPUs makes inference run faster. The worst case has been that you can only use the extra memory and gain no speed at all.
In very simple words: it is one way to reduce the white squares in the picture from @danielhanchen[1].
In more complex words: imagine a processor that takes 10 clock cycles to process each instruction, but can also accept a new input on every clock cycle and start processing it in a pipeline. After the first input you have to wait ten clock cycles for the first result, but if you keep feeding the input line every cycle, you also get an output every cycle from then on.
In the case of GPUs, it is not just a matter of a single pipeline, but many in parallel. Depending on your data and algorithm, it can be thousands in parallel.
[1] https://x.com/danielhanchen/status/1894937006352031832
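To make the analogy concrete, here's a tiny, self-contained simulation of that hypothetical 10-cycle pipeline (made-up numbers, not GPU code): the first result appears after 10 cycles, and after that one result comes out every cycle.

    # Toy model of the pipelined-processor analogy above, not DeepSeek's code.
    # A 10-stage pipeline: each input takes 10 cycles end-to-end, but a new input
    # can enter on every cycle, so steady-state throughput is one result per cycle.
    from collections import deque

    PIPELINE_DEPTH = 10  # cycles each input spends in the pipeline

    def simulate(num_inputs: int):
        in_flight = deque()   # (input_id, cycle_when_it_finishes)
        results = []
        cycle, next_input = 0, 0
        while len(results) < num_inputs:
            # retire whatever finishes this cycle
            while in_flight and in_flight[0][1] == cycle:
                results.append((cycle, in_flight.popleft()[0]))
            # feed one new input per cycle while any remain
            if next_input < num_inputs:
                in_flight.append((next_input, cycle + PIPELINE_DEPTH))
                next_input += 1
            cycle += 1
        return results

    if __name__ == "__main__":
        for finish_cycle, input_id in simulate(5):
            print(f"result for input {input_id} ready at cycle {finish_cycle}")
        # -> first result at cycle 10, then one more every cycle: 10, 11, 12, 13, 14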
I hope all the open sourcing DeepSeek is doing encourages American labs to do more of the same. Surely they'll realize their momentum is more of a moat than their tech at any one point in time.
Does this remind anyone else of the Pied Piper compression algorithm?
Middle out or something?
Hmm, wasn't there also supposed to be SM re-allocation? It doesn't look like it was included; I may be mis-remembering the explanation.