Fire-Flyer File System (3FS)

361 points by wenyuanyu 8 months ago

ammo1662 8 months ago

For those who are interested, the design was originally published here:

(Chinese) https://www.high-flyer.cn/blog/3fs/

This file system has been developed and used by them for several years.

Compared to traditional file systems, it is more focused on model training, which involves a lot of random reads. Read cache and prefetching are useless in this case. Therefore, they designed the file system without those features to improve performance.

I google translated some key parts here:

3FS is a special file system because it is almost only used in the scenario of batch reading sample data in the computing node during AI training, and accelerates model training through high-speed computing and storage interaction. This is a large-scale random reading task, and the read data will not be used again in a short time, so we cannot use the most important tool "read cache" to optimize file reading, and even advance reading is useless. Therefore, the implementation of 3FS is also quite different from other file systems.

Specifically, as shown in the figure above, 3FS uses the Linux-based AIO and io_uring interfaces to complete sample reading, because in the 3FS scenario, File Cache has no effect at all, but will consume system memory in a way that is difficult for users to control, affecting the operation of subsequent tasks, so we turned off File Cache and only used Direct I/O mode to read data. But it should be noted that when reading in this way, the buffer pointer, offset and length all need to be aligned. If the user is allowed to do this alignment, additional memory copies will be generated, so we have done the alignment inside the file system, which not only optimizes performance but also facilitates users.
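To make the alignment requirement above concrete, here is a minimal sketch of the arithmetic a Direct I/O layer has to do for the offset and length. The 4096-byte alignment and the function names are assumptions for illustration, not anything taken from the 3FS source. The sketch also leaves out the third requirement, an aligned buffer address, which in practice needs something like a page-aligned allocation.

```python
ALIGN = 4096  # assumed alignment; the real requirement depends on device and filesystem


def aligned_window(offset: int, length: int, align: int = ALIGN) -> tuple[int, int]:
    """Return (aligned_offset, aligned_length) covering [offset, offset + length)."""
    start = (offset // align) * align              # round the start down
    end = -(-(offset + length) // align) * align   # round the end up
    return start, end - start


def slice_request(aligned_buf: bytes, offset: int, length: int, align: int = ALIGN) -> bytes:
    """Cut the bytes the caller asked for out of an aligned read."""
    start, _ = aligned_window(offset, length, align)
    skip = offset - start
    return aligned_buf[skip:skip + length]


# Simulate a "device" with in-memory bytes instead of a real O_DIRECT read.
device = bytes(range(256)) * 64                    # 16384 bytes
start, size = aligned_window(5000, 300)
aligned_read = device[start:start + size]          # stands in for the aligned read
payload = slice_request(aligned_read, 5000, 300)
```

The caller asks for 300 bytes at offset 5000, the layer reads one aligned 4096-byte block and hands back only the requested slice, which is the "alignment inside the file system" idea the translated text describes.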

vlovich123 8 months ago

I hope they chose a multiple of 4096 for the alignment to minimize flash read amplification. QLC drives even use 16 KiB pages.
dekhn 8 months ago

How critical is random reading of training data when assembling batches?
Put another way: in my experience, supporting fast random reads is a challenging problem, while supporting high sequential reads is fairly straightforward. When is random access to a training set absolutely necessary for training a model?
- c4wrd 8 months ago
  
  Imagine you're studying for a test where you are given an image and need to answer the correct class. To prepare, you're given a deck of flashcards with an image on the front and the class on the back.
  (Random) You shuffle the deck every time you go through it. You're forced to learn the images and their classifications without relying on any specific sequence, as the data has no signal from sequence order.
  (Fixed order) Every time you go through the deck, the images appear in the exact same order. Over time you may start to unconsciously memorize the sequence of flashcards, rather than the actual classification of each image.
  When it comes to actually training a model, if the batches are sampled sequentially from a dataset, it risks learning from correlations caused by the sequencing of the data, resulting in poor generalization. In contrast, when you sample the batches randomly, the model is biased and encouraged to learn features from the data itself rather than from any signals that arise from artifacts of the ordering.
  - dekhn 8 months ago
    
    Why, then, are so many successful models trained on multiple passes through sequential data? Note, I'm not naive in this field, but as an infra person, random reads make my life very difficult, and if they're not necessary, I'd rather not deal with them by having to switch to an approach that can't use readahead and other strategies for high-throughput I/O.
    
    Terretta 8 months ago
    
    Why did they take time to invent an entire FS?
    
    dekhn 8 months ago
    
    The engineers at DeepSeek seem extremely smart and well-funded, so my guess is they looked at the alternatives and concluded none of them would allow them to make their models work well enough.
    
    rfoo 8 months ago
    
    They did this back in their trading firm days, and...
    Imagine that you have a sequence of numbers. You want to randomly select a window of, say, 1024 consecutive numbers, a sequence, as input to your model. Now, say, you have n items in this sequence, you want to sample n/c (c is a constant and << 1024) sequences in total. How to do fixed shuffle?
    The key is, we have overlap in data we want to read. If we brute force fixed shuffle and expand, we need to save 1024/c times more than original data.
    This isn't useful for LLMs, but hey, wonder how it started?
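A rough sketch of the trade-off described above, assuming a sequence of n items, a fixed window length, and n/c sampled windows. The function names and the concrete numbers are hypothetical. Sampling window start positions on the fly keeps only the original data on disk, while materializing every sampled window stores roughly window/c times the original size.

```python
import random


def sample_window_starts(n: int, window: int, c: int, seed: int) -> list[int]:
    """Pick n // c random start positions for windows of `window` consecutive items."""
    rng = random.Random(seed)
    return [rng.randrange(0, n - window + 1) for _ in range(n // c)]


def materialized_ratio(n: int, window: int, c: int) -> float:
    """Storage needed to save every sampled window, relative to the original data."""
    return (n // c) * window / n


n, window, c = 1_000_000, 1024, 4        # illustrative numbers only
starts = sample_window_starts(n, window, c, seed=0)
ratio = materialized_ratio(n, window, c)  # equals window / c when c divides n
```

Each start position turns into one random read of `window` items, which is why the access pattern ends up random even though every individual sample is contiguous.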
    
    dekhn 8 months ago
    
    I guess I'm much more of the "materialize the shuffle asynchronously from the training loop" kind of person. I agree, the materialization storage cost is very high, but that's normally been a cost I've been willing to accept.
    As an ML infra guy I have had to debug a lot of failing jobs over the years, and randomizing datapipes are among the hardest to debug. Sometimes there will be a "record-of-death" that randomly gets shuffled into a batch, but only causes problems when it is (extremely rarely) coupled with a few other records.
    I guess I'll just have to update my priors and accept that inline synchronous randomization with random reads is a useful-enough access pattern in HPC that it should be optimized for. Certainly a lot more work and complexity, hence my question of just how necessary it is.
    
    rfoo 8 months ago
    
    Yeah, I don't want to do this either. This is a super special case; after exploring alternatives with our researchers, it's unfortunately needed. As for record-of-death, we made sure that we serialize all RNG state and that our data pipeline is perfectly reproducible even when starting from a checkpoint.
    Building a system for serving read-only data at NVMe SSD speed (as in IOPS) took surprisingly little effort, and is mostly enough for training data. Kudos to DeepSeek, who decided to spend extra effort to build a full PFS and share it.
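One way to get the "serialize all RNG state" property mentioned above, sketched with Python's standard library. The class and method names are hypothetical and this is not how any particular team does it. It only shows that saving the generator state together with the position in the current shuffle lets a restarted job replay the exact same sample order.

```python
import pickle
import random


class ShuffledIndexStream:
    """Yields dataset indices in a fresh random order each epoch."""

    def __init__(self, n: int, seed: int):
        self.n = n
        self.rng = random.Random(seed)
        self.order: list[int] = []
        self.pos = 0

    def next_index(self) -> int:
        if self.pos >= len(self.order):       # start a new epoch with a new shuffle
            self.order = list(range(self.n))
            self.rng.shuffle(self.order)
            self.pos = 0
        index = self.order[self.pos]
        self.pos += 1
        return index

    def state_dict(self) -> dict:
        """Everything needed to resume mid-epoch."""
        return {"rng": self.rng.getstate(), "order": list(self.order), "pos": self.pos}

    def load_state_dict(self, state: dict) -> None:
        self.rng.setstate(state["rng"])
        self.order = list(state["order"])
        self.pos = state["pos"]


original = ShuffledIndexStream(10, seed=0)
consumed = [original.next_index() for _ in range(7)]
checkpoint = pickle.dumps(original.state_dict())     # would be written alongside model weights

resumed = ShuffledIndexStream(10, seed=999)          # the seed is overwritten by the checkpoint
resumed.load_state_dict(pickle.loads(checkpoint))
after_original = [original.next_index() for _ in range(20)]
after_resumed = [resumed.next_index() for _ in range(20)]
```

With this in place, a "record-of-death" batch can be reproduced by reloading the checkpoint taken before the failure and stepping forward.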
- arjvik 8 months ago
  
  On an SSD, random and sequential reads have nearly the exact same performance. Even on large arrays of spinning rust this is essentially true.
  - dekhn 8 months ago
    
    This hasn't been my experience; I see much higher sequential read results compared to random reads on a wide range of storage from low-end home PC SSDs to high end NVME flash storage in large servers.
    It's certainly not true on actual hard drives, and never has been. A seek is around 10ms.
  - bezmiran 8 months ago
    
    By what metric? I think this is close to true for identical blocksizes, but most benchmarks test sequential transfers with large 1M blocks and random ones with small 4K blocks. In this case, the speed of the fastest NVME drives is more than double for sequential transfers than it is for random ones.
    I don't like comparing the two, they're completely different workloads and it's better IMO to look at the IOPS for random transfers, which is where newer, faster SSDs truly excel, and where most people "notice" the performance.
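The block-size point above comes down to throughput = IOPS × block size. A small sketch with made-up drive numbers (not measurements of any real SSD) shows why a 4K random benchmark and a 1M sequential benchmark are hard to compare directly.

```python
def throughput_gb_s(iops: float, block_bytes: int) -> float:
    """Throughput in GB/s implied by an IOPS figure at a given block size."""
    return iops * block_bytes / 1e9


def iops_needed(target_gb_s: float, block_bytes: int) -> float:
    """IOPS required to sustain a throughput at a given block size."""
    return target_gb_s * 1e9 / block_bytes


# Hypothetical drive: 1,000,000 random 4 KiB IOPS and 7 GB/s sequential with 1 MiB blocks.
random_4k_gb_s = throughput_gb_s(1_000_000, 4096)
sequential_1m_iops = iops_needed(7.0, 1024 * 1024)
```

In this made-up case the drive moves a bit over 4 GB/s in the random test while serving a million requests per second, and 7 GB/s in the sequential test while serving only a few thousand, so the two numbers measure different things.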
rvba 8 months ago

Why is that a random read? Also, is it truly random, or from a seed? But if it's a PRNG, then they could cache, right?
- kilburn 8 months ago
  
  Random is PRNG. They still cannot cache though, because they do many reading "passes" through the same data.
  If you build a cache that gets hits on the first pass, then it won't work for the second and later passes.

codingwagie 8 months ago

I think the difference between DeepSeek and OpenAI/Anthropic is the difference between practitioners and academics. Of course there is world-class talent at OpenAI. But there are also a lot of "I went to Harvard and want to work in AI" types, and those people simply don't have the technical exposure to even think of building something like this.

tway223 8 months ago

I would say most if not every large company in China has its own AI infra stack, partially because tech talent is relatively more abundant and partially because some of the tech leads have been exposed to western tech via open source and work experience, so they have a good success rate (which makes it a more common practice). Anecdotally, Google and FB ex-employees from overseas offices, and MSFT and Intel ex-employees from their China offices, could be the key elements of this trend over the past two decades (Google left China around 2010).
The infra work is usually technically tedious so I think it may become some lost art in the west just like those manufacturing jobs.
- smallmancontrov 8 months ago
  
  As opposed to the US, where every large company has its own AI infra stack, often extending down to the silicon and up to large open source projects?
  What's going on here, why are people forgetting what's around them? Does familiarity breed contempt? Are attention spans so shot that failure to participate in this week's news cycle is enough for "out of sight, out of mind"? Or is HN full of Chinese bots now?
  - tway223 8 months ago
    
    That was to answer the previous question. Also the point is why Chinese companies can produce infra work in a cheap and fast way. With regard to US companies, I don’t see that is possible with MSFT, AMZN, AAPL, and likely GOOG as well. (Don’t get me wrong, they all have solid infra, probably except Apple)
sureglymop 8 months ago

I think it would be a bit irrational to claim that so broadly. I've met some incredibly talented people in academia and I've also met people that made me question how they even pass.
My hypothesis is that there is not such a big difference at all. All three of the companies you mentioned are world class competitors in this. DeepSeek were the last to have a "hit" but that isn't an indication that they'll be the next of the three (or other yet unknown entities) to have the next hit. We try to predict what happens next now but perhaps we should rather focus on who or what we want to succeed. For me it's quite clear: it should be open source or I'm long term not that interested.
mustpax 8 months ago

Someone should write a blog post about the prestige/effectiveness negative feedback loop. This is also the Achilles heel of top tier SV VCs including YC.
- bugglebeetle 8 months ago
  
  The problem isn’t the prestige it’s that prestigious institutions in America don’t produce high-quality talent. They’re instead mostly corrupt credentialing mills for the rich and well-connected. From what I understand, DeepSeek also only hires from the best universities in China, but “best” actually means something relative to how difficult entrance to those organizations is to achieve and their coursework.
  - InkCanon 8 months ago
    
    I read this too but there was no source on this. The founder Liang Wenfeng himself comes from Zhejiang university. Its admissions rate is 20%, which is much higher than traditional US "elite" schools. Wenfeng has said this about hiring though:
    "If you are pursuing short-term goals, it is right to find people with ready experience. But if you look at the long-term, experience is not that important. Basic skills, creativity, and passion are much more important.”
    
    ycui1986 8 months ago
    
    The Chinese college application process works very differently from the American one. The admission rate is meaningless. Zhejiang University is a state-designated 985 university (there are in total 9 of them). Believe me, any student at an elite high school in China would be very happy to be accepted into Zhejiang University. Most of them unfortunately don't have the score to even think of applying. Technically it is not even applying: students take the once-a-year exam, and if they don't score within the top 500 in their province, they shouldn't even think about trying for Zhejiang University.
- djtango 8 months ago
  
  You mean this one? https://news.ycombinator.com/item?id=9125816
- dvaun 8 months ago
  
  Can you expand on this?
robotnikman 8 months ago

Makes me wonder where is the best place to learn how to put together and operate something like this then? Certainly there should be resources out there somewhere to teach yourself?
cma 8 months ago

Weren't the flash attention authors not just from academia but in academia at the time?

thohj4234234324 8 months ago

This is very humbling.

OpenAI et al. have also been very deep down the systems rabbit hole (e.g. Triton), but I can't think of anyone else (outside of Google/Facebook) who pays this much attention to things.

Great work; hope Deepseek does even more awesome things going forward.

richardw 8 months ago

I’ve assumed that it’s partly because the company has done a lot of HFT, which is very focused on performance. But I’m not an expert in either.
- WiSaGaN 8 months ago
  
  Indeed, the blog mentioned in the other comment showed part of 3FS code was completed at least since 2019, when this was still a project of the quant funds. In HFT, you tend to dogfood a lot of the things to achieve low latency, high performance, sometimes just because HFT system just need to do one specific thing, and those off the shelf stuff usually cater for a lot wider scenarios where HFT doesn't really care about. Here you see similar case which they focus specifically on loading large amount of data during training, and implement that to the extreme.

tetron 8 months ago

Was curious how they get such performance with a FUSE based design. It seems that they sort of cheat, FUSE is used to manage metadata but to get high performance you have to link in the C++ client library and do all your reads and writes through that. So it isn't general purpose, you have to modify your application to take advantage of it. Still, that's a clever trick, and makes me wonder if there's a LD_PRELOAD strategy that could generalize.

grohan 8 months ago

They appear to have Python bindings which seems reasonable from an API / usability perspective? https://github.com/deepseek-ai/smallpond
In terms of fast FUSE - also my first question, appears to be `io_uring` + FUSE :)
https://github.com/deepseek-ai/3FS/blob/main/src/lib/api/Usr...
amelius 8 months ago

Why is FUSE that much slower than providing your own read/write functions? I get that it has to go through the kernel, but the operations are on entire blocks and network should be the bottleneck by far (and disk/main memory should be a bottleneck if the data is local).
- vlovich123 8 months ago
  
  You have to bounce through the kernel back out to user space. The number of syscalls is quite high. In many cases this is mitigated somewhat by the page cache making reads cheaper, but that’s explicitly an anti design here.
  I believe there’s work to minimize this using io_uring so that you can talk to the fuse driver without the kernel being in the middle, but that work isn’t ready last time I checked.
  For what it’s worth at Palm we had a similar problem because our applications were stored compressed but exposed through fuse uncompressed, instead of O_DIRECT I just did an fadvise to dump the cache after a read. Not as high throughput but the least risky change to get the same effect.
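The fadvise approach mentioned above can be sketched with Python's `os` module on Linux or another POSIX system. The function name is hypothetical, and `POSIX_FADV_DONTNEED` is only a hint: the kernel may keep pages around, and it generally only drops clean pages.

```python
import os


def read_and_drop_cache(path: str, offset: int, length: int) -> bytes:
    """Do a normal buffered read, then hint that the cached pages can be discarded."""
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.pread(fd, length, offset)
        # posix_fadvise is not available on every platform, so guard the call.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
        return data
    finally:
        os.close(fd)
```

Compared with `O_DIRECT`, this keeps the ordinary read path (no alignment constraints) and just stops one-shot data from piling up in the page cache, at the cost of the extra copy through the cache.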
  - sweettea 8 months ago
    
    Fuse over io_uring has just been merged: https://www.phoronix.com/news/Linux-6.14-FUSE
    So has uncached buffered IO: https://www.phoronix.com/news/Uncached-Buffered-IO-Linux-6.1...
    6.14 is an exciting kernel!

pella 8 months ago

related research paper (english - html ) - https://arxiv.org/html/2408.14158v2

arXiv:2408.14158v2 [cs.DC] 31 Aug 2024

"Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning"

Abstract:

"The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demands of computational power and bandwidth. This, combined with the high costs of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieved performance approximating the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC."

hintymad 8 months ago

A distributed file system is known as one of the trickiest pieces of software to write, and we are usually advised not to write a file system from scratch (even on top of FUSE), let alone a highly optimized one. When a Silicon Valley company is having its 100th meeting to align god-knows-what, a team of fewer than 60 has already come up with a production-grade, highly efficient parallel file system.

Have we in the valley companies lost touch?

htrp 8 months ago

> team of fewer than 10
the highflyer team are pretty well resourced.... think they have more than 10 people
- hintymad 8 months ago
  
  Thanks! Updated to 60 per their author list in their paper.
  - ssivark 8 months ago
    
    Why do you assume everyone in the company (including folks working on infra) are authors on the paper? That’s possible, of course, but isn’t it unlikely?
    
    hintymad 8 months ago
    
    That's the only information I have. That said, High-Flyer had about 160 people, total, in 2021. Given that 3FS was in production in 2019, 60 people is a generous estimate.
ycui1986 8 months ago

yes

jauntywundrkind 8 months ago

Man, 6.6TB/s across 180 nodes is roughly 300Gbps/node, or about 37GBps.

That's with 14 unnamed SSDs per node. I wonder how this would scale to higher-end SSDs, going from PCIe 4 to PCIe 5 or PCIe 6... Particularly whether one could scale down!
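A back-of-the-envelope check of the per-node numbers, treating 6.6 TB/s as decimal terabytes and taking the node and SSD counts from the comment above as given. The function name is made up for illustration.

```python
def per_node_rates(total_tb_per_s: float, nodes: int, ssds_per_node: int) -> dict[str, float]:
    """Split an aggregate read rate evenly across nodes and SSDs (decimal units)."""
    per_node_gb = total_tb_per_s * 1000 / nodes
    return {
        "GB/s per node": per_node_gb,
        "Gbit/s per node": per_node_gb * 8,
        "GB/s per SSD": per_node_gb / ssds_per_node,
    }


rates = per_node_rates(6.6, 180, 14)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

If the headline figure is read as TiB/s instead, every result is about 10% higher, so these are order-of-magnitude numbers rather than exact ones.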

bee_rider 8 months ago

They sure are productive.

What are we going to see tomorrow? DeepSeek OS or something?

vitaflo 8 months ago

To be fair they’ve been working on this since 2019 for HFT. So it’s not like they just whipped this up.
- bee_rider 8 months ago
  
  The amount of brainpower wasted on HFT games that will never see the light of day is kind of a bummer. Congrats to China.
logicallee 8 months ago

>They sure are productive.
I have a theory as to why...
- tuyguntn 8 months ago
  
  enlighten us
  - logicallee 8 months ago
    
    my theory is that their own DeepSeek writes the code for them, so they are highly productive.
    
    bee_rider 8 months ago
    
    I’d expect them to highlight that, rather than keep it secret. I wouldn’t be surprised if they used it a little bit, but probably not to an extent that is really unique or unusual.
    
    menaerus 8 months ago
    
    That would be terrifying in itself if true because for this type of work you really need the best of the best. But I doubt this is the case here. LLMs as we know them today are not quite yet there for this type of work.
    
    genewitch 8 months ago
    
    Do you, though? Need the best of the best?
    
    bee_rider 8 months ago
    
    As someone who did some simulation focused engineering grad school stuff; there is a tendency for some of the best to go become quants. Does the field need it? I don’t know. But for whatever reason the draw of the “print money using math tricks” seems to attract some hardcore folks, haha.
    It is really frustrating to see good engineers go to play trading games. We should study how exactly it is China managed to unlock this capacity.
    
    htrp 8 months ago
    
    Government effectively banned unproductive tech (adtech, fintech etc) and told ppl to go do stuff like robotics and AI
    
    bee_rider 8 months ago
    
    I’m beginning to suspect this invisible hand isn’t as clever as it was made out to be.
    
    genewitch 8 months ago
    
    |O|I|L|!|o o|W|A|R|!|
    
    menaerus 8 months ago
    
    I think so, yes. There are very few engineers who can pull off this type of work, IME. In a pool of ~30M SEs around the globe I'd say there are no more than ~30K such engineers. That is 0.1%, and it's a very optimistic number I'd say.
    Why do you think this would be controversial? This isn't an every day work.
    
    reissbaker 8 months ago
    
    They wrote this in 2019, well before any useful codegen LLMs existed.
    https://www.high-flyer.cn/blog/3fs/
    
    codydkdc 8 months ago
    
    lol
- digdugdirk 8 months ago
  
  996 work culture?
  - Forbo 8 months ago
    
    That's been illegal for three and a half years?
    
    rfoo 8 months ago
    
    That's been illegal since May 1995 (before that China had six working days week).
    Does it really matter whether it's illegal or not, if there is no enforcement? Pinduoduo (in other name, Temu) has been doing 70 hours week since they started. Yes, they are still doing it right now.
  - deep3000 8 months ago
    
    Sergey Brin: a 60-hour workweek is the “sweet spot” for productivity in a recent memo sent to employees.

yalogin 8 months ago

It’s not clear to me where and how the current popular systems fall short. Do they talk about it anywhere?

Also, what specifically are the data access patterns for training and inference that are different from traditional use cases?

jpgvm 8 months ago

Well current popular systems are pretty much limited to Lustre and the new kid Weka, mostly Lustre though tbh.
You can try to use "standard" options like MinIO/Ceph(RADOS)/SeaweedFS but you will very quickly learn those systems aren't remotely fast enough for these usecases.
AI training is what this is used for, not inference (which has absolutely no need for any filesystem at all). What makes the workload somewhat special is that it's entirely random read and not cacheable at all as most reads are one and done.
Would Lustre be perfectly fine at 6TiB/s? Yes. Is it a huge pain in the ass to operate and make remotely highly available? Also yes. If this thing is capable of the throughput but easier to operate and generally more modern and less baroque, it's probably an improvement. TLDR is Lustre is fast but that is literally its only redeeming quality. I have lost far too many hours of my life to the Lustre gods.
- rfoo 8 months ago
  
  > What makes the workload somewhat special is
  I'll add that latency also doesn't matter that much. You are doing batched data loading for batch n+1 on CPU when GPUs are churning batch n-1 and copying batch n from host memory at the same time.
  So as long as your "load next batch" doesn't run for like >1s it would be fine. But one single "load next batch" on one worker means thousands (if not more) of random reads.
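The overlap described above (loading upcoming batches while the current one is being processed) can be sketched with a background thread and a bounded queue. This is a generic prefetcher with hypothetical names, not the loader any particular framework uses, and `load_batch` stands in for whatever issues the random reads.

```python
import queue
import threading
from typing import Callable, Iterator


def prefetching_batches(load_batch: Callable[[int], object],
                        num_batches: int,
                        depth: int = 2) -> Iterator[object]:
    """Yield batches in order while a worker thread loads up to `depth` batches ahead."""
    buffer: queue.Queue = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def worker() -> None:
        try:
            for i in range(num_batches):
                buffer.put(load_batch(i))   # blocks when the consumer falls behind
        finally:
            buffer.put(done)                # always unblock the consumer

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


batches = list(prefetching_batches(lambda i: [i] * 3, num_batches=4))
```

As long as loading one batch takes less time than training on one, the consumer never waits, which is why per-read latency matters less here than aggregate throughput. Note that this sketch ends the stream silently if `load_batch` raises; a real loader would propagate the error.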
- cyanf 8 months ago
  
  They’re using the FS for caching the KV caches of past requests. It’s why they’re able to charge so little on prompt cache hit.
  - jpgvm 8 months ago
    
    Ahh I missed that. Yes, prefix caching and RAG are 2 cases where you will want something like this during inference time.

budududuroiu 8 months ago

Does anyone know if there’s a benefit to porting this to an orchestrator like K8s, maybe overkill for training but the KVCache might be useful when having multiple replicas for inference?

do_not_redeem 8 months ago

Can someone convince me this isn't NIH syndrome? Why would you use this instead of SeaweedFS, Ceph, or MinIO?

mgerdts 8 months ago

> The final aggregate read throughput reached approximately 6.6 TiB/s with background traffic from training jobs.
The Ceph team has been working on Crimson for years to get past performance bottlenecks inherent to the HDD-based design. I’m having trouble finding any Ceph benchmark results that show anything close to 100 GB/s.
- pat2man 8 months ago
  
  Seems easy to find: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/
  - menaerus 8 months ago
    
    3FS: 180 nodes, 2x200Gbps InfiniBand and 16x 14TiB NVMe SSDs per node, ~500 clients, 6.6 TiB/s of read throughput with training jobs workload
    Ceph: 68 nodes, 2x100Gbps Mellanox and 10x 14TiB NVMe SSDs per node, 504 clients, 1TiB/s of FIO random read workload
    
    ibotty 8 months ago
    
    The comparison is a little pears to apple. Similar nutritions but different enough to not draw conclusions. The hardware in the Ceph test is only capable of max 1.7TiB/s traffic (optimally without any overhead whatsoever).
    I also assume that the batch size (block size) is different enough that this alone would make a big difference.
    
    menaerus 8 months ago
    
    Even if we take different hardware into account we can readjust for measured vs theoretical throughput.
    Ceph cluster achieves 1 TiB/s / 1.7 TiB/s ≈ 59% of theoretical throughput.
    3FS cluster achieves 6.6 TiB/s / 9 TiB/s ≈ 73% of theoretical throughput.
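For reference, the theoretical figures used in this comparison appear to come from summing the NIC line rates in the two setups listed earlier in the thread. A small sketch of that arithmetic follows; it uses decimal units and loosely mixes TB and TiB the same way the thread does, so the percentages are approximate.

```python
def line_rate_tb_per_s(nodes: int, nics_per_node: int, gbit_per_nic: int) -> float:
    """Aggregate network line rate in TB/s, ignoring all protocol overhead."""
    return nodes * nics_per_node * gbit_per_nic / 8 / 1000


ceph_peak = line_rate_tb_per_s(nodes=68, nics_per_node=2, gbit_per_nic=100)
threefs_peak = line_rate_tb_per_s(nodes=180, nics_per_node=2, gbit_per_nic=200)

ceph_efficiency = 1.0 / ceph_peak        # measured 1 TiB/s in the Ceph blog post
threefs_efficiency = 6.6 / threefs_peak  # measured 6.6 TiB/s in the 3FS readme

print(f"Ceph: {ceph_peak:.1f} TB/s peak, {ceph_efficiency:.0%} achieved")
print(f"3FS:  {threefs_peak:.1f} TB/s peak, {threefs_efficiency:.0%} achieved")
```

This only normalizes for network capacity. It says nothing about the difference in workload (FIO random reads versus training jobs) or block size raised elsewhere in the thread.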
    
    ibotty 8 months ago
    
    That difference is still pronounced, yes. But the workload is so different. Training AI is hardly random read. Still not a comparison which should lead you to any conclusions.
- nivertech 8 months ago
  
  I'd argue that they don't need a filesystem or an object storage, they need a purpose-built data serving layer optimized for their usecase.
ein0p 8 months ago

Seems like Ceph is considerably lower in throughput: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/ A serious concern when saving hundreds of terabytes of weights and optimizer states every now and again, or loading large precomputed prefix KV-caches. Minio seems to be slower still. IDK about SeaweedFS - they don't mention performance in their selling points at all.
- ibotty 8 months ago
  
  Look at the hardware first:
  The hardware in the Ceph test is only capable of max 1.7TiB/s traffic (optimally without any overhead whatsoever).
  I also assume that the batch size (block size) is different enough that this alone would make a big difference.
- do_not_redeem 8 months ago
  
  It's quite funny that I got two opposite answers right away: you say it's to improve throughput, and sibling says it's to improve latency, and as we know throughput and latency trade off against each other. I'm inclined to agree it's more likely they're prioritizing throughput, since their readme charts throughput but not latency. But OTOH, the project looks like it requires RDMA. I wonder if the authors have written about their motivations and the tradeoffs they made, so we don't have to speculate.
  EDIT: Their blog post answered all my questions and more. https://www.high-flyer.cn/blog/3fs/
  - ein0p 8 months ago
    
    Because the two are interconnected and aren't in conflict with each other. You not only want high throughput - that by itself would be quite limiting. You want it along with low latency as well, or else it's very easy to end up in the situation where your throughput is effectively zero if the access pattern is "bad".
jpgvm 8 months ago

None of those are close to fast enough.
The only competitors in the parallel FS space that are useful for this are Lustre and Weka.
Otherwise if you don't need a single namespace a bunch of fat AF NFSv4 servers w/NFS over RDMA will also get you to 6TiB/s.
The "surefire" way though is still Lustre, it's the big daddy of distributed parallel filesystems still but it's an absolute beast to setup and operate.
startupsfail 8 months ago

It’s not. When you are a high frequency trader and you’ve mastered RDMA, everything around you looks slow. You are thinking in terms of 20 nanoseconds intervals, while everyone around still thinks that serving a query under a millisecond is fast.
- rfoo 8 months ago
  
  Huh? What kind of RDMA has a completion latency of 20 nanoseconds? It's more like 5 microseconds.
  I agree that a lot of "modern" storage stack is way too slow though, tried to find a replication-first object storage for crazy-fast random read in small number of objects last year and found none.
  - tucnak 8 months ago
    
    Completion latency is one thing, bandwidth would be another. There's apparently a whole world of Alveo SmartNIC's and related FPGA platforms, and it can totally get in nanosecond range for whatever nails that may fit the compute-in-network hammer, even if bound by latency of the consuming system / RDMA interface. Also: https://github.com/corundum/corundum is really popular with the Chinese!
  - startupsfail 8 months ago
    
    I was talking about, thinking in terms of 20 nanoseconds intervals, rather than completing a request in 20 nanoseconds. To get 1 microsecond wire-to-wire latency you do need to count your nanoseconds.
    Why this number - this is because it’s roughly the time it takes to read 64 bytes from L3 cache. And NICs tend to be able to push data into L3 (or equivalents).
    Current state of the art - look up nanoPU, from Stanford. Wire-to-wire under 100ns is not impossible, but this would normally assume pre-cooked packet, selected from a number of packets (which is not an unusual scenario in HFT).
    
    rfoo 8 months ago
    
    Ah, makes sense. Sadly RDMA isn't that fast for now, or at least commercial RNICs/switches aren't :( Once you leave your host in a data center network, everything counts in microseconds.
doomleika 8 months ago

Software tech in China is a different landscape. It's really common to reinvent the wheel in China. Almost every big name (Bytedance, Meituan, etc.) has its own version of everything, for reasons of both office politics and in-house need.
The thing is, this stuff is so prevalent that the in-house tech has reached the point where it is competitive. This goes double for a quant firm like DeepSeek.
cttet 8 months ago

If NIH syndrome boosts morale of the team, it should be helpful on overall team progress though.
whalesalad 8 months ago

Sometimes you must succumb to NIH. How do you think all those tools you mentioned got produced?

whalesalad 8 months ago

The throughput on those charts is pretty wild - multiple terabytes per second.

jeffbee 8 months ago

Interesting that their GraySort result is CPU bound while they are using 3x more CPUs than the record holder from ten years ago.

sitkack 8 months ago

How can you determine that it is CPU bound from the attached charts?
- jeffbee 8 months ago
  
  Because it hits a read peak in the first wave and never hits it again.
  - sitkack 8 months ago
    
    Could be many other reasons, giving it more CPU won't necessarily increase the read rate.

brcmthrowaway 8 months ago

What does Anthropic use?

rvz 8 months ago

Once again, DeepSeek continues with another home run.

Can't wait to see what they release next. DeepSeek should be studied carefully.

WithinReason 8 months ago

Why is this even necessary? Can you just shard your training set to the training nodes ahead of time instead?

wenyuanyu 8 months ago

No, besides accessing training data, there is also logging and checkpointing... When you run k8s over it, and there are multiple training jobs... isolated local storage is a nightmare...

pepsi-not-coke 8 months ago

I love it. AWS EFS costs too much. The open source solutions are clunky. I'm hoping DS applied their ingenuity to this one, too. Can't wait to trial it.