DL in NLP@dlinnlp P.1781

Soumith Chintala (creator of PyTorch) lays out the fundamentals of training on 10K GPUs
x.com/soumithchintala/status/1841498799652708712

Very short TL;DR (I recommend everyone read the original, it's not long)

1. Maximize batch size and GPU utilization: 3D parallelism + gradient checkpointing
1. Overlap communication with computation, e.g. while layer N-1 is computing its backward pass, all GPUs holding layer N can all-reduce that layer's gradients
1. Optimize for your GPU cluster network topology

1. Plan for failure recovery: at 10K GPU scale, things fail all the time (GPUs, NICs, cables, etc.)
1. At 10K scale, bit flips actually become a problem and can cause loss explosions. Save your model state as frequently and as quickly as you can. To speed it up, save it in shards, to CPU memory first, and then write it to disk in a separate thread
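
The last point (shard the state, copy it to CPU memory first, write to disk in a separate thread) can be sketched roughly as below. This is a minimal illustration, not the method from the original thread: it uses only the Python standard library, a plain dict of lists stands in for the model state, and `copy.deepcopy` stands in for a GPU-to-CPU copy. The function names, the shard file naming, and `num_shards` are all assumptions made for the example.

```python
import copy
import os
import pickle
import tempfile
import threading


def snapshot_shards(state, num_shards):
    """Copy the state into host memory and split its keys into shards.

    deepcopy is a stand-in for a device-to-host copy: once it returns,
    training may mutate `state` without affecting the snapshot.
    """
    keys = sorted(state)
    return [
        {k: copy.deepcopy(state[k]) for k in keys[i::num_shards]}
        for i in range(num_shards)
    ]


def write_shards(shards, directory, step):
    """Write each shard to its own file, renaming at the end so a reader
    never sees a half-written shard."""
    for i, shard in enumerate(shards):
        final_path = os.path.join(directory, f"step{step}_shard{i}.pkl")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(shard, f)
        os.replace(tmp_path, final_path)


def save_async(state, directory, step, num_shards=2):
    """Take the in-memory snapshot synchronously, then write it to disk
    in a background thread. Returns the thread so callers can join it."""
    shards = snapshot_shards(state, num_shards)
    writer = threading.Thread(target=write_shards, args=(shards, directory, step))
    writer.start()
    return writer


def load_checkpoint(directory, step, num_shards=2):
    """Merge all shards of one step back into a single state dict."""
    state = {}
    for i in range(num_shards):
        path = os.path.join(directory, f"step{step}_shard{i}.pkl")
        with open(path, "rb") as f:
            state.update(pickle.load(f))
    return state


if __name__ == "__main__":
    state = {"w1": [1.0, 2.0], "w2": [3.0], "step": 7}
    with tempfile.TemporaryDirectory() as ckpt_dir:
        writer = save_async(state, ckpt_dir, step=7)
        state["w1"][0] = 999.0  # training continues while the write is in flight
        writer.join()
        print(load_checkpoint(ckpt_dir, step=7))
```

Only the in-memory snapshot blocks the training loop. The slow disk write happens in the background, and the checkpoint reflects the state at the moment `save_async` was called, even if training has already modified it.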

www.tgoop.com/dlinnlp/1781

12.8K views · Vlad Lialin, edited Oct 2, 2024 at 16:46

Create: 2024-10-02
Last Update: 2025-07-12 21:37:05
