Tom Brown · Jul 30
(1/4) Learning ML engineering is a long slog, even for legendary hackers. IMO, the two hardest parts of ML eng are:
1) Feedback loops are measured in minutes or days in ML (compared to seconds in normal eng)
2) Errors are often silent in ML
Tom Brown · Jul 30
(2/4) Most ML people deal with silent errors and slow feedback loops via the "ratchet" approach:
1) Start with a known-working model
2) Record learning curves on a small task (~1 min to train)
3) Make a tiny code change
4) Inspect the curves
5) Run full training after ~5 tiny changes
Tom Brown · Jul 30
Replying to @josh_tobin_
(3/4) The downside of the ratchet approach is that some designs can't be reached via small incremental changes. It's also hard to know *which* tiny code changes to make. This is where understanding under/overfitting, regularization, etc. is useful. See @josh_tobin_'s talk:
Tom Brown · Jul 30
(4/4) Within the ratchet approach, I want more tools and best practices for making feedback loops shorter and errors louder. Below is a short list of development speed hacks I've found useful.
Tom Brown · Jul 30
ML dev speed hack #0 - Overfit a single batch
- Before doing anything else, verify that your model can memorize the labels for a single batch and quickly bring the loss to zero
- This is fast to run, and if the model can't do it, then you know the model is broken
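A minimal sketch of this check in PyTorch, using a toy over-parameterized MLP and one fixed random batch as stand-ins for your real model and data:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: an over-parameterized MLP and a single fixed batch.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 128), torch.nn.ReLU(), torch.nn.Linear(128, 2)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
x = torch.randn(32, 10)           # one batch, reused every step
y = torch.randint(0, 2, (32,))    # fixed labels to memorize

for step in range(500):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

# A healthy model memorizes 32 examples easily; if not, something is broken.
assert loss.item() < 0.01, f"failed to overfit one batch: loss={loss.item():.3f}"
```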
Tom Brown · Jul 30
ML dev speed hack #1 - PyTorch over TF
- Time to first step is faster b/c there's no static graph compilation
- Easier to get loud errors via assertions within the code
- Easier to drop into the debugger and inspect tensors
(TF 2.0 may solve some of these problems but is still raw)
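A small illustration of the eager-execution point (shapes here are arbitrary): every line runs immediately, so you can print, assert, or break on any intermediate tensor.

```python
import torch

x = torch.randn(4, 8)
w = torch.randn(8, 3, requires_grad=True)

h = x @ w               # executes immediately; no session or compiled graph
print(h.shape)          # inspect any intermediate value inline
assert not torch.isnan(h).any(), "NaNs in h"  # loud error at the exact line

h.sum().backward()      # gradients are available right away
print(w.grad.norm().item())
```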
Tom Brown · Jul 30
ML dev speed hack #2 - Assert tensor shapes
- Wrong shapes due to silent broadcasting or reduction are an extreme hot spot for silent errors; asserting on shapes (in torch or TF) makes them loud
- If you're ever tempted to write shapes in a comment, write an assert instead
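A sketch of what that looks like in PyTorch (the dimension names are illustrative):

```python
import torch

batch, seq_len, d_model = 32, 128, 512
x = torch.randn(batch, seq_len, d_model)
mask = (torch.rand(batch, seq_len) > 0.5).float()

# Instead of "# x: [batch, seq_len, d_model]" as a comment, assert it:
assert x.shape == (batch, seq_len, d_model), x.shape
assert mask.shape == (batch, seq_len), mask.shape

# Make the broadcast explicit ([B, S] -> [B, S, 1]) and check the result,
# so a silently mis-broadcast multiply fails loudly here.
masked = x * mask.unsqueeze(-1)
assert masked.shape == x.shape, masked.shape
```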
Tom Brown · Jul 30
ML dev speed hack #3 - Add an ML test to CI
- If more than one entrypoint or more than one person is working on the codebase, add a test that runs training for N steps and then checks the loss
- If you have only one person and one entrypoint, then an ML test in CI is probably overkill
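One way such a test might look: a pytest-style sketch on a toy linear-regression task, with the `train_steps` helper and the 50% threshold made up for illustration.

```python
import torch
import torch.nn.functional as F

def train_steps(model, opt, batch, n_steps):
    """Run n_steps of training on one batch; return the per-step losses."""
    x, y = batch
    losses = []
    for _ in range(n_steps):
        opt.zero_grad()
        loss = F.mse_loss(model(x), y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

def test_loss_decreases():
    torch.manual_seed(0)  # deterministic so CI doesn't flake
    x = torch.randn(64, 10)
    y = x @ torch.randn(10, 1)        # learnable synthetic targets
    model = torch.nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    losses = train_steps(model, opt, (x, y), n_steps=50)
    # Loss should have dropped substantially, not just jittered.
    assert losses[-1] < 0.5 * losses[0], (losses[0], losses[-1])
```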
Tom Brown · Jul 30
ML dev speed hack #4 - Use ipdb.set_trace()
- It's hard to make an ML job take less than 10 seconds to start, which is too slow to maintain flow
- The ipdb workflow lets you zero in on a bug and play with tensors in a fast feedback loop
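A sketch of the workflow (the `loss_fn` below is a made-up example): drop a breakpoint right where the suspicious value appears, then poke at the live tensors interactively instead of relaunching the job for every print statement.

```python
import torch
import torch.nn.functional as F
import ipdb  # pip install ipdb

def loss_fn(logits, labels):
    if torch.isnan(logits).any():
        # Drops into an interactive shell right here; inspect logits,
        # labels, or anything else in scope, then `c` to continue.
        ipdb.set_trace()
    return F.cross_entropy(logits, labels)
```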
Tom Brown · Jul 30
ML dev speed hack #5 - Use nvvp to debug throughput
- ML throughput (step time) is one place where we have the tools to make errors loud and feedback fast
- You can use torch.cuda.nvtx.range_push to annotate the nvvp timeline and make it more readable
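A sketch of the annotation (`model` and `loader` are placeholders for your own setup): wrap each phase of the step in an NVTX range so the profiler timeline shows named spans instead of anonymous kernels.

```python
import torch

# Assumes a CUDA model and a data loader are already set up.
for step, (x, y) in enumerate(loader):
    torch.cuda.nvtx.range_push(f"step_{step}")

    torch.cuda.nvtx.range_push("forward")
    loss = torch.nn.functional.cross_entropy(model(x.cuda()), y.cuda())
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_pop()
```

Run the script under the profiler (e.g. `nvprof -o profile.nvvp python train.py`) and open the output in nvvp; the pushed ranges appear as labeled blocks on the timeline.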
Tom Brown · Jul 30
Curious what other folks recommend for speeding up ML development feedback loops and for making errors louder.