@nottombrown
(1/4) Learning ML engineering is a long slog even for legendary hackers like @gdb.
IMO, the two hardest parts of ML eng are:
1) Feedback loops are measured in minutes or days in ML (compared to seconds in normal eng)
2) Errors are often silent in ML
twitter.com/gdb/status/115…
Tom Brown
@nottombrown · Aug 30
(2/4) Most ML people deal with silent errors and slow feedback loops via the "ratchet" approach:
1) Start with known working model
2) Record learning curves on small task (~1min to train)
3) Make a tiny code change
4) Inspect curves
5) Run full training after ~5 tiny changes
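A hypothetical sketch of the ratchet loop above (all names are illustrative, with plain gradient descent on a toy quadratic standing in for the ~1 min task): keep the last known-good learning curve and compare the curve after each tiny change before trusting it.

```python
# Minimal ratchet harness: train on a tiny task, record the learning
# curve, and compare against the known-good baseline curve.

def train_tiny_task(steps=50, lr=0.3):
    """Stand-in for a ~1 min training run: gradient descent on f(w) = w^2."""
    w, curve = 5.0, []
    for _ in range(steps):
        loss = w * w
        curve.append(loss)
        w -= lr * 2 * w  # gradient of w^2 is 2w
    return curve

baseline = train_tiny_task()            # 1) known working model
candidate = train_tiny_task(lr=0.35)    # 3) retrain after a tiny code change
# 4) inspect curves: accept the change only if the curve didn't regress
assert candidate[-1] <= baseline[-1] * 1.05, "regression: curve got worse"
```

In practice steps 2 and 4 mean plotting both curves, not just comparing final losses, but an automated final-loss check is a cheap first gate.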
(3/4) Downside of the ratchet approach: some designs can't be reached via small incremental changes.
It's also hard to know *which* tiny code changes to make. This is where understanding under/overfitting, regularization, etc. is useful. See @josh_tobin_'s talk:
youtube.com/watch?v=GwGTwP…
(4/4) Within the ratchet approach, I want more tools and best practices for making feedback loops shorter and for making errors louder.
Below is a short list of development speed hacks that I have found useful.
ML dev speed hack #0 - Overfit a single batch
- Before doing anything else, verify that your model can memorize the labels for a single batch and quickly bring the loss to zero
- This is fast to run, and if the model can't do this, then you know it is broken
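A minimal PyTorch sketch of this check (model, sizes, and thresholds are stand-ins): train on one fixed batch and fail loudly if the loss won't approach zero.

```python
# Overfit-a-single-batch check: if the model can't memorize 8 fixed
# examples, the model/loss/optimizer wiring is broken somewhere.
import torch

torch.manual_seed(0)
x = torch.randn(8, 16)            # one fixed batch
y = torch.randint(0, 4, (8,))     # fixed labels to memorize
model = torch.nn.Linear(16, 4)
opt = torch.optim.Adam(model.parameters(), lr=0.05)

for step in range(500):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

assert loss.item() < 0.1, f"failed to memorize one batch: loss={loss.item():.3f}"
```

This runs in seconds, so it can be the very first thing you do after any structural change.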
ML dev speed hack #1 - PyTorch over TF
- Time to first step is faster b/c no static graph compilation
- Easier to get loud errors via assertions within the code
- Easier to drop into debugger and inspect tensors
(TF2.0 may solve some of these problems but is still raw)
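A sketch of the "loud errors via assertions" point in eager PyTorch (the function and check are illustrative): because each op executes immediately, you can assert on live values mid-forward and land right at the failure point.

```python
# In eager mode there is no graph compilation step: ops run as written,
# so an assertion fires at the exact line where the bad value appears.
import torch

def forward(x):
    h = torch.relu(torch.nn.functional.linear(x, torch.randn(32, 16)))
    # a loud, immediate error instead of a NaN silently propagating for hours
    assert torch.isfinite(h).all(), "non-finite activations in h"
    return h

out = forward(torch.randn(4, 16))
```

From here dropping into a debugger (hack #4) lets you poke at `h` directly.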
ML dev speed hack #2 - Assert tensor shapes
- Wrong shapes due to silent broadcasting or reduction are a major hot spot for silent errors; asserting on shapes (in torch or TF) makes them loud
- If you're ever tempted to write shapes in a comment, make an assert instead
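A small PyTorch sketch of the failure mode and the fix (sizes are made up): a stray trailing dimension broadcasts without any error, and a shape assert is what catches it.

```python
# Silent broadcasting: (batch, 1) minus (batch,) quietly becomes
# (batch, batch) -- a wrong loss with no exception raised.
import torch

batch = 4
pred = torch.randn(batch, 1)    # e.g. model output with a trailing dim
target = torch.randn(batch)

bad = pred - target
assert bad.shape == (batch, batch)   # broadcast happened, nothing complained

good = pred.squeeze(1) - target
# write this assert instead of a "# shape: (batch,)" comment
assert good.shape == (batch,), f"got {tuple(good.shape)}"
```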
ML dev speed hack #3 - Add ML test to CI
- If more than one entrypoint or more than one person is working on the codebase, add a test that runs for N steps and then checks the loss
- If you only have one person and one entrypoint, then an ML test in CI is probably overkill
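One possible shape for such a CI test (task, sizes, and step count are made up): train briefly on a synthetic problem and fail if the loss doesn't drop.

```python
# ML smoke test for CI: a few seconds of training on synthetic data,
# failing loudly if a recent change broke the training loop.
import torch

def test_loss_decreases(steps=100):
    torch.manual_seed(0)
    x = torch.randn(64, 8)
    y = x @ torch.randn(8, 1)        # synthetic regression task
    model = torch.nn.Linear(8, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    start = torch.nn.functional.mse_loss(model(x), y).item()
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    assert loss.item() < start, "loss failed to decrease; a change likely broke training"
    return loss.item()

final = test_loss_decreases()
```

Run under pytest like any other test; the fixed seed keeps it deterministic enough for CI.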
ML dev speed hack #4 - Use ipdb.set_trace()
- It's hard to make an ML job take less than 10 seconds to start, which is too slow to maintain flow
- Using the ipdb workflow lets you zero in on a bug and play with tensors with a fast feedback loop
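A sketch of where the `ipdb.set_trace()` call goes (ipdb is a third-party drop-in for the stdlib `pdb`; the env-var guard here is an illustrative convention so the script still runs unattended):

```python
# Drop into an interactive debugger right before the suspect line;
# at the prompt you can inspect tensors and re-run expressions live.
import os

def suspect_step(x):
    if os.environ.get("DEBUG"):
        import ipdb; ipdb.set_trace()   # breakpoint only when DEBUG=1
    return x * 2

result = suspect_step(21)
```

Because the process stays alive at the prompt, each experiment costs seconds instead of another 10+ second job launch.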
ML dev speed hack #5 - Use nvvp to debug throughput
- ML throughput (step time) is one place where we have the tools to make errors loud and feedback fast
- You can use torch.cuda.nvtx.range_push to annotate the nvvp timeline to be more readable
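A sketch of the NVTX annotation (the `train_step` wrapper and range names are illustrative; the nvtx calls are guarded so the script also runs without CUDA):

```python
# Push/pop named NVTX ranges around phases of the step so they show up
# as labeled spans on the nvvp timeline.
import torch

def train_step(model, x):
    if torch.cuda.is_available():
        torch.cuda.nvtx.range_push("forward")
    out = model(x)
    if torch.cuda.is_available():
        torch.cuda.nvtx.range_pop()
        torch.cuda.nvtx.range_push("backward")
    out.sum().backward()
    if torch.cuda.is_available():
        torch.cuda.nvtx.range_pop()
    return out

model = torch.nn.Linear(16, 4)
out = train_step(model, torch.randn(8, 16))
```

You would then collect a timeline with something like `nvprof -o step.nvvp python train.py` and open it in nvvp, where the named ranges make the per-phase step time easy to read.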
Curious what other folks recommend for speeding up ML development feedback loops and for making errors louder.
cc tweeps who build stuff quickly: @karpathy, @catherineols, @jeremyphoward, @Thom_Wolf, @hardmaru, @goodfellow_ian, @soumithchintala, @D_Berthelot_ML, @josh_tobin_ , @mcleavey and @AlecRad.