Bugs

This is a living document of the horrible things I did to my code base and my mental health. I hope it’s entertaining.

Bug 0013

When: June, 2024

Symptoms: val/accuracy had high variance

Time taken: 2 days

Solution: Filter the val samples and remove the unreliable ones (rough sketch below).
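
Roughly what the filtering looks like, as a sketch only: the “reliable” flag is a stand-in for whatever criterion marks a sample as trustworthy, which this log doesn’t record.

```python
from datasets import Dataset

# Sketch only: the "reliable" flag is a placeholder criterion.
val_ds = Dataset.from_dict({
    "question": ["q1", "q2", "q3"],
    "reliable": [True, False, True],   # e.g. q2 has an ambiguous label
})
val_ds = val_ds.filter(lambda ex: ex["reliable"])   # keep only reliable samples
```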

Bug 0012

When: May, 2024

Symptoms: Models are confidently wrong.

Time taken: 2 days

Solution:

  1. Plot val/loss. My model was massively overfit. I should have trained for 10 epochs instead of 30.
  2. Add regularization via label smoothing (see the sketch after this list), or reduce model capacity (e.g. fewer LoRA target modules).
  3. Read about model calibration (sequence probability should line up with actual accuracy) and plot a reliability diagram. See On Calibration of Modern Neural Networks.
  4. Overfitting in loss doesn’t always mean overfitting in accuracy. See: How is it possible that validation loss is increasing while validation accuracy is increasing as well?
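
A minimal sketch of the label-smoothing part (point 2), assuming a plain PyTorch loss; the 0.1 and the shapes are illustrative only, and with the HF Trainer the equivalent knob is label_smoothing_factor in TrainingArguments.

```python
import torch
import torch.nn as nn

# Label smoothing softens the one-hot targets, which discourages the model from
# becoming confidently wrong. Smoothing value and shapes are illustrative.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=-100)

logits = torch.randn(4, 32000)            # (num_tokens, vocab_size), dummy values
targets = torch.randint(0, 32000, (4,))   # dummy next-token targets
loss = criterion(logits, targets)
```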

Bug 0011

When: May, 2024

Symptoms: Cannot fit one batch on Idefics2 with 10 images on TPU v4-8

Time taken: 3 hours

Solution: Average-pool the image latents (https://huggingface.co/HuggingFaceM4/idefics2-8b/discussions/18). Reducing image_seq_length from 64 to 32 lets me fit 3 examples in one batch.
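
The gist of the pooling trick, as a hedged sketch: the function name and where it hooks into the model are my own placeholders, not the actual Idefics2 internals (see the linked discussion for the real patch).

```python
import torch
import torch.nn.functional as F

def pool_image_latents(latents: torch.Tensor) -> torch.Tensor:
    # latents: (batch, 64, hidden) image latents from the vision connector.
    # Average-pool pairs of latents so each image costs 32 tokens instead of 64;
    # the processor's image sequence length has to be reduced to match.
    latents = latents.transpose(1, 2)                 # (batch, hidden, 64)
    latents = F.avg_pool1d(latents, kernel_size=2)    # (batch, hidden, 32)
    return latents.transpose(1, 2)                    # (batch, 32, hidden)

pooled = pool_image_latents(torch.randn(2, 64, 4096))  # -> (2, 32, 4096)
```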

Bug 0010

When: May, 2024

Symptoms: Shape incompatibility in the backward pass in torch_xla. Similar to this.

Time taken: 3 hours

Solution: Idefics2 has a funny way of expanding image tokens. I’m not sure why, but using a longer max_length for the sequence helped (sketch below).
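
My guess at why, plus a sketch of the workaround, assuming a standard HF processor call; the 1024 below is an illustrative value, not the one I actually used.

```python
def preprocess(processor, texts, images, max_length=1024):
    # Pad every example to one fixed (longer) max_length so torch_xla sees a
    # single static shape; my guess is the shape error came from dynamic shapes
    # interacting badly with how Idefics2 expands image tokens.
    return processor(
        text=texts,
        images=images,
        padding="max_length",
        max_length=max_length,
        truncation=True,
        return_tensors="pt",
    )
```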

Bug 0009

When: Apr, 2024

Symptoms: Could not use all four chips on a TPU v4-8 device with the accelerate library.

Time taken: 5 hours

Solution: Set num_processes: null in the default accelerate config. This value gets passed down to xmp.spawn for multi-process scripts, where null means “use all available chips” (sketch below).

How: Stepping through tutorials like this, and really understanding the one-chip-per-process mapping.
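
What null ends up meaning, sketched with xmp.spawn directly; this is my understanding of what the accelerate launcher does under the hood, not its exact code.

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # One process per chip; `index` is this process's ordinal.
    device = xm.xla_device()
    print(index, device)

if __name__ == "__main__":
    # nprocs=None (what num_processes: null boils down to) spawns one process
    # per available chip instead of a single process.
    xmp.spawn(_mp_fn, nprocs=None)
```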

Bug 0008

When: Apr, 2024

Symptoms: Shape mismatch error in the backward pass with gradient checkpointing. Similar to this.

Time taken: 2 hours

Solution: Disable use_cache when gradient checkpointing is enabled (sketch after this entry). PR

How:

  1. Omer pointed out this shape error is due to gradient checkpointing.
  2. Notice that previously Idefics had a similar patch to disable use_cache.
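
The fix in code, sketched with a generic transformers call; the checkpoint name is just a stand-in.

```python
from transformers import AutoModelForVision2Seq

# KV caching and gradient checkpointing don't play well together, so the cache
# is switched off for training.
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")
model.config.use_cache = False           # no KV cache during training
model.gradient_checkpointing_enable()    # recompute activations in the backward pass
```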

Bug 0007

When: Apr, 2024

Symptoms: Cannot find devices on TPU v4-32 VMs. Specifically, the TPU driver finds the devices (/dev/accel* exists) but no worker joins the slice (times out after 15m).

Time taken: 3 days

Solution:

  1. Use the recommended runtime tpu-ubuntu2204-base.
  2. Follow the correct tutorial. I was following Run a calculation on a Cloud TPU VM using PyTorch, but a v4-8 is a single VM instance while a v4-32 is a pod. So Run PyTorch code on TPU Pod slices is the one to follow.
  3. A pod and a single VM are different because a pod needs multiprocessing (kudos Yoav), and xm.xla_device() cannot stand alone in global scope (see the sketch after this entry).
  4. So really, following the correct tutorial solved the problem.

How:

  1. sudo lsof -w /dev/accel0
  2. TPU logs: cat /tmp/tpu_logs/tpu_driver.INFO, with TPU_STDERR_LOG_LEVEL=0 TPU_MIN_LOG_LEVEL=0.
  3. XLA logs: PT_XLA_DEBUG=1, plus the XLA troubleshooting guidelines.
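
The global-scope point from item 3, sketched; this pairs with the same xmp.spawn pattern as in the Bug 0009 sketch above.

```python
import torch_xla.core.xla_model as xm

# Wrong on a pod slice: grabbing the device at module/global scope.
# device = xm.xla_device()

def _mp_fn(index):
    # Right: each spawned process acquires its own device inside the entry point.
    device = xm.xla_device()
    xm.master_print(f"worker {index} sees {device}")
```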

Bug 0006

When: Mar, 2024

Symptoms: Cannot reproduce training results with/without gradient accumulation

Time taken: One week

Solution:

  1. Set seed before model initialization.
  2. Disable dropout layers (in the base model as well as the adapters). BatchNorm is also grad-accum unsafe.
  3. CrossEntropyLoss (with reduction=’mean’ by default) does not average consistently across micro-batches when the -100 masks are applied to the targets, because the mean is taken over unmasked tokens only (toy example after this entry).
  4. Play with dtype. Floating-point discrepancies arise when I compute the loss in float32 vs float16, and gradients get even messier because I also used QLoRA.
  5. Give up because training is only tractable with QLoRA in my case.

How: Observe numerical values. Set up two debuggers side by side.
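
A toy illustration of point 3, with made-up shapes: averaging micro-batch losses only matches the full-batch mean if every micro-batch has the same number of unmasked tokens.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 5)                  # 8 tokens, vocab of 5 (made-up shapes)
labels = torch.randint(0, 5, (8,))
labels[:3] = -100                           # mask the first three targets

full = F.cross_entropy(logits, labels, ignore_index=-100)            # mean over 5 tokens
first = F.cross_entropy(logits[:4], labels[:4], ignore_index=-100)   # mean over 1 token
second = F.cross_entropy(logits[4:], labels[4:], ignore_index=-100)  # mean over 4 tokens
print(full.item(), ((first + second) / 2).item())                    # generally not equal
```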

Bug 0005

When: Mar, 2024

Symptoms: None. Agony from slow training of a language task.

Time taken: Two months

Solution:

  1. Instead of padding to a fixed max sequence length, use dynamic padding (collate sketch after this entry).
  2. After getting the attention mask from the processor, set the attention-mask entries of padding tokens to zero.

How: Observe numerical values.
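
Both fixes in one place, as a sketch: the collate function and the “text” field are placeholders, assuming an HF-style tokenizer/processor with a pad token.

```python
def collate(batch, tokenizer):
    enc = tokenizer(
        [ex["text"] for ex in batch],   # "text" is a placeholder field name
        padding="longest",              # dynamic padding: only up to this batch's longest sample
        return_tensors="pt",
    )
    # Belt and braces: make sure pad positions are really masked out.
    enc["attention_mask"][enc["input_ids"] == tokenizer.pad_token_id] = 0
    return enc
```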

Bug 0004

When: Mar, 2024

Symptoms: Cannot reproduce training results on the same hardware using the same configuration and seed.

Time taken: One day

Solution: set_seed(42) before model initialization.

How: Controlled experiments and Google search
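
In code (gpt2 is just a stand-in checkpoint):

```python
from transformers import AutoModelForCausalLM, set_seed

set_seed(42)   # seed python/numpy/torch *before* the model (and adapters) are built
model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in checkpoint
```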

Bug 0003

When: Feb, 2024

Symptoms: Cannot reproduce training results on the same hardware using the same configuration and seed. Train loss curves were close in the first 3 epochs, but diverged after. Same for validation curves.

Time taken: Two days

Solution: Downgrade transformers library.

How: Talk to kind experts

Bug 0002

When: Sometime in fall 2023

Symptoms: My chatbot gave different responses to the same requests when served online vs offline.

Time taken: Two days

Solution: Online and offline models used different pre-processors.

How: Observe numerical values flowing into serving vs evaluation pipeline.

Bug 0001

(Fingers crossed I will not overflow 9999)

When: Sometime in spring 2020

Symptoms: I was building a prototype mobile robot. The voltage across one motor was consistently lower than expected.

Time taken: One hour

Solution: I had put one battery in backwards.

How: Ask the TA for help