Bugs

This is a living document of the horrible things I did to my code base and my mental health. I hope it’s entertaining.

Bug 0009

When: Apr, 2024

Symptoms: Could not use all four chips on a TPU v4-8 device with the accelerate library.

Time taken: 5 hours

Solution: Set num_processes to null in the default accelerate config. For multi-process scripts, this value is passed through to xmp.spawn, where null means all available chips.

How: Stepping through tutorials like this one, and really understanding the mapping of one chip per process.
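
A minimal sketch of that mapping, assuming torch_xla's xmp.spawn API (the function and variable names below are my own, not from accelerate):

```python
# num_processes: null in the accelerate config ends up as nprocs=None here,
# which tells XLA to start one process per available chip (4 on a v4-8).
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # One chip per process: each process sees exactly one XLA device.
    device = xm.xla_device()
    print(f"process {index} -> {device}")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=None)
```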

Bug 0008

When: Apr, 2024

Symptoms: Shape mismatch error in the backward pass when gradient checkpointing is enabled. Similar to this.

Time taken: 2 hours

Solution: Do not use use_cache when gradient checkpointing is enabled. PR

How:

  1. Omer pointed out that this shape error is due to gradient checkpointing.
  2. Noticed that Idefics previously had a similar patch to disable use_cache.
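
A minimal sketch of the workaround, assuming a Hugging Face transformers causal LM (the checkpoint name is only a placeholder):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()

# Cached key/value states are incompatible with recomputing activations
# during the backward pass, so turn the cache off while training.
model.config.use_cache = False
```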

Bug 0007

When: Apr, 2024

Symptoms: Cannot find devices on TPU v4-32 VMs. Specifically, the TPU driver finds the devices (/dev/accel* exists), but no worker joins the slice (it times out after 15 minutes).

Time taken: 3 days

Solution:

  1. Use the recommended runtime tpu-ubuntu2204-base.
  2. Follow the correct tutorial. I was following Run a calculation on a Cloud TPU VM using PyTorch, but a v4-8 is a single VM instance while a v4-32 is a pod, so Run PyTorch code on TPU Pod slices is the one to follow.
  3. A pod and a single VM are different because a pod needs multiprocessing (kudos Yoav), and xm.xla_device() cannot stand alone in global scope (see the sketch after this list).
  4. So really, following the correct tutorial solved the problem.
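
A minimal sketch of item 3, assuming torch_xla is installed on every worker VM of the slice (the script and function names are my own):

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Wrong on a pod: acquiring the device at module scope, before xmp.spawn
# has started the per-chip processes.
# device = xm.xla_device()

def _mp_fn(index):
    # Right: acquire the device inside the function that xmp.spawn runs
    # in each process.
    device = xm.xla_device()
    print(f"process {index} sees {device}")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=())

# On a pod slice the same script has to run on every worker VM, e.g.:
#   gcloud compute tpus tpu-vm ssh $TPU_NAME --zone=$ZONE --worker=all \
#       --command="python3 this_script.py"
```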

How:

  1. sudo lsof -w /dev/accel0 to see which processes hold the TPU device.
  2. TPU logs: cat /tmp/tpu_logs/tpu_driver.INFO, plus TPU_STDERR_LOG_LEVEL=0 and TPU_MIN_LOG_LEVEL=0 for more verbose output.
  3. XLA logs: PT_XLA_DEBUG=1 and the XLA troubleshooting guidelines.

Bug 0006

When: Mar, 2024

Symptoms: Cannot reproduce training results with vs. without gradient accumulation.

Time taken: One week

Solution:

  1. Set seed before model initialization.
  2. Disable dropout layers (in the base model as well as the adapters). BatchNorm is also grad-accum unsafe.
  3. CrossEntropyLoss (with reduction='mean' by default) averages over unmasked tokens only, so when -100 masks are applied to the targets the per-micro-batch means do not combine into the full-batch mean (see the toy example after this list).
  4. Play with dtypes. Floating-point differences arise when the loss is computed in float32 vs. float16, and the gradients are even harder to compare because I also used QLoRA.
  5. Give up because training is only tractable with QLoRA in my case.
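
A toy illustration of item 3 in plain PyTorch (shapes and values are made up):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)                   # 4 tokens, 5 classes
targets = torch.tensor([1, 2, -100, 3])      # one token masked out with -100

# Full-batch loss: mean over the 3 unmasked tokens.
full = F.cross_entropy(logits, targets)

# Two "micro-batches" as in gradient accumulation: the first averages over
# 2 unmasked tokens, the second over 1, so their plain average is not the
# full-batch mean.
a = F.cross_entropy(logits[:2], targets[:2])
b = F.cross_entropy(logits[2:], targets[2:])
print(full.item(), ((a + b) / 2).item())     # these generally differ
```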

How: Observe numerical values. Set up two debuggers side by side.

Bug 0005

When: Mar, 2024

Symptoms: None. Agony from slow training of a language task.

Time taken: Two months

Solution:

  1. Instead of a fixed max sequence length, use dynamic padding (see the sketch after this list).
  2. After getting the attention mask from the processor, reset the attention mask of padding tokens to zero.
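
A minimal sketch of both fixes, assuming a Hugging Face tokenizer (the checkpoint name is a placeholder; the real project used its own processor):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Item 1: pad each batch to its own longest sequence, not a fixed max length.
batch = tokenizer(
    ["a short example", "a much longer example sentence in the same batch"],
    padding="longest",
    return_tensors="pt",
)

# Item 2: make sure padding positions are not attended to; if the processor
# leaves them at 1, reset them to zero explicitly.
pad_positions = batch["input_ids"] == tokenizer.pad_token_id
batch["attention_mask"][pad_positions] = 0
```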

How: Observe numerical values.

Bug 0004

When: Mar, 2024

Symptoms: Cannot reproduce training results on the same hardware using the same configuration and seed.

Time taken: One day

Solution: Call set_seed(42) before model initialization.

How: Controlled experiments and Google search.
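
A minimal sketch of the ordering that matters, using transformers' set_seed (the model name is a placeholder):

```python
from transformers import AutoModelForCausalLM, set_seed

# Seed first: randomly initialized parts of the model (e.g. a new head or
# adapters) consume the RNG during construction.
set_seed(42)
model = AutoModelForCausalLM.from_pretrained("gpt2")
```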

Bug 0003

When: Feb, 2024

Symptoms: Cannot reproduce training results on the same hardware using the same configuration and seed. Train loss curves were close in the first 3 epochs but diverged afterwards. Same for the validation curves.

Time taken: Two days

Solution: Downgrade the transformers library.

How: Talk to kind experts

Bug 0002

When: Sometime in fall 2023

Symptoms: My chatbot gave different responses to the same requests when served online vs. offline.

Time taken: Two days

Solution: The online and offline models were using different pre-processors.

How: Observe numerical values flowing into serving vs evaluation pipeline.

Bug 0001

(Fingers crossed I will not overflow 9999)

When: Sometime in spring 2020

Symptoms: I was building a prototype mobile robot. The voltage across one motor was consistently lower than expected.

Time taken: One hour

Solution: I put one battery in reverse.

How: Ask the TA for help.