This is a living document of the horrible things I did to my code base and my mental health. I hope it’s entertaining.
Bug 0015
When: Sep, 2024
Symptoms: Times New Roman rendered by matplotlib looks visually lighter (in color/weight) than on Overleaf
Time taken: 2 days
Solution: Use the LaTeX backend for matplotlib, because I really don't want to use tikz. sys.path.append didn't work in this case, which makes sense in hindsight: sys.path only affects Python module imports, while the latex binary is looked up through the PATH environment variable.
import os
import matplotlib.pyplot as plt

# Make the TeX Live binaries visible to matplotlib's LaTeX backend.
os.environ['PATH'] = '.../texlive/2024/bin/x86_64-linux:' + \
    os.environ['PATH']

# Render all text with LaTeX and load the Times font package.
plt.rcParams['text.usetex'] = True
plt.rcParams['font.family'] = 'serif'
plt.rcParams['text.latex.preamble'] = r'\usepackage{times}'
Bug 0014
When: Sep, 2024
Symptoms: Success rate plateaued in deployment after 4 rounds of continual learning.
Time taken: 5 days
Solution: Add more LoRA adapters.
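One way to read "add more LoRA adapters" is one fresh adapter per round instead of continuing to train the old one. A minimal sketch of that pattern with the peft library, assuming a placeholder base model and illustrative adapter names and hyperparameters (this is not the actual training code):
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model
config = LoraConfig(r=16, lora_alpha=32, target_modules=["c_attn"])  # illustrative hyperparameters

# Round 0: wrap the base model with the first adapter.
model = get_peft_model(base, config, adapter_name="round_0")

# Later rounds: attach a fresh adapter and make it the active one for that round.
model.add_adapter("round_1", config)
model.set_adapter("round_1")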
Bug 0013
When: June, 2024
Symptoms: val/accuracy had high variance
Time taken: 2 days
Solution: Filter val samples. Remove unreliable ones.
Bug 0012
When: May, 2024
Symptoms: Models are confidently wrong.
Time taken: 2 days
Solution:
- Plot val/loss. My model was massively overfitted; I should have trained for 10 epochs instead of 30.
- Add regularization via label smoothing, or reduce model complexity (e.g. fewer LoRA target modules).
- Read about model calibration (sequence probability aligned with actual accuracy) and plot a reliability diagram; see On Calibration of Modern Neural Networks. A minimal sketch of a reliability diagram follows this list.
- Overfitting on loss doesn't always lead to overfitting on accuracy. See: How is it possible that validation loss is increasing while validation accuracy is increasing as well
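A minimal sketch of the reliability-diagram computation, assuming per-sample confidences and correctness flags are already computed (the function and variable names are illustrative):
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    """Bucket predictions by confidence and compare average confidence to accuracy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows, ece = [], 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()
        avg_acc = correct[mask].mean()
        ece += mask.mean() * abs(avg_conf - avg_acc)   # expected calibration error term
        rows.append((lo, hi, avg_conf, avg_acc))
    return rows, ece  # plot avg_acc against avg_conf per bin for the diagram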
Bug 0011
When: May, 2024
Symptoms: Cannot fit one batch on Idefics2 with 10 images on TPU v4-8
Time taken: 3 hours
Solution: Average-pool the image latents (https://huggingface.co/HuggingFaceM4/idefics2-8b/discussions/18). Reducing image_seq_length from 64 to 32 allows me to fit 3 examples in one batch.
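The pooling idea, sketched generically in plain torch without assuming any particular Idefics2 API (shapes are illustrative):
import torch

latents = torch.randn(1, 64, 1152)                  # (batch, image_seq_length, hidden)
pooled = latents.view(1, 32, 2, 1152).mean(dim=2)   # average adjacent pairs -> 32 latents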
Bug 0010
When: May, 2024
Symptoms: Shape incompatibility in the backward pass in torch_xla. Similar to this
Time taken: 3 hours
Solution: Idefics2 had funny ways of expanding image tokens. I'm not sure why, but using a longer max_length for the sequence helped.
Bug 0009
When: Apr, 2024
Symptoms: Could not use all four chips on a TPU v4-8 device with the accelerate library.
Time taken: 5 hours
Solution: Set num_processes=null in the default accelerate config. This gets passed to xmp.spawn for multi-process scripts, where null (None) means use all available chips.
How: Stepping through tutorials like this, and really understanding the mapping of one chip per process.
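A minimal sketch of what that setting amounts to underneath, assuming a plain torch_xla multiprocessing entry point (the worker body is illustrative):
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _worker(index):
    # One process per chip; each process asks for its own XLA device.
    device = xm.xla_device()
    print(index, device)

if __name__ == "__main__":
    xmp.spawn(_worker, args=(), nprocs=None)  # nprocs=None spawns one process per available chip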
Bug 0008
When: Apr, 2024
Symptoms: Shape mismatch error in the backward pass when gradient checkpointing is enabled. Similar to this.
Time taken: 2 hours
Solution: Do not set use_cache if gradient checkpointing is enabled. PR
How:
- Omer pointed out this shape error is due to gradient checkpointing.
- Notice that previously Idefics had a similar patch to disable use_cache.
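The fix pattern, as a minimal sketch with a placeholder model (this is not the linked PR):
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
model.gradient_checkpointing_enable()
model.config.use_cache = False  # cached key/values conflict with recomputation in the backward pass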
Bug 0007
When: Apr, 2024
Symptoms: Cannot find devices on TPU v4-32 VMs. Specifically, the TPU driver finds the devices (/dev/accel* exists) but no worker joins the slice (times out after 15m).
Time taken: 3 days
Solution:
- Use the recommended runtime tpu-ubuntu2204-base.
- Follow the correct tutorial. I was following Run a calculation on a Cloud TPU VM using PyTorch, but a v4-8 is a single VM instance while a v4-32 is a pod. So Run PyTorch code on TPU Pod slices is the one to follow.
- A pod and a single VM are different because a pod needs multiprocessing (kudos Yoav), and xm.xla_device() cannot stand alone in global scope.
- So really, following the correct tutorial solved the problem.
How:
- sudo lsof -w /dev/accel0 to see which processes hold the TPU device
- TPU logs: cat /tmp/tpu_logs/tpu_driver.INFO and TPU_STDERR_LOG_LEVEL=0 TPU_MIN_LOG_LEVEL=0
- XLA logs: PT_XLA_DEBUG=1 and the XLA troubleshooting guidelines
Bug 0006
When: Mar, 2024
Symptoms: Cannot reproduce training results with/without gradient accumulation
Time taken: One week
Solution:
- Set seed before model initialization.
- Disable dropout layers (in the base model as well as the adapters). BatchNorm is also grad-accum unsafe.
- CrossEntropyLoss (with reduction='mean' by default) may not average correctly when -100 masks are applied to the targets: each micro-batch averages over a different number of unmasked tokens, so averaging the micro-batch means is not the same as averaging over the full batch. A minimal sketch follows this entry.
- Play with dtype. Floating point errors arise when I try to compute the loss in float32 vs float16. It makes things more complicated in gradients because I also used QLoRA.
- Give up because training is only tractable with QLoRA in my case.
How: Observe numerical values. Set up two debuggers side by side.
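A minimal sketch of the -100 masking issue from the solution list above, in plain torch (shapes and the masking pattern are illustrative):
import torch
import torch.nn.functional as F

logits = torch.randn(8, 5, 50)               # (batch, seq, vocab)
targets = torch.randint(0, 50, (8, 5))
for i in range(8):
    targets[i, : i % 4] = -100               # a different number of masked tokens per sample

# Naive accumulation: average the per-micro-batch means (weights every micro-batch equally).
naive = torch.stack([
    F.cross_entropy(l.reshape(-1, 50), t.reshape(-1), ignore_index=-100)
    for l, t in zip(logits.chunk(4), targets.chunk(4))
]).mean()

# Token-weighted accumulation: sum the losses, divide once by the total unmasked tokens.
total = sum(
    F.cross_entropy(l.reshape(-1, 50), t.reshape(-1), ignore_index=-100, reduction="sum")
    for l, t in zip(logits.chunk(4), targets.chunk(4))
)
exact = total / (targets != -100).sum()
# naive != exact whenever micro-batches contain different numbers of unmasked tokens.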
Bug 0005
When: Mar, 2024
Symptoms: None. Agony from slow training of a language task.
Time taken: Two months
Solution:
- Instead of a fixed max sequence length, use dynamic padding (a minimal collator sketch follows this entry).
- After getting the attention mask from the processor, reset the attention mask of padding tokens to zero.
How: Observe numerical values.
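A minimal sketch of dynamic padding with a batch-level collator, assuming a placeholder tokenizer rather than the actual processor from this project:
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # placeholder tokenizer
tokenizer.pad_token = tokenizer.eos_token
collator = DataCollatorWithPadding(tokenizer)          # pads each batch to its own longest sequence

features = [tokenizer(t) for t in ["a short one", "a slightly longer example sentence"]]
batch = collator(features)
# Padding positions get attention_mask == 0, so they are ignored by attention.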
Bug 0004
When: Mar, 2024
Symptoms: Cannot reproduce training results on the same hardware using the same configuration and seed.
Time taken: One day
Bug 0003
When: Feb, 2024
Symptoms: Cannot reproduce training results on the same hardware using the same configuration and seed. Train loss curves were close in the first 3 epochs, but diverged after. Same for validation curves.
Time taken: Two days
Solution: Downgrade the transformers library.
How: Talk to kind experts.
Bug 0002
When: Sometime in fall 2023
Symptoms: My chat bot serving online and offline gave different responses to the same requests.
Time taken: Two days
Solution: Online and offline models used different pre-processors.
How: Observe numerical values flowing into serving vs evaluation pipeline.
Bug 0001
(Fingers crossed I will not overflow 9999)
When: Sometime in spring 2020
Symptoms: I was building a prototype mobile robot. Voltage across one motor was consistently lower than expected.
Time taken: One hour
Solution: I put one battery in reverse.
How: Ask a TA for help.