CheckFreq

Implementing CheckFreq’s asynchronous ML checkpointing in Darknet

I have recently implemented CheckFreq, an asynchronous checkpointing library for Deep Neural Networks (DNN), on top of Darknet, an open source neural network framework written in C and CUDA. Below are some high-level details about the implementation and one important lesson about CUDA concurrency.

Darknet handles model checkpointing synchronously: a checkpoint can start only after the weight update of an iteration completes, and the next iteration does not start until the checkpoint operation completes. In contrast, CheckFreq proposes a two-stage asynchronous checkpointing method in which the checkpoint operation executes in the background, and only the weight update of the next iteration has to wait until the first stage (the snapshot operation, which copies the model state to new memory pages on the GPU or CPU) completes. This wait is called the snapshot stall. The second stage (the persist stage, which writes the model state from GPU or CPU memory to files in persistent storage) can execute in the background without blocking any other operation. Implementing this correctly in Darknet involved three things:

Background threads
Thread synchronization using condition variables
CUDA streams for concurrency

I created two threads, one for training and one for checkpointing, from the main() function in Darknet. I used a global variable that keeps track of the checkpoint state and two condition variables to synchronize the training and checkpointing threads: one signals that the weight update has finished and a new checkpoint can be triggered, and the other signals that the snapshot has completed and the next weight update can start. Getting the synchronization between the two threads right took some effort because of multiple bugs related to checkpoint state and memory access. Once the bugs were rooted out, the overall iteration times with and without snapshot stalls were as expected.
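
The handshake between the two threads can be sketched with POSIX threads. This is a simplified illustration under stated assumptions, not the actual Darknet changes: the state names, the `weights` and `snapshot` arrays, the `run_training` function, and the checkpoint path are all hypothetical, the "model" is a small float array in host memory, and the snapshot is a plain `memcpy` rather than a GPU copy.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define N_WEIGHTS 4

/* Hypothetical checkpoint states; Darknet and CheckFreq use their own names. */
enum ckpt_state { IDLE, REQUESTED, SNAPSHOT_DONE, SHUTDOWN };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t update_done = PTHREAD_COND_INITIALIZER;   /* trainer -> checkpointer */
static pthread_cond_t snapshot_done = PTHREAD_COND_INITIALIZER; /* checkpointer -> trainer */
static enum ckpt_state state = IDLE;

static float weights[N_WEIGHTS];   /* stand-in for the model state */
static float snapshot[N_WEIGHTS];  /* in-memory copy made by stage one */
static int snapshots_taken = 0;
static int checkpoints_persisted = 0;

static void *checkpoint_thread(void *arg) {
    const char *path = arg;
    for (;;) {
        /* Wait until a weight update has finished or training is over. */
        pthread_mutex_lock(&lock);
        while (state != REQUESTED && state != SHUTDOWN)
            pthread_cond_wait(&update_done, &lock);
        if (state == SHUTDOWN) { pthread_mutex_unlock(&lock); return NULL; }
        pthread_mutex_unlock(&lock);

        /* Stage 1 (snapshot): safe without the lock because the trainer
           does not modify weights while state == REQUESTED. */
        memcpy(snapshot, weights, sizeof weights);

        pthread_mutex_lock(&lock);
        state = SNAPSHOT_DONE;
        snapshots_taken++;
        pthread_cond_signal(&snapshot_done);
        pthread_mutex_unlock(&lock);

        /* Stage 2 (persist): runs in the background and blocks nobody. */
        FILE *f = fopen(path, "wb");
        if (f) {
            if (fwrite(snapshot, sizeof snapshot, 1, f) == 1)
                checkpoints_persisted++;
            fclose(f);
        }
    }
}

/* Runs `iters` iterations with a checkpoint after each weight update.
   Returns the number of snapshots taken, or -1 if the thread cannot start. */
int run_training(int iters, const char *path) {
    pthread_t ckpt;
    if (pthread_create(&ckpt, NULL, checkpoint_thread, (void *)path) != 0)
        return -1;

    for (int it = 0; it < iters; it++) {
        /* Data loading and the forward-backward pass would go here; they
           may overlap with a snapshot that is still in progress. */

        pthread_mutex_lock(&lock);
        while (state == REQUESTED)               /* snapshot stall */
            pthread_cond_wait(&snapshot_done, &lock);
        pthread_mutex_unlock(&lock);

        for (int i = 0; i < N_WEIGHTS; i++)      /* weight update */
            weights[i] += 1.0f;

        pthread_mutex_lock(&lock);
        assert(state != REQUESTED);              /* no snapshot is pending */
        state = REQUESTED;                       /* trigger a new checkpoint */
        pthread_cond_signal(&update_done);
        pthread_mutex_unlock(&lock);
    }

    pthread_mutex_lock(&lock);
    while (state == REQUESTED)                   /* let the last snapshot finish */
        pthread_cond_wait(&snapshot_done, &lock);
    state = SHUTDOWN;
    pthread_cond_signal(&update_done);
    pthread_mutex_unlock(&lock);
    pthread_join(ckpt, NULL);
    return snapshots_taken;
}
```

Calling `run_training(3, "ckpt_sketch.bin")` from a `main()` (compiled with `-pthread`) should return 3 and leave every element of `snapshot` at 3.0. Only the weight update waits on `snapshot_done`; the persist stage never holds the lock, which mirrors the snapshot-stall behaviour described above.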

However, the really challenging part was tracking down a CUDA concurrency issue that was stalling the wrong operation in the training thread. The timing information for each operation inside an iteration (data loading, forward-backward pass, weight update, and snapshot stall) showed that weight updates were not stalling while a snapshot operation was ongoing. Instead, it was the forward-backward pass that was stalling. The issue stems from Darknet's use of a single CUDA stream per GPU for data transfers. Since the snapshot operation, the training data loader, and the forward-backward pass all use the same stream for data transfer, the host-to-device transfer of input training data for the next iteration cannot complete until the snapshot operation finishes its device-to-host transfer of the model state. Once I identified this issue, the solution was simple enough: use separate CUDA streams so that the two independent operations, snapshot and data loading, can transfer data concurrently.

I’ll put the code online some time in the future.

Building the CheckFreq Docker image

Use the CC-Ubuntu20.04-CUDA-20230831.1 image on Chameleon Cloud.
Install Docker and the NVIDIA Container Toolkit.

# https://docs.docker.com/engine/install/ubuntu/
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

# Install the latest version
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-20-04
# To use the docker command without sudo, run the following command, then log out and log back in.
sudo usermod -aG docker ${USER}

# Install the NVIDIA Container Toolkit
sudo apt-get install -y nvidia-container-toolkit

Pull the Docker image nvcr.io/nvidia/pytorch:19.05-py3.

docker pull nvcr.io/nvidia/pytorch:19.05-py3

Clone the DS-Analyzer (optional), CoorDL, and CheckFreq repositories. Copy the resumable iterator patch from the CheckFreq repo to the CoorDL repo and apply it.

git clone git@github.com:msr-fiddle/DS-Analyzer.git
git clone --recursive https://github.com/msr-fiddle/CoorDL
git clone git@github.com:msr-fiddle/CheckFreq.git
cp CheckFreq/dl_patch/resumable_iterator.patch CoorDL/
cd CoorDL
git apply resumable_iterator.patch

Update the Dockerfile named Docker_run_cuda in the CoorDL repo: comment out line 34 and add hdparm to the install list.
Update docker/build.sh in the CoorDL repo: set CUDA_IMAGE_NAME at line 200.

export CUDA_IMAGE_NAME="nvcr.io/nvidia/pytorch:19.05-py3"
#export CUDA_IMAGE_NAME="nvcr.io/nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04"

Build the Docker image py36_cu10.run by running build.sh from the docker/ directory of the CoorDL repo.

CREATE_RUNNER="YES" ./build.sh

Run the Docker image using the following command.

docker run --gpus all --ipc=host --mount src=/,target=/datadrive/,type=bind -it --rm --network=host --privileged nvidia/dali:py36_cu10.run

Useful commands:

# restart Docker 
sudo systemctl restart docker
# View Docker images 
docker image ls
# Delete docker image by ID
docker image rm <IMAGE ID>

Some GPU processes may terminate inside the container but keep running on the host. Use sudo kill <pid> on the host to kill such a process.

The Chameleon Ubuntu image mentioned in the first step can throw a "Failed to initialize NVML: Driver/library version mismatch" error at some point. To prevent this, disable automatic updates of the NVIDIA drivers by adding the following to /etc/apt/apt.conf.d/50unattended-upgrades (for example, with sudo nano /etc/apt/apt.conf.d/50unattended-upgrades).

Unattended-Upgrade::Package-Blacklist {
        "nvidia-";
        "libnvidia-";
        ...
};