Checkpointing and Restarting
Introduction
When a job is selected for preemption, SLURM sends signals that your program can catch in order to write a checkpoint before the job is killed. This page describes how to catch these signals, how to checkpoint your program, and how to requeue the job so that it restarts from the last checkpoint automatically once resources become available again.
Checkpointing a program
We have configured a one-hour grace period in SLURM to allow a preempted job to wrap up its work. With this grace period, a job selected for preemption is immediately sent the SIGCONT and SIGTERM signals, followed by the usual SIGCONT, SIGTERM, and SIGKILL sequence when it reaches its new end time(ref). The first SIGTERM is delivered to all job steps (a job step is anything started with srun in SLURM)(*). To catch this first SIGTERM and checkpoint, your program must therefore be launched with srun.
The following simple example shows how to catch the SIGTERM and call a signal handler.
simple.job
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:a100:1
#SBATCH -n 1 --ntasks-per-node=1
date
# Launch the program with srun so that it runs as a job step and receives the first SIGTERM.
srun ./simple.py
echo "finished"
simple.py
#!/usr/bin/env python3.11
import signal
import time

def signal_handler(sig, frame):
    # This is where a real program would write its checkpoint.
    print('A signal has been caught.')
    time.sleep(10)

# Call signal_handler when SIGTERM is received.
signal.signal(signal.SIGTERM, signal_handler)

# Stand-in for the actual computation.
time.sleep(600)
print('Done')
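Submit the example with sbatch simple.job; by default the output is written to a file named slurm-<jobid>.out in the submission directory. You do not have to wait for an actual preemption to test the handler: one way is to send SIGTERM to the running job step yourself (replace <jobid> with the ID reported by sbatch):

scancel --signal=TERM <jobid>

By default scancel delivers the signal to the job steps only, which is exactly where simple.py is running.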
A more comprehensive example of checkpointing can be found in this GitHub repository of the Center for Neurocomputation and Machine Intelligence of WTI.
Requeuing
If you have checkpoint-and-restart implemented in your code, you can add the SLURM option --requeue to your job. This option ensures that when your job is preempted, it is automatically returned to the queue. When resources become available again, the job restarts from the last checkpoint without any user intervention.
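As a sketch, the only change needed in the simple.job script above is the extra --requeue directive; the --open-mode=append option is not required for requeuing, but it keeps the output of the restarted job in the same file instead of overwriting it.

#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:a100:1
#SBATCH -n 1 --ntasks-per-node=1
# Return the job to the queue automatically when it is preempted.
#SBATCH --requeue
# Append to the existing output file after a restart instead of truncating it.
#SBATCH --open-mode=append
date
srun ./simple.py
echo "finished"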
Automatic checkpoint-and-restart
We now have all the ingredients to implement automatic checkpoint-and-restart: checkpointing at preemption (a signal handler that writes a checkpoint when the first SIGTERM arrives) and automatic restarting when resources become available again (the --requeue option).
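A minimal sketch of how the pieces fit together is shown below. It is not a drop-in recipe: the checkpoint file name checkpoint.json, the counter being checkpointed, and the 600-step run length are illustrative only; a real program would save and restore its own state (for example model weights and the epoch number) in the same places.

ckpt.job
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:a100:1
#SBATCH -n 1 --ntasks-per-node=1
#SBATCH --requeue
#SBATCH --open-mode=append
date
srun ./ckpt.py
echo "finished"

ckpt.py
#!/usr/bin/env python3.11
import json
import os
import signal
import sys
import time

CHECKPOINT = 'checkpoint.json'  # illustrative checkpoint file name

# Resume from the last checkpoint if one exists, otherwise start from step 0.
step = 0
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        step = json.load(f)['step']
    print(f'Restarting from step {step}', flush=True)

def save_checkpoint():
    # A real program would save its full state here (model weights, optimizer, ...).
    with open(CHECKPOINT, 'w') as f:
        json.dump({'step': step}, f)

def signal_handler(sig, frame):
    # The first SIGTERM marks the start of the grace period: write a
    # checkpoint and exit cleanly before the final SIGKILL arrives.
    print('SIGTERM caught, writing checkpoint', flush=True)
    save_checkpoint()
    sys.exit(0)

signal.signal(signal.SIGTERM, signal_handler)

# Stand-in for the real computation: one "step" per second.
while step < 600:
    time.sleep(1)
    step += 1

print('Done', flush=True)

Because the job is submitted with --requeue, SLURM puts the preempted job back in the queue; when it starts again, ckpt.py finds checkpoint.json and continues from the saved step instead of starting over.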
References
Footnotes
* Note: This is not clear in the SLURM documentation; we only found it out through trial and error.