HPC config
Recently, I had the chance to build and test one of my largest projects on an HPC cluster. It took me some time to build and run it correctly.
build
module purge
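The purge only clears the environment; a minimal sketch of the rest of the build, assuming a CMake-based project, is below. The module versions and the exports are the same ones the run script below uses; the cmake and make invocations are generic placeholders, not the project's exact commands.
module purge
module load gcc/9.1.0
module load hdf5/intel/1.10.0p1
module load cmake/intel/3.11.4
module load mpfr/gnu/3.1.5
module load gmp/gnu/6.1.2
module load boost/intel/1.71.0
# compilers and library roots (the *_ROOT variables are set by the modules)
export CXX=$(which g++)
export CC=$(which gcc)
export GMP_DIR=${GMP_ROOT}
export MPFR_DIR=${MPFR_ROOT}
export BOOST_DIR=${BOOST_ROOT}
# generic out-of-source CMake build
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release /path/to/source
make -j4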
run
On a cluster, you cannot simply run your binary all day long; once it exceeds the walltime limit it will simply get killed. Often you also want to run the same process thousands of times with different parameters.
On a batch system like SLURM, a simple .sbatch file that achieves this with a job array looks like the one below.
run.sbatch
#!/bin/bash
#SBATCH --job-name=gen_feat
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16GB
#SBATCH --time=6:00:00
#SBATCH --output=/path/to/slurm_output/%A_%a.out
#SBATCH --error=/path/to/err/%A_%a.err
module purge
module load gcc/9.1.0
module load hdf5/intel/1.10.0p1
module load cmake/intel/3.11.4
module load mpfr/gnu/3.1.5
module load gmp/gnu/6.1.2
module load boost/intel/1.71.0
# compilers and library roots (the *_ROOT variables are set by the modules above)
export CXX=$(which g++)
export CC=$(which gcc)
export GMP_DIR=${GMP_ROOT}
export MPFR_DIR=${MPFR_ROOT}
export BOOST_DIR=${BOOST_ROOT}
binary=/path/to/your/bin
ulimit -c 0 # disable core dump
export OMP_NUM_THREADS=1 # one thread, matching --cpus-per-task=1

# with {0..0} this loop runs once per array task; the range is a hook
# for giving each task more than one line of work
for i in {0..0}
do
    (
        let "fid = ${SLURM_ARRAY_TASK_ID} + 1" # sed counts lines from 1, array ids start at 0
        dirname=$(sed "${fid}q;d" /path/to/some.txt) # pick line ${fid} of the parameter file
        echo $fid
        echo $dirname
        $binary -u 1 -b 0.5 -t loop -d $dirname
    )
done
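Note that run.sbatch carries no --array directive itself; the index range is supplied at submission time. For a quick manual test (the range here is arbitrary):
sbatch --array=0-9 run.sbatch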
submit
Usually there is a limit on how many jobs you can have queued at once. A clever way around it is to let a Python script do the submitting for you, and to run that script in a tmux session so it won't be killed when you log out.
submit.py
import subprocess
import time

start = 0
end = 10000
interval = 500

# submit the array in chunks of `interval` tasks; sbatch exits with
# status 1 when a submission is rejected (e.g. the per-user limit is
# hit), so wait and retry the same chunk until it is accepted
for i in range(start, end, interval):
    submit = f'sbatch --array={i}-{i+interval-1} run.sbatch'
    print(submit)
    while 1 == subprocess.run(submit.split(), stderr=subprocess.DEVNULL).returncode:
        time.sleep(5)
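To keep the script alive after logging out, run it in a tmux session (the session name is arbitrary):
tmux new -s submit
python submit.py
# detach with Ctrl-b d; reattach later with: tmux attach -t submit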
check and kill jobs
squeue -u username
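squeue lists your jobs; to kill them, scancel takes either a single job id (from the squeue output) or your whole queue:
scancel 12345678      # one job; the id here is just an example
scancel -u username   # every job you own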