HPC config

Recently, I had the chance to build and test one of my largest projects on an HPC cluster. It took me some time to get it building and running correctly.

build

module purge
module load gcc/9.1.0
module load hdf5/intel/1.10.0p1
module load cmake/intel/3.11.4
module load mpfr/gnu/3.1.5
module load gmp/gnu/6.1.2
module load boost/intel/1.71.0
export CXX=$(which g++)
export CC=$(which gcc)
export GMP_DIR=${GMP_ROOT}
export MPFR_DIR=${MPFR_ROOT}
export BOOST_DIR=${BOOST_ROOT}
cmake \
  -DGMP_LIBRARIES="/share/apps/gmp/6.1.2/gnu/lib/libgmp.so.10.3.2" \
  -DGMP_INCLUDE_DIR="/share/apps/gmp/6.1.2/gnu/include" \
  -DMPFR_LIBRARIES="/share/apps/mpfr/3.1.5/gnu/lib/libmpfr.so.4.1.5" \
  -DMPFR_INCLUDES="/share/apps/mpfr/3.1.5/gnu/include" \
  -DPython_LIBRARIES="/share/apps/python3/3.6.3/intel/lib/libpython3.6m.so" \
  -DPYTHON_EXECUTABLE="/share/apps/python3/3.6.3/intel/bin/python" \
  ..
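
Once cmake configures successfully, the compile step is the usual CMake flow; a minimal sketch, nothing here is project-specific:

# build in parallel using all available cores
make -j "$(nproc)"
# or, equivalently, let CMake invoke the generator
cmake --build . -- -j "$(nproc)"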

run

On a cluster, you cannot simply run your binary all day: once it exceeds the scheduler's time limit, it gets killed. Often you also want to run the same process thousands of times with different parameters.
On a batch system like Slurm, a simple .sbatch file that achieves this looks like the one below.

run.sbatch

#!/usr/bin/bash
#SBATCH --job-name=gen_feat
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16GB
#SBATCH --time=6:00:00

#SBATCH --output=/path/to/slurmoutput/%A_%a.out
#SBATCH --error=/path/to/err/%A_%a.err
module purge
module load gcc/9.1.0
module load hdf5/intel/1.10.0p1
module load cmake/intel/3.11.4
module load mpfr/gnu/3.1.5
module load gmp/gnu/6.1.2
module load boost/intel/1.71.0
export CXX=$(which g++)
export CC=$(which gcc)
export GMP_DIR=${GMP_ROOT}
export MPFR_DIR=${MPFR_ROOT}
export BOOST_DIR=${BOOST_ROOT}
binary=/path/to/your/bin

ulimit -c 0 # disable core dumps
export OMP_NUM_THREADS=1

# map the array task ID to a line of the input list;
# sed line addresses are 1-based while the array IDs start at 0, hence the +1
let "fid = ${SLURM_ARRAY_TASK_ID} + 1"
dirname=$(sed "${fid}q;d" /path/to/some.txt)
echo "$fid"
echo "$dirname"
"$binary" -u 1 -b 0.5 -t loop -d "$dirname"

submit

Usually there is a limit on how many jobs you can have queued at once. A clever way around it is to have a Python script submit the jobs for you, and to run that script in a tmux session so it won't be killed when you log out.

submit.py

import subprocess
import time

start = 0
end = 10000
interval = 500
# submit the work as array jobs of `interval` tasks each
for i in range(start, end, interval):
    submit = f'sbatch --array={i}-{i+interval-1} run.sbatch'
    print(submit)
    # sbatch exits with code 1 while the submission limit is hit; retry every 5 s
    while 1 == subprocess.run(submit.split(), stderr=subprocess.DEVNULL).returncode:
        time.sleep(5)
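
To keep submit.py alive after you disconnect, run it inside tmux; a minimal sketch (the session name is arbitrary):

tmux new -s submit    # start a session
python submit.py      # run the submitter inside it
# detach with Ctrl-b d; reattach later with: tmux attach -t submit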

check / kill jobs

squeue -u username    # list your pending and running jobs
scancel jobid         # cancel a specific job
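
A few related commands that come in handy (standard Slurm; the IDs are placeholders):

scancel -u username                             # cancel all of your jobs at once
scancel 12345_7                                 # cancel task 7 of array job 12345
sacct -j jobid --format=JobID,State,Elapsed     # check the state of finished jobs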