HPC config
Recently, I had the chance to build and test one of my largest projects on an HPC cluster. It took me some time to build and run it correctly.
build
module purge
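The purge only clears the environment; a minimal sketch of the rest of the build, assuming a CMake-based project, is below. The module versions and the exports are the same ones the run script below uses; the cmake and make invocations are generic placeholders, not the project's exact commands.
module purge
module load gcc/9.1.0
module load hdf5/intel/1.10.0p1
module load cmake/intel/3.11.4
module load mpfr/gnu/3.1.5
module load gmp/gnu/6.1.2
module load boost/intel/1.71.0
# compilers and library roots (the *_ROOT variables are set by the modules)
export CXX=$(which g++)
export CC=$(which gcc)
export GMP_DIR=${GMP_ROOT}
export MPFR_DIR=${MPFR_ROOT}
export BOOST_DIR=${BOOST_ROOT}
# generic out-of-source CMake build
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release /path/to/source
make -j4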
run
On a cluster, you cannot simply run your binary all day long; once it exceeds the walltime limit it will simply get killed. Often you also want to run the same process thousands of times with different parameters.
On a batch system like SLURM, a simple .sbatch file that achieves this with a job array looks like the one below.
run.sbatch
#!/bin/bash
#SBATCH --job-name=gen_feat
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16GB
#SBATCH --time=6:00:00
#SBATCH --output=/path/to/slurm_output/%A_%a.out
#SBATCH --error=/path/to/err/%A_%a.err
module purge
module load gcc/9.1.0
module load hdf5/intel/1.10.0p1
module load cmake/intel/3.11.4
module load mpfr/gnu/3.1.5
module load gmp/gnu/6.1.2
module load boost/intel/1.71.0
# compilers and library roots (the *_ROOT variables are set by the modules above)
export CXX=$(which g++)
export CC=$(which gcc)
export GMP_DIR=${GMP_ROOT}
export MPFR_DIR=${MPFR_ROOT}
export BOOST_DIR=${BOOST_ROOT}
binary=/path/to/your/bin
ulimit -c 0 # disable core dump
export OMP_NUM_THREADS=1 # one thread, matching --cpus-per-task=1

# with {0..0} this loop runs once per array task; the range is a hook
# for giving each task more than one line of work
for i in {0..0}
do
    (
        let "fid = ${SLURM_ARRAY_TASK_ID} + 1" # sed counts lines from 1, array ids start at 0
        dirname=$(sed "${fid}q;d" /path/to/some.txt) # pick line ${fid} of the parameter file
        echo $fid
        echo $dirname
        $binary -u 1 -b 0.5 -t loop -d $dirname
    )
done
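Note that run.sbatch carries no --array directive itself; the index range is supplied at submission time. For a quick manual test (the range here is arbitrary):
sbatch --array=0-9 run.sbatch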
submit
Usually there is a limit on how many jobs you can have queued at once. A clever way around it is to let a Python script do the submitting for you, and to run that script in a tmux session so it won't be killed when you log out.
submit.py
import subprocess
import time

start = 0
end = 10000
interval = 500

# submit the array in chunks of `interval` tasks; sbatch exits with
# status 1 when a submission is rejected (e.g. the per-user limit is
# hit), so wait and retry the same chunk until it is accepted
for i in range(start, end, interval):
    submit = f'sbatch --array={i}-{i+interval-1} run.sbatch'
    print(submit)
    while 1 == subprocess.run(submit.split(), stderr=subprocess.DEVNULL).returncode:
        time.sleep(5)
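To keep the script alive after logging out, run it in a tmux session (the session name is arbitrary):
tmux new -s submit
python submit.py
# detach with Ctrl-b d; reattach later with: tmux attach -t submit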
check and kill jobs
squeue -u username
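squeue lists your jobs; to kill them, scancel takes either a single job id (from the squeue output) or your whole queue:
scancel 12345678      # one job; the id here is just an example
scancel -u username   # every job you own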