
The SLURM scheduler (Simple Linux Utility for Resource Management) manages and allocates all of Sol's compute nodes. All of your computing must be done on Sol's compute nodes. The following is an abbreviated user guide for SLURM. Please visit the SLURM website for more detailed documentation of its tools and capabilities.

Partitions

SLURM uses the term partition instead of queue. There are several partitions available on Sol for running jobs:

  • lts : 20-core nodes purchased as part of the original cluster by LTS.
    • Two 2.3GHz 10-core Intel Xeon E5-2650 v3, 25M Cache, 128GB 2133MHz RAM
  • lts-gpu : 1 core per lts node is reserved for launching GPU jobs.
  • im1080 : 24-core nodes purchased by Wonpil Im, Department of Biological Sciences. Users can request a max of 20 cores per node.
  • im1080-gpu : 2 cores per im1080 node are reserved for launching GPU jobs.
    • Two 2.3GHz 12-core Intel Xeon E5-2670 v3, 30M Cache, 128GB 2133MHz RAM, Two EVGA Geforce GTX 1080 PCIE 8GB GDDR5
  • eng : 24-core nodes purchased by various RCEAS faculty.
  • eng-gpu : 2 cores per eng node are reserved for launching GPU jobs, i.e., 1 core for each GPU.
    • Two 2.3GHz 12-core Intel Xeon E5-2670 v3, 30M Cache, 128GB 2133MHz RAM, EVGA Geforce GTX 1080 PCIE 8GB GDDR5. Four nodes have two cards while the other nodes have one card.
  • engc : 24-core nodes based on Broadwell CPUs purchased by ChemE faculty. Users can request a max of 24 cores per node until GPUs are added to these nodes.
    • Two 2.2GHz 12-core Intel Xeon E5-2650 v4, 30M Cache, 64GB 2133MHz RAM
  • himem : 16-core node purchased by Economics faculty with 512GB RAM.
    • Two 2.6GHz 8-core Intel Xeon E5-2640 v3, 20M Cache, 512GB 2400MHz RAM
    • Users utilizing this node will be charged a higher rate of SU consumption (3 SU/core-hour). Please evaluate the memory consumption of your job before submitting jobs to this partition. If you need to use this partition, please contact Alex Pacheco.
  • enge, engi : 36-core nodes purchased by MEM & Chem faculty and the ISE Department.
    • Two 2.3GHz 18-core Intel Xeon Gold 6140, 24.75M Cache, 192GB 2666MHz RAM
    • These nodes feature the newer AVX512 vector extension that provides twice the FLOPS of earlier generation Haswell/Broadwell CPUs at the expense of CPU speed.
  • im2080 : 36-core nodes purchased by Wonpil Im, Department of Biological Sciences. Users can request a max of 28 cores per node.
  • im2080-gpu : 8 cores per im2080 node are reserved for launching GPU jobs, i.e., 2 cores per GPU.
    • Two 2.3GHz 18-core Intel Xeon Gold 6140, 24.75M Cache, 192GB 2666MHz RAM, Four ASUS GeForce RTX 2080TI PCIE 11GB GDDR6
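The per-job core limits in the table below follow from the reservations above: usable cores per job = physical cores per node minus cores reserved for launching GPU jobs. A quick sketch of the arithmetic, using the node sizes listed above:

```shell
# Usable cores per job = cores per node - cores reserved for GPU launches
# (node sizes and reservations as listed above)
echo "lts:    $((20 - 1)) cores"   # 20-core nodes, 1 core reserved for lts-gpu
echo "eng:    $((24 - 2)) cores"   # 24-core nodes, 2 cores reserved for eng-gpu
echo "im2080: $((36 - 8)) cores"   # 36-core nodes, 8 cores reserved for im2080-gpu
```

These match the Min/Max Cores/Node column in the table below (1/19, 1/22, 1/28).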


Partition     Max Wallclock   Min/Max Cores   Max SUs/Node        Max Memory
              (hours)         /Node per Job   consumed per hour   (GB) per core
lts           72              1/19            19                  6
lts-gpu       72              1/20            20                  6
im1080        48              1/20            20                  5
im1080-gpu    48              1/24            24                  5
eng           72              1/22            22                  5
eng-gpu       72              1/24            24                  5
engc          72              1/24            24                  2.5
enge          72              1/36            36                  5
engi          72              1/36            36                  5
himem         72              1/16            48                  32
im2080        48              1/28            28                  5
im2080-gpu    48              1/36            36                  5

The himem partition is for running high memory jobs, i.e., those requiring more than 6GB per core, or for using the Artelys Knitro software. Do not submit jobs to the himem partition if they require lower memory per core. All jobs in the himem partition are charged 3 SUs per core-hour of computing, irrespective of how many cores or how much memory you consume.
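As a worked example of the charging policy above (a sketch using the 3 SU/core-hour himem rate):

```shell
# SUs consumed = cores requested x wallclock hours x rate
cores=16     # a full himem node
hours=24
rate=3       # SU per core-hour on himem
su=$((cores * hours * rate))
echo "${su} SUs"   # a 16-core, 24-hour himem job consumes 1152 SUs
```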

Priorities

To ensure investors receive their allocation of resources while still maintaining a shared resource, each investor receives a priority boost on their investment. Every investor, hotel or condo, receives a base priority of 1 on all partitions. A priority boost of 100 is provided to investors and their collaborators on their investment. This ensures that an investor's job will always start before other users' jobs. Jobs accumulate a priority of 1 for each day in the queue, so a non-investor's job would have to be in the queue for 100 days before it can have a higher priority than an investor's job. Below is a table listing the various investors and the partitions where they have priority. All Hotel investors get priority access on the lts partition.


Investor                             Partition(s)
Hotel                                lts
Dimitrios Vavylonis                  lts
Wonpil Im                            im1080, im1080-gpu, im2080, im2080-gpu
Anand Jagota                         eng
Brian Chen                           eng
Edmund Webb III                      eng
Alparslan Oztekin                    eng
Jeetain Mittal                       lts-gpu, eng-gpu
Srinivas Rangarajan                  engc
Seth Richards-Shubik                 himem
Ganesh Balasubramanian               enge
Industrial and Systems Engineering   engi
Lisa Fredin                          enge
Paolo Bocchini                       engc
Hannah Dailey                        enge


Current Status

The current status of partitions and the load on nodes is updated every 15 minutes. The page is accessible only on campus or via VPN; do not bookmark it for off-campus use.

Usage

Usage reports for current and past allocation cycles are accessible only on campus or via VPN; do not bookmark them for off-campus use.

Detailed annual reports with consumption of resources by users and research groups are likewise accessible only on campus or via VPN; do not bookmark them for off-campus use. Some pages may take a while to load due to the amount of data reported.


File Systems

There are four distinct file spaces on Sol:

  • HOME: your home directory on Sol
  • SCRATCH: scratch storage on the local disk associated with your running job
  • CEPHFS: global parallel scratch for running jobs, with a lifetime of 7 days
  • CEPH: Ceph project space for research groups that have purchased a minimum 1TB Ceph project

HOME Storage

All Sol users are provided with a 150GB storage quota at /home/username, accessible using the environment variable $HOME. Home storage is a large Ceph project that is not backed up. It is the user's responsibility to maintain backups of their data in $HOME. $HOME directories are not deleted as long as annual user account fees are paid by the HPC PIs.
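Since $HOME is not backed up, a periodic archive of important results is worth scripting. A minimal sketch (the "results" directory here is a hypothetical stand-in, staged under a temporary directory so the example is self-contained):

```shell
# Sketch: archive a results directory into a tarball you can copy off the
# cluster. The demo directory below is a stand-in; in practice you would
# run something like: tar -czf backup.tar.gz -C $HOME results
demo=$(mktemp -d)
mkdir -p "$demo/results"
echo "sample data" > "$demo/results/run1.txt"

tar -czf "$demo/results-backup.tar.gz" -C "$demo" results

# Verify the archive contents before trusting it as a backup
listing=$(tar -tzf "$demo/results-backup.tar.gz")
echo "$listing"
```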

SCRATCH Storage

SCRATCH provides 500GB of storage on the local disk of the nodes associated with running jobs. This space is not backed up or snapshotted and is deleted when jobs complete. A user can access this space while running jobs at /scratch/$SLURM_JOB_USER/$SLURM_JOB_ID. Since compute nodes are shared among different users, the available disk space could be less than 500GB. Users who use the SCRATCH space need to make sure that data is copied back at the end of their jobs. Since the scheduler purges the SCRATCH storage at the end of a job, data that hasn't been copied cannot be recovered. See below for a sample script using SCRATCH storage.
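The copy-in/compute/copy-out pattern described above can be sketched as follows. Inside a real job, SLURM sets SLURM_JOB_USER, SLURM_JOB_ID and SLURM_SUBMIT_DIR, and the scratch path is /scratch/$SLURM_JOB_USER/$SLURM_JOB_ID; the temporary-directory stand-ins below exist only so the sketch runs outside SLURM:

```shell
# Stand-ins so this runs outside a SLURM job; in a real job use
# scratch=/scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID} (created by the scheduler)
scratch=$(mktemp -d)
submit=${SLURM_SUBMIT_DIR:-$(mktemp -d)}

cd "$scratch"
echo "simulation output" > filename.out   # stand-in for your job's output

# Copy results back BEFORE the job ends; SCRATCH is purged afterwards
cp filename.out "$submit"/
ls "$submit"
```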

Using Local Scratch for MD simulation

CEPHFS global parallel scratch

CEPHFS provides an 11TB global parallel scratch storage space. This space is not backed up or snapshotted, and all files older than 7 days are deleted. A user can access this space at /share/ceph/scratch/$USER/$SLURM_JOB_ID for running jobs and for 7 days after the job has completed. The SLURM scheduler automatically creates this directory. Users can use this space for writing parallel job output that needs a longer lifetime than that provided by SCRATCH. Since this storage is serviced by SSDs on the Ceph storage cluster, CEPHFS provides better read/write performance than the HOME and CEPH storage spaces. It is the user's responsibility to back up data within 7 days of job completion.

CEPH Storage

Lehigh Research Computing provides Ceph projects for research groups that require more storage than the 150GB provided with each HPC account. HPC PIs can add their collaborators to their Ceph project, which can be used as a storage space located at /share/ceph/projectname on Sol. Users should keep in mind that all Ceph projects, including $HOME, are networked file systems, and writing job output to these filesystems could affect the performance of your jobs. Ceph projects should be used for storage; all I/O-intensive workloads should use the SCRATCH or CEPHFS global scratch storage.


Running Jobs on Sol

You must be allocated at least one Sol compute node by SLURM to run jobs. Running compute-intensive workloads (i.e., anything other than editing files, submitting and monitoring jobs) on the head/login node is strictly prohibited. Users need to write a script requesting the desired resources from SLURM.

Migrating from PBS to SLURM

The following is a comparison between PBS and SLURM commands to aid users in migrating their submit scripts used on Corona to Sol.

Need a cheatsheet?

Script Directive          #PBS                     #SBATCH
Queue                     -q queuename             --partition=partitionname
Node Count                -l nodes=count           --nodes=count
CPU Count                 -l ppn=count             --ntasks-per-node=count
Wall Time Limit           -l walltime=hh:mm:ss     --time=hh:mm:ss
Memory Size               -l mem=MB                --mem=size[M|G|T]
Copy Environment          -V                       --export=ALL|NONE|variables
Standard Output           -o filename              --output=filename
Standard Error            -e filename              --error=filename
Combine stdout & stderr   -j oe (both to stdout)   (default if --error is not specified)
Job Name                  -N jobname               --job-name=jobname
Job Restart               -r y|n                   --requeue or --no-requeue
Event Notification        -m a|b|e                 --mail-type=BEGIN|FAIL|END|ALL
Email Address             -M address               --mail-user=address

Command              PBS/Torque          SLURM
Job Submission       qsub scriptname     sbatch scriptname
Job Deletion         qdel jobid          scancel jobid
Job Status by ID     qstat jobid         squeue -j jobid
Job Status by User   qstat -u username   squeue -u username
Job Hold             qhold jobid         scontrol hold jobid
Job Release          qrls jobid          scontrol release jobid
Queue List           qstat -q            sinfo -s
Node List            pbsnodes -l         sinfo -N or scontrol show nodes
Cluster Status       qstat -a            sinfo

Environment Variable            PBS/Torque      SLURM
Job ID                          PBS_JOBID       SLURM_JOB_ID
Submit Directory                PBS_O_WORKDIR   SLURM_SUBMIT_DIR
Node List                       PBS_NODEFILE    SLURM_JOB_NODELIST
Job Name                        PBS_JOBNAME     SLURM_JOB_NAME
Number of Processors            PBS_NP          SLURM_NTASKS
Number of Nodes                 PBS_NUM_NODES   SLURM_JOB_NUM_NODES
Number of Processors per Node   PBS_NUM_PPN     SLURM_NTASKS_PER_NODE
Job Queue/Partition             PBS_O_QUEUE     SLURM_JOB_PARTITION
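To see the SLURM analogues in action, you can echo them from a submit script. A small sketch (inside a job the scheduler sets these automatically; outside SLURM they are unset, so fallbacks are printed):

```shell
# Inside a SLURM job these variables are set by the scheduler; the :-unset
# fallbacks keep the sketch runnable outside SLURM as well.
line="Job ${SLURM_JOB_ID:-unset} in partition ${SLURM_JOB_PARTITION:-unset}"
echo "$line"
echo "Submit dir: ${SLURM_SUBMIT_DIR:-unset}"
echo "Tasks: ${SLURM_NTASKS:-unset} on nodes: ${SLURM_JOB_NODELIST:-unset}"
```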


There are two types of jobs that can be run on Sol:

  1. Interactive Jobs
  2. Batch Jobs


Interactive Jobs

These are jobs that provide an interactive environment or command line prompt in which users can enter commands to run simulations. They are best used for testing and debugging, and are not appropriate for long-running production jobs. Resources can be requested using the srun command with at least one option to launch a pseudo terminal: --pty /bin/bash. Other options include partition, number of nodes, tasks per node and time.

Interactive Job on lts partition requesting 1 cpu for 1 hour
srun --partition=lts --nodes=1 --ntasks-per-node=1 --time=60 --pty /bin/bash

When a resource becomes available, SLURM will provide you with a command prompt on the compute node you are allocated. Until resources are available, you will not have access to the command prompt of the shell where the above command was executed. If you cancel the command using Ctrl-C, your interactive job request will be cancelled. Depending on how busy the cluster is, your wait could be a few minutes to a few days.

All compute nodes have a naming convention sol-[a-e][1-6][00-18], for e.g. sol-a104. Do not run jobs on the head/login node i.e. sol.

Batch Jobs

These are jobs that require writing a series of commands in a shell script that SLURM will execute on the compute node. Resources can be requested in the script or as options to the sbatch command when submitting the script to the SLURM scheduler.

Sample Scripts for Batch Jobs


Serial Job
#!/bin/bash
# Sample script for submitting a serial job 
# on lts partition using 1 core per node
#  for 1 hour. 

# Use all-cpu Partition (using PBS convention, lts queue)
#SBATCH --partition=lts

# Request 1 hour of computing time
#SBATCH --time=1:00:00

# Request 1 core. Serial jobs cannot use more than 1 core.
# However, if the memory required exceeds RAM/core, then request
# more tasks but still use only 1 core
# Partition: Max RAM/Core in GB
# lts: 6.4
# eng/im1080/im1080-gpu: 5.3
# engc: 2.6
#SBATCH --ntasks=1


# Give a name to your job to aid in monitoring
#SBATCH --job-name myjob

# Write Standard Output and Error
#SBATCH --output="myjob.%j.%N.out"

# Notify user at events
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<username>@lehigh.edu


# cd to directory where you submitted the job
# or directory where you want to run the job
cd ${SLURM_SUBMIT_DIR}

# launch job
./myjob < filename.in > filename.out

# Alternatively, you can run your jobs through srun
# However, if your serial job requires more memory than 
# that allotted per core and you have requested > 1 core,
# then add -n 1 flag to srun to avoid running multiple copies
srun -n 1 ./myjob < filename.in > filename.out


exit
OpenMP Job
#!/bin/bash
# Sample script for submitting an OpenMP job 
# on the im1080 partition using 1 node, 12 cores per node
#  for 1 hour. Users can request up to 20 cores per node
#  in the im1080 partition

# Use im1080 Partition (using PBS convention, im1080 queue)
#SBATCH --partition=im1080

# Request 1 hour of computing time
#SBATCH --time=1:00:00

# Request 1 node, OpenMP cannot use more than 1 node
#SBATCH --nodes=1

# Request up to 20 cores on the node
# The im1080 partition has 2 GPUs per node and 
#   one core is reserved for each GPU
# You can use up to 19 cores on the lts partition and
#   up to 22 cores on the eng partition  
#SBATCH --ntasks-per-node=12


# Give a name to your job to aid in monitoring
#SBATCH --job-name=myjob

# Write Standard Output and Error
#SBATCH --output="myjob.%j.%N.out"

# Notify user at events
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<username>@lehigh.edu

# Setup Environment for OpenMP
# Specify number of OpenMP Threads
export OMP_NUM_THREADS=12

# cd to directory where you submitted the job
# or directory where you want to run the job
cd /scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}

# launch job assuming myjob is present at ${SLURM_SUBMIT_DIR}
${SLURM_SUBMIT_DIR}/myjob < ${SLURM_SUBMIT_DIR}/filename.in > filename.out


# Alternatively, you can specify number of OpenMP Threads at launch
OMP_NUM_THREADS=12 ${SLURM_SUBMIT_DIR}/myjob < ${SLURM_SUBMIT_DIR}/filename.in > filename.out


# Copy output file at the end of your job
# For jobs that contain only one output
cp filename.out ${SLURM_SUBMIT_DIR}
# If you are creating multiple output files, 
# you can use wildcards or rsync at the end of your job
rsync -avtz * ${SLURM_SUBMIT_DIR}/


exit
MPI Job
#!/bin/bash
# Sample script for submitting MPI job 
# on lts partition using 2 nodes, 20 cores per node
#  for 1 hour

# Use lts Partition (using PBS convention, lts queue)
#SBATCH --partition=lts

# Request 1 hour of computing time
#SBATCH --time=1:00:00

# Request 2 nodes
#SBATCH --nodes=2

# Request all 20 cores on the node
#SBATCH --ntasks-per-node=20

# Give a name to your job to aid in monitoring
#SBATCH --job-name myjob

# Write Standard Output and Error
#SBATCH --output="myjob.%j.%N.out"

# Notify user at events
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<username>@lehigh.edu

# Load mvapich2 module, by default mvapich2/2.1/intel-16.0.3
module load mvapich2

# cd to directory where you submitted the job
# or directory where you want to run the job
cd ${SLURM_SUBMIT_DIR}

# Use SLURM's srun command. 
# It contains information about allocated nodes and processors
# and can launch job without the need to specify them
srun ./myjob < filename.in > filename.out


exit
GPU Job
#!/bin/bash
#SBATCH --partition=im1080-gpu
# Directives can be combined on one line
#SBATCH --time=1:00:00
#SBATCH --nodes=1
# 1 CPU can be paired with only 1 GPU
# GPU jobs can request all 24 CPUs
#SBATCH --ntasks-per-node=1
# Request one GPU for your workload
#SBATCH --gres=gpu:1
# If you need both GPUs, use --gres=gpu:2
#SBATCH --job-name=myjob

cd /scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}
# Copy input and miscellaneous files to run directory
cp ${SLURM_SUBMIT_DIR}/* .

# Load LAMMPS Module
module load lammps/17nov16-gpu

# Run LAMMPS for input file in.lj
srun $(which lammps) -in in.lj -sf gpu -pk gpu 1 gpuID ${CUDA_VISIBLE_DEVICES} ${CUDA_VISIBLE_DEVICES}

# Copy output back to ${SLURM_SUBMIT_DIR} in a subfolder
cd ${SLURM_SUBMIT_DIR}/
mv /scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID} .

# Note that there is no guarantee which device will be assigned to your job.
# If you use 0 or 1 instead of ${CUDA_VISIBLE_DEVICES}, your job may end up
#  using GPUs assigned to another user
# NAMD: add "+devices ${CUDA_VISIBLE_DEVICES}" as a command line flag to charmrun
# GROMACS: add "-gpu_id ${CUDA_VISIBLE_DEVICES}" as a command line flag to mdrun
# If you request both GPUs, then
# LAMMPS: -pk gpu 2 gpuID 0 1
# NAMD: +devices 0,1
# GROMACS: -gpu_id 01

Submitting Jobs

To submit a job, run the command 
sbatch slurmjob.sh
sbatch can take command line arguments that would otherwise be added to the submit script. For example, to request a job for 12 hours and 4 nodes on the lts partition:
sbatch --time=12:00:00 --partition=lts --nodes=4 --ntasks-per-node=20 slurmjob.sh

Command line options to sbatch override #SBATCH commands in the submit script.


Submitting Dependency jobs

Suppose you want to run a long simulation that is split into multiple sequential runs to fit within the maximum walltimes of the partitions. One common method is to create a job submission script for each sequential step, which is either submitted by the previous job or submitted manually when the previous job completes. The former method is not recommended: some systems do not allow job submission from the compute nodes (you might encounter the same issue on national resources, as very few systems have queue walltimes larger than 7 days), and if you run out of walltime, the subsequent job may never be submitted. With the latter method, you lose valuable time if you are not monitoring your jobs and are not available to submit the subsequent job.

The recommended method is to submit jobs with a dependency attribute for the second and subsequent jobs. On Sol and any system that uses the SLURM job scheduler, dependency jobs are created by adding the --dependency=... flag to the sbatch command.

sbatch --dependency=afterok:<JobID> <Submit Script>

Here, you are submitting a SLURM script <Submit Script> that depends on a previous job with ID <JobID>. Options that can be added to the dependency argument are

  • afterok:<JobID> Job will be scheduled to run only if Job <JobID> has completed with no errors
  • afternotok:<JobID> Job will be scheduled to run only if Job <JobID> has completed with errors
  • afterany:<JobID> Job will be scheduled to run after Job <JobID> has completed, with or without errors
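A common pattern is to capture each job's ID with --parsable (which makes sbatch print only the job ID) and feed it to the next submission. Below is a dry-run sketch that only prints the commands it would run; submit.sh is a hypothetical script, and the job IDs are faked so the sketch runs without SLURM:

```shell
# Dry run: print the chain of sbatch commands for three sequential runs.
# In a real session, capture the real ID instead: jid=$(sbatch --parsable ...)
jid=""
for step in 1 2 3; do
  if [ -z "$jid" ]; then
    cmd="sbatch --parsable submit.sh"
  else
    cmd="sbatch --parsable --dependency=afterok:${jid} submit.sh"
  fi
  echo "$cmd"
  jid="100${step}"   # pretend sbatch printed this job ID
done
```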

Abbreviated Notations

SLURM also accepts abbreviated notation for sbatch options

Long Format                Short Format
--partition=name           -p name
--time=hh:mm:ss            -t hh:mm:ss
--nodes=number             -N number
--ntasks=count             -n count
--dependency=attributes    -d attributes

Monitoring Jobs

SLURM provides various tools for monitoring and manipulating jobs

Check queue status

squeue <Options>


Options

  • -u <username>: show status of all jobs for a particular user
  • -j <jobid>: show status for jobid
  • -l: show long format of queue status
  • -p <name>: show status of all jobs in partition name
  • --start: show estimated start times of pending jobs

Use --help option to see a full list of allowed options and usage

checkq is a script, accessible through the soltools module, that runs squeue with some useful defaults and can accept the above options.

Cancel/delete a job

You can only delete your own jobs, whether they are queued or already running

scancel <jobid>

Manipulate Jobs in Queue

A user or admin can manipulate jobs that are in the queue, i.e., not yet running.

Hold a job
scontrol hold <jobid>
Release a held job
scontrol release <jobid>

You can only release jobs that you have held. If an admin has held your job, only the admin can release it.

Show job details
scontrol show job <jobid>
Modify a job after submission
scontrol update SPECIFICATION jobid=<jobid>


Examples of SPECIFICATION are

  • add dependency after a job has been submitted: dependency=<attributes>
  • change job name: jobname=<name>
  • change partition: partition=<name>
  • modify requested runtime: timelimit=<hh:mm:ss>
  • request gpus (when changing to one of the gpu partitions): gres=gpu:<1,2,3 or 4>
SPECIFICATIONs can be combined. For example, the command to move queued job 123456 to the im1080 partition and change its timelimit to 48 hours is
scontrol update partition=im1080 timelimit=48:00:00 jobid=123456


Monitoring Queues

Display queue/partition names, runtimes and available nodes
alp514.sol(511): sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lts* up 3-00:00:00 9 idle sol-a101-109
im1080 up 2-00:00:00 24 alloc sol-b401-413,501-511
im1080 up 2-00:00:00 1 idle sol-b512
Display runtimes and available nodes for a particular queue/partition
alp514.sol(512): sinfo -p lts,im1080
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lts* up 3-00:00:00 9 idle sol-a101-109
im1080 up 2-00:00:00 24 alloc sol-b401-413,501-511
im1080 up 2-00:00:00 1 idle sol-b512


checkload is a script, accessible through the soltools module, that runs sinfo with some useful defaults and can accept the above options.

The status page for Sol partitions is updated every 15 mins and accessible at Lehigh and via VPN only. It is generated from the output of checkq and checkload for partition status and node usage respectively.
