slurm is the cluster management and job scheduling system used to submit jobs to the titan and dgx clusters.
To submit jobs to the titan or dgx clusters, you should ssh into hypnotoad.cs.utexas.edu. If you do not have access to hypnotoad, that is because you do not have access to the titan or dgx clusters. These are not public resources, and if you feel that you should have access to them, you will need to consult your advisor or professor.
At this time, there is no priority in place, and jobs are simply submitted FIFO. This decision was made by the owners of these machines, and not by the department, and may change in the future based on usage. Please bear that in mind as you are submitting your jobs. There are many people sharing these machines, and you are the only person enforcing your own fair share usage. If you are having a problem with a user you feel is monopolizing resources, please send email either to them or to your advisor requesting they consider the impact their jobs are having.
If you are having trouble with the software or the titan nodes themselves, please send mail to help@cs.utexas.edu.
To submit a job from hypnotoad, you need to first create a submission script. Here is a sample script that you can customize:
#!/bin/bash
#SBATCH --job-name=slurmjob # Job name
### Logging
#SBATCH --output=logs/slurmjob_%j.out # Name of stdout output file (%j expands to jobId); the logs/ directory must already exist
#SBATCH --error=logs/slurmjob_%j.err # Name of stderr output file (%j expands to jobId)
#SBATCH --mail-user=csusername@cs.utexas.edu # Email address for notifications
#SBATCH --mail-type=END,FAIL,REQUEUE # Events that trigger an email
### Node info
#SBATCH --partition=PARTITIONNAME # Queue name - current options are titans and dgx
#SBATCH --nodes=1 # Always set to 1 when using the cluster
#SBATCH --ntasks-per-node=1 # Number of tasks per node
#SBATCH --time=1:00:00 # Run time (hh:mm:ss)
#SBATCH --gres=gpu:4 # Number of GPUs needed
#SBATCH --mem=50G # Memory requirement
#SBATCH --cpus-per-task=8 # Number of CPUs needed per task
./your_script.sh
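The last line is where your actual workload runs. As a rough sketch only (the environment name and Python command below are placeholders, not anything provided by the cluster), your_script.sh might look like the following; remember to make it executable with chmod +x your_script.sh:
#!/bin/bash
# Illustrative payload script - replace everything below with your own workload
source ~/miniconda3/etc/profile.d/conda.sh # adjust to wherever your conda or virtualenv lives
conda activate myenv # placeholder environment name
python train.py --epochs 10 # placeholder command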
If you named the submission script above slurm.sh, you can submit it to the titan cluster by running
sbatch slurm.sh
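If the submission succeeds, sbatch prints the new job's ID, and that number is what %j expands to in the log filenames configured above. For example (the job ID shown here is illustrative), running sbatch slurm.sh would print something like
Submitted batch job 123456
and the job's output would then appear in logs/slurmjob_123456.out and logs/slurmjob_123456.err.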
Once you have submitted your script, here are some commands you may find useful.
See all jobs currently submitted:
squeue
See all jobs currently submitted by a specific user:
squeue -u <username>
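squeue also accepts filtering and formatting flags. For example, to see only one user's jobs on the titans partition:
squeue -u <username> -p titans
To choose your own columns (job id, name, state, elapsed time, and nodes or pending reason); the column widths here are just one possible layout:
squeue -u <username> -o "%.10i %.20j %.8T %.10M %R"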
See all jobs submitted to the titan cluster since a specific start date:
sacct -S MM/DD/YY -r titans
See all jobs submitted to the titan cluster by a specific user since a specific start date:
sacct -S MM/DD/YY -r titans -u <username>
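sacct can also report per-job accounting fields. For example, to see the state, run time, and peak memory of a user's jobs on the titan cluster (the date here is illustrative):
sacct -S 01/01/24 -r titans -u <username> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS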
To see utilization per user on the titan cluster since slurm was deployed:
/lusr/opt/slurm/bin/slurm-usage
To see utilization per user on the titan cluster during a specific time period:
sreport cluster AccountUtilizationByUser Account=cs Start=MM/DD/YY End=MM/DD/YY
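For example, to summarize utilization for January 2024 in hours rather than the default time format (the dates here are illustrative):
sreport cluster AccountUtilizationByUser Account=cs Start=01/01/24 End=02/01/24 -t Hours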
To cancel a running or queued job (you can find the job number with squeue):
scancel <job#>
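scancel also accepts filters, which can save time if you need to cancel several jobs at once. To cancel all of your own queued and running jobs:
scancel -u <username>
To cancel only your jobs with a specific job name:
scancel --name=slurmjob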
To see the current state of the available nodes in slurm:
sinfo
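sinfo can be narrowed to a single partition or expanded to show per-node detail. To show only the titans partition:
sinfo -p titans
To show one line per node with CPU, memory, and state details:
sinfo -N -l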
More information on how slurm works, useful commands, and applications can be found on slurm's webpage.