Skip to main content

A beginner's guide to ETH's Euler

·2224 words·11 mins
Table of Contents
2024 Update: Euler is now using Slurm instead of LSF. While this blog post can this be still useful, the LSF commands need to be translated into Slurm commands. See this quick reference.

🖍 Nota bene: You will need to be an ETH member to use Euler. However, this post might still be useful if you happen to be working on another cluster which uses the LSF batch system. Also relevant: I am currently mainly using MATLAB to run my simulations, so you’ll find some MATLAB specific advice. For Python specifics, check out this page.

⌛️ TL;DR: check out the end of the blog post for a practical suggestion on how to structure your job submission scripts.

Euler ?
#

Euler is the high performance computing cluster at ETH Zurich.

Like its predecessors, Euler follows a shareholder model, and is for the most part financed by its users. These “shareholders” then receive a share of the cluster’s resources (i.e CPU time), which is proportional to their investment. However, a small share of Euler is financed by the IT Services and is open to all ETH members.

And that’s a huge hooray for students!

Euler is really a great resource, but can be associated with a steep learning curve, especially if you are not used to working with a terminal and all that jazz.

In this post, we’ll go through:

  • How to run your first job
  • Some advanced tricks
  • A template for future jobs

Running a simple job
#

Connecting to Euler
#

The first thing you need to do is to connect to Euler. To do that, you either need to be within the ETH network or connected via VPN. The only requirement to login is a valid ETH account (no forms to fill out 🎉). Open a terminal and type:

ssh username@euler.ethz.ch

Now type in your NETHZ password… and you’re in! If this is the first time you’re logging in, the only extra step is to accept the cluster’s usage rights.

File transfer
#

To run your jobs, you’ll need to transfer your scripts and data to the cluster. To do so, you can use the scp command. The general use is as follows:

scp [options] source destination

Specifically, what you’ll need to type is:

scp file username@euler.ethz.ch:path/to/destination # from local computer to Euler
scp username@euler.ethz.ch:path/to/source/file path/to/destination # from Euler to local computer

To transfer directories, the option -r needs to be used:

scp -r directory username@euler.ethz.ch:path/to/destination # from local computer to Euler
scp -r username@euler.ethz.ch:path/to/source/directory path/to/destination # from Euler to local computer

You’ll find that acquainting yourself with bash scripting will come in handy for this type of commands. Check out the cheat sheet I left in the end of the blog post for just that!

A word on data storage
#

When it comes to personal storage, every user has access to two options:

HOME=/cluster/home/username
SCRATCH=/cluster/scratch/username

Home is a safe long-term storage for important data. Only you can access it and the content is saved every hour (snapshot) and every night (tape backup). You are allowed 16 GB of data or a maximum of 100 000 files. This is the directory (folder) you’ll automatically find yourself in once you login.

Scratch, on the other hand, is designed as a fast short-term storage. There is no backup and ETH will warn you that it is purged on a regular basis. Your disk quota in scratch rises to 2.5 TB and a maximum of 1 000 000 files.

To check how far away (or how close…) you are from your quota, use the command lquota:

lquota
+-----------------------------+-------------+------------------+------------------+------------------+
| Storage location:           | Quota type: | Used:            | Soft quota:      | Hard quota:      |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/home/user          | space       |          8.85 GB |         17.18 GB |         21.47 GB |
| /cluster/home/user          | files       |            25610 |            80000 |           100000 |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/shadow             | space       |          4.10 kB |          2.15 GB |          2.15 GB |
| /cluster/shadow             | files       |                2 |            50000 |            50000 |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/scratch/user       | space       |        237.57 kB |          2.50 TB |          2.70 TB |
| /cluster/scratch/user       | files       |               29 |          1000000 |          1500000 |
+-----------------------------+-------------+------------------+------------------+------------------+

For more information on data storage (e.g. options for shareholders), be sure to check out this presentation or this page.

Using Euler
#

All right! Now we’re ready to start using the cluster! And it all boils down to this:

bsub [LSF options] job

Of course, the tricky part is to figure out how to use the options and what constitutes a job. Well, here are a few options I find particularly useful (for more, check out this reference):

OptionDescription
-W HH:MMAllowed duration for job. Default: 04:00 hours.
-R "rusage[mem=X]"Memory (in MB per core) required. Default: X = 1024 MB.
-J jobnameAssigns name to job. Not necessarily unique!
-w "depcond"Sets dependencies between jobs. More on this below.
-rIf your node crashes, LSF will re-run the job on another node.

Now mind you that a job can be:

  • a Linux command: bsub [LSF options] command
  • a shell script: bsub [LSF options] < script
  • a program: bsub [LSF options] path/to/program

Ok, so far so good. Now, to use MATLAB, you’ll first need to load it.

module load new matlab/R2018a

Notice that I am loading a specific version of MATLAB. For more information on how to configure the MATLAB version you want to run, check out this link.

Once that’s done, let’s say you would like to fit a model using a script you wrote beforehand. To submit your job, you then type:

bsub -J "fitModel" matlab -singleCompThread -nodisplay -r "fit_model"

Pardon? What are all those options? It’s fairly easy:

  • -singleCompThread: forces MATLAB to use only one thread for its computations (prevents MATLAB from overloading the compute nodes).
  • -nodisplay: just explicitly tell MATLAB no graphical display is available.

⚠️ #MatlabProTip: the -r option must be followed by a function. That means that you should omit the .m at the end of your script name. This can be confusing or easily overlooked if your function doesn’t take in any arguments (as in the example above), and has once cost me an embarrassing amount of time.

⚠️⚠️ #Update: the -r option is deprecated in Matlab 2019a and newer versions. Instead, the -batch option should be used. For more on this, check out this link.

You have submitted your job and get back:

Generic job.
Job <94709747> is submitted to queue <normal.4h>.

That looks good. To manage your jobs, you’ll then need two new commands: bjobs and bkill, which let you monitor or kill all jobs or specific jobs (for more details, read this reference).

bjobs
> JOBID      USER   STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
> 94709747   jdoe   RUN   normal.4h   euler01  eu-c7-106-0  fitModel   Jul  3 07:19

bkill 0 # terminate all jobs
> Job <94709747> is being terminated

An alternative to bjobs is bbjobs (aka better bjobs).

More advanced tricks
#

Let’s take things up a notch. Let’s say that you need to run jobs which somehow depend on each other. The example I’ll use here is a greedy feature selection algorithm. Don’t mind the details, the important thing is you need to do your computations in three stages:

  1. Run a setup function (e.g. restructure some folders)
  2. Train your models
  3. Choose the best model

Now that means that you only want step 2 to kick in once the setup is ready. That’s called job chaining and can be done as follows:

bsub -J job1 command1
bsub -J job2 -w "done(job1)" command2
bsub -J job3 -w "done(job2)" command3

Note however that the condition for job2 is only fulfilled if job1 is completed successfully. If, for some reason, job1 crashes or is killed because it reached the runtime or memory limit, then the dependency for job2 is never true and job2 will stand, pending, forever… If you actually, for some heartless reason, don’t really care what happens to job1, then you should write:

bsub -J job1 command1
bsub -J job2 -w "ended(job1)" command2

If step 2 involves a suite of models and you want to only start step 3 once all models in step 2 are finished, then you might run into a problem.

If you only have two models and don’t mind writing out all the dependencies, you can opt for:

bsub -J job1 command1 # train model 1
bsub -J job2 command2 # train model 2
bsub -J job3 -w "done(job1) && done(job2)" command3 # choose best model

If, on the other hand, you are fitting a lot of models in step 2, you should consider using job arrays.

bsub -J job[1-10] command

-J job[1-10] defines the name of the array and the number of jobs it will contain (in this case: 10). You then need to get the JOBID of this job array. This is because the JOBID will be the same for the whole job array, but different subjobs will have different names. To get the JOBID, one way is to use bjobs and just copy the JOBID (let’s say: 94709747). Then, you can write:

bsub -w "numdone(94709747,*)" -J job2 command2

However, if you prefer (I would) to have the JOBID extracted automatically, then this is what you want:

JOBID=$(bsub command1 | awk '/is submitted/{print substr($2, 2, length($2)-2);}')
if [ -n "$JOBID" ]; then
        bsub -w "numdone($JOBID,*)" command2
fi

The first line of code will run the job it contains and extract the job’s ID. Filling two needs with one deed 💪. Oh wait, but what if I need to pass the array index as an argument to a function? That can also be done! You can pass $LSB_JOBINDEX as a command-line argument. Note the backslash \ before $LSB_JOBINDEX:

bsub -J "job[1-10]" matlab -nodisplay -singleCompThread -r "my_function(\$LSB_JOBINDEX)"

💣 Ok, now an even harder thing to do would be to set a dependency on multiple job arrays. That can also be done, with a few bash tricks:

dep_cond="" # create a variable to store the full dependency
for i in {1..50}; do
  JOBID=$(bsub -J "job$i[1-10]" matlab -nodisplay -singleCompThread -r "my_function(\$LSB_JOBINDEX)" | awk '/is submitted/{print substr($2, 2, length($2)-2);}')
  dep_cond+=" && numdone($JOBID,*)" # add new dependencies
done

dep_cond=${dep_cond:4} # remove leading " && "

bsub -w "$dep_cond" matlab -singleCompThread -nodisplay -r "my_function2()"

A huge thank you to the Euler cluster support for this one! In particular to Oliver Byrde!

A modest template
#

Here is a template of the type of scripts I like. Notice the following:

  • Use of $1: Taking in arguments from the command line is, in my opinion, really nice. It saves you the time of opening vim and having to edit, save, quit. You just type your arguments and click enter!
  • Use of echo with if/else statements to check if everything is running as planned.
  • Requirement of user confirmation. Some might find this annoying, but I have often realized, when submitting the job, that something was wrong. Requiring me to type in a confirmatory “yes” adds an extra check.

This script reconstructs the steps we had seen before, but with multiple iterations added, hence the for loop.

#!/bin/bash

# Loading matlab module
echo loading matlab module...
module load new matlab/R2017b

# Necessary input
arg1=$1 # takes in first input argument from command line
arg2=$2 # second input argument from command line
niter=50

# Checking important arguments
if [ "$arg2" == "'this'" ] || [ "$arg2" == "'that'" ]; then
  echo "Argument 2 recognized!"
else
  echo "Argument 2 NOT recognized! You will get an error!"
fi

# Run your loop
read -p "Run your cool script? [y: yes; n: no]" -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
    for ((i = 1; i <= $niter; i++)); do
        # Run setup
        if [[ $i -eq 1 ]]; then
          bsub -J "Setup$i" matlab -singleCompThread -nodisplay -r "setup($arg1,$arg2,$i)"
        else
          last_iter=$((i-1))
          bsub -J "Setup$i" -w "done(saveBest$last_iter)" matlab -singleCompThread -nodisplay -r "setup($arg1,$arg2,$i)"
        fi
        # Fit your models
         bsub -J "fitModels$i" -w "done(Setup$i)" matlab -singleCompThread -nodisplay -r "fit_models($arg1,$arg2,$i)"
        # Save your best model
        bsub -J "saveBest$i" -w "done(fitModels$i)" matlab -singleCompThread -nodisplay -r "save_best_model($arg1,$arg2,$i)"
    done
fi

Compare that to a version which uses job arrays:

#!/bin/bash

# Loading matlab module
echo loading matlab module...
module load new matlab/R2017b

# Necessary input
arg1=$1 # takes in first input argument from command line
arg2=$2 # second input argument from command line
niter=50

# Run your loop
read -p "Run your cool script? [y: yes; n: no]" -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
    for ((i = 1; i <= $niter; i++)); do
        # Run setup
        if [[ $i -eq 1 ]]; then
          bsub -J "Setup$i" matlab -singleCompThread -nodisplay -r "setup($arg1,$arg2,$i)"
        else
          last_iter=$((i-1))
          bsub -J "Setup$i" -w "done(saveBest$last_iter)" matlab -singleCompThread -nodisplay -r "setup($arg1,$arg2,$i)"
        fi
        # Fit your models and extract JOBID from job array
        max_sim=$((nparam-i+1))
        JOBID=$(bsub -J "fitModels$i[1-$max_sim]" -w "done(Setup$i)" matlab -singleCompThread -nodisplay -r "fit_models($arg1,$arg2,\$LSB_JOBINDEX)" | awk '/is submitted/{print substr($2, 2, length($2)-2);}')
        # Save the best model using JOBID as a dependency
        if [ -n "$JOBID" ]; then
          bsub -J "saveBest$i" -w "numdone($JOBID,*)" matlab -singleCompThread -nodisplay -r "save_best_model($arg1,$arg2,$i)"
        fi
    done
fi

Other useful resources
#

To use Euler, you’ll find yourself googling a lot of bash commands. In addition, figuring out how vim works can also be overwhelming in the beginning. To ease the pain, here are a few nice cheat sheets:

Happy computations ! 👾👾👾