3.2 Parallel Runs

Some of the TURBOMOLE modules are parallelized using the message passing interface (MPI) for distributed and shared memory machines or with OpenMP or multi-threaded techniques for shared memory and multi-core machines.

Generally there are two hardware scenarios which determine the kind of parallelization that is possible to use:

The list of parallelized programs presently includes:

No additional keywords are needed in the input for parallel runs with the MPI binaries. When using the parallel version of TURBOMOLE, scripts replace the binaries: they prepare the usual input, run the necessary steps and automatically start the parallel programs. Users just have to set environment variables, see Sec. 3.2.1 below.

To use the OpenMP parallelization, only an environment variable needs to be set. However, to use this parallelization efficiently one should consider a few additional points, e.g. memory usage, which are described in Sec. 3.2.2.

3.2.1 Running Parallel Jobs — MPI case

The parallel version of TURBOMOLE runs on all supported systems:

Setting up the parallel MPI environment

In addition to the installation steps described in Section 2 (see page 48) you just have to set the variable PARA_ARCH to MPI, i.e. in sh/bash/ksh syntax:

export PARA_ARCH=MPI

This will cause sysname to append the string _mpi to the system name and the scripts like jobex will take the parallel binaries by default. To call the parallel versions of the programs ridft, rdgrad, dscf, grad, ricc2, or mpgrad from your command line without explicit path, expand your $PATH environment variable to:

export PATH=$TURBODIR/bin/`sysname`:$PATH

The usual binaries are now replaced by scripts that prepare the input for a parallel run and start mpirun (or poe on IBM) automatically. The number of CPUs that shall be used can be chosen by setting the environment variable PARNODES:

export PARNODES=8

The default for PARNODES is 2.
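As a sketch, a complete sh/bash setup for an MPI run on 8 cores might then look as follows (the installation path is a placeholder):

export TURBODIR=/whereis/TURBOMOLE
export PARA_ARCH=MPI
export PATH=$TURBODIR/bin/`sysname`:$PATH
export PARNODES=8
jobex -ri > jobex.out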

Finally the user can set a default scratch directory that must be available on all nodes. Writing scratch files to local directories is highly recommended; otherwise the scratch files will be written over the network to the same directory where the input is located. The path to the local disk can be set with

export TURBOTMPDIR=/scratch/username

This setting is automatically recognized by most parallel programs. Note:

MPI versions, distributions and flavours

TURBOMOLE uses the MPI version with which the binaries were generated. To make sure that the parallel version runs no matter which MPI flavour is installed on your machines, TURBOMOLE includes the run-time version of the MPI flavour it needs.

Please do not try to use TURBOMOLE with your local MPI version (OpenMPI, MPICH, ...)! Do not call the parallel MPI binaries directly; just set $PARA_ARCH as described above and call the modules the same way you use them in the serial version.

On Linux for PCs and Windows systems either IBM Platform MPI (formerly known as HP-MPI, now also known as IBM Spectrum MPI) or Intel MPI is used and included.

COSMOlogic ships TURBOMOLE with an IBM Platform MPI Community Edition or the full Intel MPI version. TURBOMOLE users do not have to install or license IBM Platform MPI or Intel MPI themselves. Parallel binaries will run out of the box on the fastest interconnect that is found (InfiniBand, Myrinet, TCP/IP, etc.).

Note: most parallel TURBOMOLE modules need an extra server running in addition to the clients. This server is included in the parallel binaries and it will be started automatically — but this results in one additional task that usually does not need any CPU time. So if you are setting PARNODES to N, N+1 tasks will be started.

If you are using a queuing system or if you give a list of hosts on which TURBOMOLE jobs shall run (see below), make sure that the number of supplied nodes matches $PARNODES; e.g. if you are using 4 CPUs via a queuing system, make sure that $PARNODES is set to 4.

Starting parallel jobs

After setting up the parallel environment as described in the previous section, parallel jobs can be started just like the serial ones. If the input is a serial one, it will be prepared automatically for the parallel run.
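For example, assuming the environment has been set up as in the previous section, a parallel RI-DFT single point or a geometry optimization is started with the usual commands (a sketch):

ridft > ridft.out
jobex -ri > jobex.out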

For the additional mandatory or optional input for parallel runs with the ricc2 program see Section 10.6.

Running calculations on different nodes

If TURBOMOLE is supposed to run on a cluster, we highly recommend the usage of a queuing system like PBS, Univa/SGE GridEngine or LSF. The parallel version of TURBOMOLE will automatically recognise that it has been started from within one of these queuing systems, and the binaries will run on the machines that the queuing system provides.

Important: Make sure that the input files are located on a network directory like an NFS disk which can be accessed on all nodes that participate in the calculation.

If parallel jobs are started outside a queuing system, or if you have a non-supported or a non-default installation of the above-mentioned queuing systems, the number of nodes and their names can also be provided by the user. A file that contains a list of machines has to be created, each line containing one machine name:

node1  
node1  
node2  
node3  
node4  
node4

And the environment variable $HOSTS_FILE has to be set to that file:

export HOSTS_FILE=/nfshome/username/hostsfile

Note: Do not forget to set $PARNODES to the number of lines in $HOSTS_FILE, unless you have additionally set OMP_NUM_THREADS (see below).
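A small sketch in sh syntax that keeps $PARNODES consistent with the hosts file shown above:

export HOSTS_FILE=/nfshome/username/hostsfile
export PARNODES=`wc -l < $HOSTS_FILE`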

Note: In general the stack size limit has to be raised to a reasonable amount of memory (or to unlimited). In the serial version the user can set this with ulimit -s unlimited on bash/sh/ksh shells or limit stacksize unlimited on csh/tcsh shells. However, for the parallel version this is not sufficient if several nodes are used, and the /etc/security/limits.conf files on all nodes might have to be changed; see chapter 2.2 of this documentation, page 56.
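As a sketch, the corresponding entries in /etc/security/limits.conf on each node could look like the following (the exact values, and whether the soft or the hard limit or both need to be raised, depend on your system policy):

*    soft    stack    unlimited
*    hard    stack    unlimited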

OpenMP/MPI hybrid version

Some TURBOMOLE modules like dscf, grad, aoforce, ricc2 or pnoccsd are parallelized using a hybrid OpenMP/MPI scheme. For those modules it is sufficient to start just a single process per node. In addition, please set

export OMP_NUM_THREADS=<number of cores per node>

when starting the job. This environment variable will be exported to each node such that the processes started there will open <number of cores per node> threads.
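As a sketch, for two nodes with 16 cores each (node names and core counts are placeholders), a hosts file that lists each node once, combined with the following settings, would start one MPI process per node, each opening 16 threads:

export HOSTS_FILE=/nfshome/username/hostsfile
export PARNODES=2
export OMP_NUM_THREADS=16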

Memory for parallel jobs

Since there are several different parallel versions of the individual TURBOMOLE modules available, the meaning of the keywords to set memory ($ricore and $maxcor) can be quite confusing. A lot of problems can be avoided if the following points are taken care of:

Testing the parallel binaries

The binaries ridft, rdgrad, dscf, grad, and ricc2 can be tested with the usual test suite: go to $TURBODIR/TURBOTEST and call TTEST.

Note: Some of the tests are very small and will only pass properly if at most 2 CPUs are used. Therefore TTEST will not run any test if $PARNODES is set to a value higher than 2.

If you want to run some of the larger tests with more CPUs, you have to edit the DEFCRIT file in $TURBODIR/TURBOTEST and change the $defmaxnodes option.
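For example, to allow tests on up to 8 CPUs the corresponding line in DEFCRIT might be changed to something like the following (the exact format of the DEFCRIT file may differ between versions, so treat this only as a sketch):

$defmaxnodes 8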

Linear Algebra Settings

The number of CPUs used by and the algorithm chosen for the linear algebra part of TURBOMOLE depend on the setting of $parallel_platform:

cluster – for clusters with TCP/IP interconnect. Communication is avoided by using an algorithm that involves only one or a few CPUs.
MPP – for clusters with a fast interconnect like InfiniBand or Myrinet. The number of CPUs that take part in the calculation of the linear algebra routines depends on the size of the input and on the number of nodes that are used.
SMP – all CPUs are used and ScaLAPACK (see http://www.netlib.org/scalapack/) routines are involved.

The scripts in $TURBODIR/mpirun_scripts automatically set this keyword depending on the output of sysname. All options can be used on all systems, but the SMP setting in particular can slow down the calculation if it is used on a cluster with high latency or low bandwidth.
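If you want to override the automatic choice, a sketch of the corresponding entry in the control file, using one of the three values listed above, would be:

$parallel_platform SMP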

Sample simple PBS start script
#!/bin/sh  
# Name of your run :  
#PBS -N turbomole  
#  
# Number of nodes to run on:  
#PBS -l nodes=4  
#  
# Export environment:  
#PBS -V  
 
# Set your TURBOMOLE paths:  
 
######## ENTER YOUR TURBOMOLE INSTALLATION PATH HERE ##########  
export TURBODIR=/whereis/TURBOMOLE  
###############################################################  
 
export PATH=$TURBODIR/scripts:$PATH  
 
## set locale to C  
unset LANG  
unset LC_CTYPE  
 
# set stack size limit to unlimited:  
ulimit -s unlimited  
 
# Count the number of nodes  
PBS_L_NODENUMBER=`wc -l < $PBS_NODEFILE`  
 
# Check if this is a parallel job  
if [ $PBS_L_NODENUMBER -gt 1 ]; then  
##### Parallel job  
# Set environment variables for a MPI job  
    export PARA_ARCH=MPI  
    export PATH="${TURBODIR}/bin/`sysname`:${PATH}"  
    export PARNODES=`expr $PBS_L_NODENUMBER`  
else  
##### Sequential job  
# set the PATH for Turbomole calculations  
    export PATH="${TURBODIR}/bin/`sysname`:${PATH}"  
fi  
 
#VERY important is to tell PBS to change directory to where  
#     the input files are:  
 
cd $PBS_O_WORKDIR  
 
######## ENTER YOUR JOB HERE ##################################  
jobex -ri > jobex.out  
###############################################################
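Assuming the script above has been saved as turbomole.pbs (the file name is arbitrary), it would be submitted with the usual PBS command:

qsub turbomole.pbs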

3.2.2 Running Parallel Jobs — SMP case

The SMP version of TURBOMOLE currently combines three different parallelization schemes which all use shared memory:

Setting up the parallel SMP environment

In addition to the installation steps described in Section 2 (see page 48) you just have to set the variable PARA_ARCH to SMP, i.e. in sh/bash/ksh syntax:

export PARA_ARCH=SMP

This will cause sysname to append the string _smp to the system name and the scripts like jobex will take the parallel binaries by default. To call the parallel versions of the programs (like ridft or aoforce) from your command line without explicit path, expand your $PATH environment variable to:

export PATH=$TURBODIR/bin/`sysname`:$PATH

The usual binaries are now replaced by scripts that prepare the input for a parallel run and start the job automatically. The number of CPUs that shall be used can be chosen by setting the environment variable PARNODES:

export PARNODES=8

The default for PARNODES is 2.
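As a sketch, a minimal SMP setup for a run on 8 cores looks very similar to the MPI case (the installation path is again a placeholder):

export TURBODIR=/whereis/TURBOMOLE
export PARA_ARCH=SMP
export PATH=$TURBODIR/bin/`sysname`:$PATH
export PARNODES=8
ridft > ridft.out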

NOTE: Depending on what you are going to run, some care has to be taken that system settings like memory limits, etc. do not prevent the parallel versions from running. See the following sections.

OpenMP parallelization of almost all time consuming modules

The OpenMP parallelization does not need any special program startup. The binaries can be invoked in exactly the same manner as for sequential (non-parallel) calculations. Just set the environment variable PARNODES to the number of threads that should be used by the programs. The scripts will set OMP_NUM_THREADS to the same value and start the OpenMP binaries directly. The number of threads is essentially the maximum number of CPU cores the program will try to utilize. To exploit e.g. all eight cores of a machine with two quad-core CPUs set

export PARNODES=8

(for csh and tcsh use setenv PARNODES 8).

Presently the OpenMP parallelization of ricc2 comprises all functionalities apart from the O(N⁴)-scaling LT-SOS-RI functionalities (which are only parallelized with MPI) and expectation values for Ŝ² (not parallelized). Note that for an OpenMP-parallel calculation the memory specified with $maxcor is the maximum amount of memory that will be dynamically allocated by all threads together. To use your computational resources efficiently, it is recommended to set this value to about 75% of the physical memory available for your calculation.
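As an illustration, on a node with 64 GB of memory that is used exclusively for one OpenMP ricc2 calculation, about 0.75 × 64 GB ≈ 48 GB could be given to the program; assuming $maxcor is specified in MB, this would correspond to a control-file line like:

$maxcor 48000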

For Localized Hartree-Fock calculations please use the dscf program which is parallelized using OpenMP. In this case an almost ideal speedup is obtained because the most expensive part of the calculation is the evaluation of the Fock matrix and of the Slater-potential, and both of them are well parallelized. The calculation of the correction-term of the grid will use a single thread.

The OpenMP parallelization of riper covers all contributions to the Kohn-Sham matrix and nuclear gradient. Hence an almost ideal speedup is obtained.

Restrictions:

Multi-thread parallelization of dscf, grad, aoforce, escf, egrad, ridft and rdgrad

The parallelization of those modules is described in [26] and is based on fork() and Unix sockets. In addition to setting PARNODES, which triggers the environment variable SMPCPUS, the environment variable

export TM_PAR_FORK=on

has to be set. Alternatively, the binaries can be called with the -smpcpus <N> command line option or with the keyword $smp_cpus in the control file.
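A sketch of the three alternatives for, e.g., 8 threads (the value given with $smp_cpus is only an assumed example):

export TM_PAR_FORK=on
export PARNODES=8          # translated into SMPCPUS by the scripts
dscf > dscf.out
# alternatively, call the binary directly:
#   dscf -smpcpus 8 > dscf.out
# or add "$smp_cpus 8" to the control file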

The efficiency of the parallelization is usually similar to that of the default version, but for ridft and rdgrad RI-K is not parallelized. If the density convergence criterion ($denconv) is switched on in ridft and no RI-K is used, the multi-threaded version should be used.

SMP/MPI version of ridft and rdgrad

Since TURBOMOLE version 7.2 the use of Global Arrays has been dropped. Instead, a set of routines that utilize shared memory on a node has been implemented. Both modules, ridft and rdgrad, start each process as an individual MPI instance. Processes on the same node are then collected so that they collectively store and use data in a shared memory region. This avoids excessive memory usage and reduces the memory requirements significantly, especially compared to the old MPI implementation (which was used by default in former TURBOMOLE versions). It is nevertheless recommended to