Using OpenMP or Auto-parallelism

Parallelism beyond a single node (16 CPUs on the Condo cluster) requires the use of MPI,
but MPI requires major changes to an existing program. There are two ways to
get parallelism within a single 16-CPU node: automatic parallelism
(the -parallel Intel compiler option) or OpenMP
(the -openmp Intel compiler option).

The simplest way to get parallel execution is to add '-parallel' to your compile
command.
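
As a rough illustration of what automatic parallelization can and cannot do, the sketch below contrasts a loop with independent iterations against one with a loop-carried dependence. The file name, array size, and compile line are assumptions chosen for illustration; whether the compiler actually parallelizes the first loop depends on its own cost heuristics.

```fortran
! Hypothetical file name: autopar.f90
! Possible compile line (Intel Fortran): ifort -parallel autopar.f90
program autopar
  implicit none
  integer, parameter :: n = 1000000      ! assumed problem size
  integer :: i
  double precision, allocatable :: a(:), b(:)

  allocate(a(n), b(n))
  do i = 1, n
     b(i) = dble(i)
  end do

  ! Independent iterations: each a(i) depends only on b(i), so the
  ! compiler is free to split this loop across threads.
  do i = 1, n
     a(i) = 2.d0*b(i) + 1.d0
  end do

  ! Loop-carried dependence: a(i) needs a(i-1) from the previous
  ! iteration, so a compiler will normally leave this loop serial.
  do i = 2, n
     a(i) = a(i-1) + b(i)
  end do

  print *, 'a(n) =', a(n)
end program autopar
```

Loops of the first kind are the ones most likely to benefit from -parallel without any source changes.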

For more information on automatic parallelization with Intel compilers refer to
this document.

Another simple way to obtain parallelism is by using OpenMP,
which can be used to express parallelism on a shared memory machine.
Since each of the nodes on the Condo cluster is a shared memory
machine with 16 processors, OpenMP can be used to obtain
parallelism across those 16 processors.
It requires changes to the program, but far fewer than MPI
does. (The performance gains are generally smaller than with MPI, but
larger than with automatic parallelism.)

For example, placing the OpenMP directive

    !$OMP PARALLEL DO

just before

    do j=2,n-1
       do i=2,m-1
          a(i,j)=(b(i,j+1)+b(i,j-1)+b(i-1,j)+b(i+1,j)+4.d0*b(i,j))/6.d0
       enddo
    enddo

signals to an OpenMP compiler that the j loop can be performed on multiple
processors.
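
For context, a complete program built around the loop above might look like the following sketch. The file name, grid size, and initial values are assumptions added for illustration; only the loop nest and the directive come from the text. Note that the directive sentinel is spelled `!$OMP`.

```fortran
! Hypothetical file name: stencil.f90
! Possible compile line (Intel Fortran): ifort -openmp stencil.f90
program stencil
  implicit none
  integer, parameter :: m = 1000, n = 1000   ! assumed grid size
  integer :: i, j
  double precision, allocatable :: a(:,:), b(:,:)

  allocate(a(m,n), b(m,n))
  a = 0.d0
  b = 1.d0                                   ! assumed initial data

  ! Iterations of the j loop are divided among the threads.
  ! Each a(i,j) is written by exactly one iteration and b is only
  ! read, so the iterations do not interfere with one another.
  !$OMP PARALLEL DO
  do j = 2, n-1
     do i = 2, m-1
        a(i,j) = (b(i,j+1) + b(i,j-1) + b(i-1,j) + b(i+1,j) + 4.d0*b(i,j))/6.d0
     end do
  end do
  !$OMP END PARALLEL DO

  print *, 'a(2,2) =', a(2,2)
end program stencil
```

The loop is safe to parallelize because it reads only from b and writes only to a; a version that updated b in place would not be.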

To run the program, issue

setenv OMP_NUM_THREADS 16
./a.out

and the program will run with 16 "threads", one on each of
the 16 processors. Everything runs on a single thread until the
above directive is reached; at that point each of the threads performs
about 1/16th of the work in the j loop.
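
One way to see this serial-then-parallel behavior is to have each thread report itself. The sketch below is a minimal example, assuming the program is compiled with OpenMP enabled; the file and program names are hypothetical, while `omp_lib` and the `omp_get_*` routines are part of the standard OpenMP runtime library.

```fortran
! Hypothetical file name: threads.f90 (must be compiled with OpenMP enabled)
program threads
  use omp_lib          ! standard OpenMP runtime library module
  implicit none

  ! Serial part: only the initial thread executes this.
  print *, 'serial region, max threads available:', omp_get_max_threads()

  ! Parallel region: every thread in the team executes the body once.
  !$OMP PARALLEL
  print *, 'hello from thread', omp_get_thread_num(), 'of', omp_get_num_threads()
  !$OMP END PARALLEL

  ! Back to a single thread after the parallel region ends.
  print *, 'serial again'
end program threads
```

With OMP_NUM_THREADS set to 16, the middle message should appear 16 times, in no particular order, with thread numbers 0 through 15.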

Without the -openmp flag at the compilation step, the directive is
treated as an ordinary comment and ignored.

For C and C++, the directives are written as pragmas (for example, #pragma omp parallel for) rather than as comments.

In general, OpenMP programs run fastest when most of the operations are
on data that is "private" rather than "shared". See the OpenMP Specifications
for the meaning of private and shared data in OpenMP.
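
As a small illustration of the distinction, the sketch below declares the data-sharing attribute of every variable explicitly. The program, its variable names, and the array size are hypothetical; the clauses themselves (DEFAULT, SHARED, PRIVATE, REDUCTION) are standard OpenMP.

```fortran
! Hypothetical file name: private_shared.f90
program private_shared
  implicit none
  integer, parameter :: n = 1000     ! assumed array length
  integer :: i
  double precision :: x(n), total, tmp

  do i = 1, n
     x(i) = dble(i)
  end do

  total = 0.d0
  ! x     : shared, read by all threads but never written in the loop
  ! i,tmp : private, each thread works on its own copy
  ! total : reduction, each thread accumulates a private partial sum
  !         and the partial sums are combined when the loop ends
  !$OMP PARALLEL DO DEFAULT(NONE) SHARED(x) PRIVATE(i, tmp) REDUCTION(+:total)
  do i = 1, n
     tmp = x(i) * x(i)
     total = total + tmp
  end do
  !$OMP END PARALLEL DO

  print *, 'sum of squares =', total
end program private_shared
```

If tmp were shared instead, threads could overwrite each other's values between the two statements in the loop body; making it private removes that conflict. The printed sum should be 333833500 whatever the thread count.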

The Intel compilers on the Condo cluster implement OpenMP 3.0 and support most of the features of OpenMP 4.0.