Getting Started with OpenMP and the Intel Win32 compiler
OpenMP consists of a set of compiler directives that extend a C++ or Fortran compiler to take advantage of a multi-core processor with shared memory. These directives are added as pragmas in C++ or as special comments in Fortran. This means that a compiler that doesn’t support OpenMP will simply ignore the directives, and the program will still compile and run as before, provided it makes no direct calls to OpenMP runtime routines such as omp_get_thread_num.
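As a minimal sketch of how a program can stay buildable both ways, the OpenMP conditional compilation sentinel `!$` can guard the lines that touch the runtime library. The module name thread_info and the function my_thread are made up for this illustration; only omp_lib and omp_get_thread_num come from OpenMP itself.

```fortran
! Sketch only: thread_info and my_thread are hypothetical names.
module thread_info
  !$ use omp_lib            ! seen only when OpenMP processing is switched on
  implicit none
contains
  ! Returns the calling thread's number, or 0 in a build without OpenMP.
  integer function my_thread()
    my_thread = 0
    !$ my_thread = omp_get_thread_num()
  end function my_thread
end module thread_info

program portable_hello
  use thread_info
  implicit none
  !$omp parallel
  write (*,*) 'Hello from thread', my_thread()
  !$omp end parallel
end program portable_hello
```

Built without OpenMP, every line starting with `!$` is an ordinary comment, so the program prints a single greeting from thread 0. Built with OpenMP, each thread in the parallel region prints its own number.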
Hello World
Let’s create an OpenMP Hello World example in Visual Studio as follows:
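    program hello90
      use omp_lib        ! declares omp_get_thread_num and omp_get_num_threads
      implicit none
      integer :: id, nthreads
    !$omp parallel private(id)
      id = omp_get_thread_num()
      write (*,*) 'Hello World from thread', id
    !$omp barrier
      if ( id .eq. 0 ) then
        nthreads = omp_get_num_threads()
        write (*,*) 'There are', nthreads, 'threads'
      end if
    !$omp end parallel
    end program

On a dual-core machine this produces output such as the following (the order of the first two lines can vary from run to run):

    Hello World from thread 0
    Hello World from thread 1
    There are 2 threads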
In the above example, we are using the following:
!$omp parallel private(id): defines the beginning of a parallel region and keeps the variable id private to each thread
!$omp barrier: synchronizes all threads; each thread waits until all threads have reached this point
!$omp end parallel: marks the end of the parallel region
DO Loops
One of the first places to look when parallelizing an existing program is the DO loop:
    ! Specifies a parallel region that
    ! implicitly contains a single DO directive
    !$OMP PARALLEL DO PRIVATE(NUM), SHARED(X,A,B,C)
    DO I = 1, 1000
      NUM = FOO(B(I), C(I))
      X(I) = BAR(A(I), NUM)
      ! Assume FOO and BAR have no other effect
    ENDDO
    !$OMP END PARALLEL DO
In this case, the directive states that the variable NUM is to be private to each thread, because it holds the intermediate result of each iteration. The variables X, A, B and C can be shared: A, B and C are only read, and each iteration writes a different element of X, which holds the final results of the calculations.
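To try the loop as a complete program, something has to stand in for FOO and BAR, which the fragment above leaves undefined. The sketch below assumes two trivial pure functions (a sum and a product) and a wrapper subroutine called fill; these names and bodies are invented for the illustration, not taken from any library.

```fortran
! Sketch only: foo, bar and fill are hypothetical stand-ins.
module loop_demo
  implicit none
contains
  pure double precision function foo(b, c)
    double precision, intent(in) :: b, c
    foo = b + c
  end function foo

  pure double precision function bar(a, num)
    double precision, intent(in) :: a, num
    bar = a * num
  end function bar

  ! Computes x(i) = bar(a(i), foo(b(i), c(i))) for every element of x.
  subroutine fill(x, a, b, c)
    double precision, intent(out) :: x(:)
    double precision, intent(in)  :: a(:), b(:), c(:)
    double precision :: num
    integer :: i
    ! num is per-thread scratch space; each iteration writes its own x(i)
    !$omp parallel do private(num) shared(x, a, b, c)
    do i = 1, size(x)
      num  = foo(b(i), c(i))
      x(i) = bar(a(i), num)
    end do
    !$omp end parallel do
  end subroutine fill
end module loop_demo

program do_loop_demo
  use loop_demo
  implicit none
  integer, parameter :: n = 1000
  double precision :: x(n), a(n), b(n), c(n)
  integer :: i
  do i = 1, n
    a(i) = 2.0d0
    b(i) = dble(i)
    c(i) = 1.0d0
  end do
  call fill(x, a, b, c)
  write (*,*) 'x(1) =', x(1), ' x(n) =', x(n)
end program do_loop_demo
```

With these inputs x(i) = 2*(i + 1), so the result is the same whether or not the OpenMP directives are processed. That makes it easy to check that the parallel build has not changed the answer.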
SECTIONS
Now let’s look at a program that parallelizes two sections of code. This is an F90 translation of a C program found at http://www.kallipolis.com/openmp/. Again, note that very few extra lines of code are required to split the program into sections and this example really shows the time-saving of using OpenMP.
    program taylor
      use ifport
      implicit none
      !
      ! This program calculates the value of e*pi by first calculating e
      ! and pi by their taylor expansions and then multiplying them together.
      integer, parameter :: num_steps = 20000000
      double precision starttime, stoptime   ! timer variables
      double precision e, pi, factorial, product
      integer i
      ! start the timer
      call clockx(starttime)
    !$omp parallel sections shared(e,pi) private(i)
    !$omp section
      ! First we calculate e from its taylor expansion
      print *,'e started'
      e = 1
      factorial = 1
      do i=1,num_steps
        factorial = factorial * i
        e = e + 1.0d0/factorial
      enddo
      print *,'e done'
    !$omp section
      ! Then we calculate pi from its taylor expansion
      print *,'pi started'
      pi = 0
      do i = 0,num_steps*10
        pi = pi + 1.0d0/(i*4.0d0 + 1.0d0)
        pi = pi - 1.0d0/(i*4.0d0 + 3.0d0)
      enddo
      pi = pi * 4.0d0
      print *,'pi done'
    !$omp end parallel sections
      ! both sections have finished here, so e and pi are complete
      product = e * pi
      call clockx(stoptime)
      print *,'Reached result ',product,' in ',(stoptime-starttime)/1000000, ' seconds'
    end program taylor
Try loading this program into an Intel Fortran 9.1 or 10.0 project in Visual Studio, but leave the Process OpenMP Directives switch off in Project – Properties – Fortran – Language.
The results on a Pentium D 820 in a Dell 9100 with 4 GB of RAM were:
|
e started |
Now, turn the Process OpenMP Directives switch back to Generate Parallel Code, rebuild, and we get the following results:
|
e started |
Bringing the elapsed time down from 22.7 seconds to 8.9 seconds is a speed-up of about 2.5, more than the factor of two that a second core alone would explain. This seems too good to be true, but the Intel explanation is that the Windows scheduler can be inefficient when running heavy tasks on a single CPU and actually gives less CPU time than could be available!
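The timings above rely on clockx from Intel's ifport module. For anyone repeating the comparison with a compiler that lacks ifport, the following is a rough sketch of a wall-clock timer built on the standard system_clock intrinsic. The module name wall_timer, the function wall_seconds and the small summation loop are invented for this illustration.

```fortran
! Sketch only: wall_timer and wall_seconds are hypothetical names.
module wall_timer
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)   ! 64-bit counter
contains
  ! Wall-clock time in seconds, measured from an arbitrary origin.
  double precision function wall_seconds()
    integer(i8) :: ticks, ticks_per_sec
    call system_clock(ticks, ticks_per_sec)
    if (ticks_per_sec > 0) then
      wall_seconds = dble(ticks) / dble(ticks_per_sec)
    else
      wall_seconds = 0.0d0        ! no clock available on this system
    end if
  end function wall_seconds
end module wall_timer

program time_it
  use wall_timer
  implicit none
  double precision :: t0, t1, s
  integer :: i
  t0 = wall_seconds()
  s = 0.0d0
  do i = 1, 10000000               ! some work to time
    s = s + 1.0d0 / dble(i)
  end do
  t1 = wall_seconds()
  write (*,*) 'sum =', s, ' elapsed (s) =', t1 - t0
end program time_it
```

Taking the two readings outside the parallel region, as here, measures the elapsed time of the whole region rather than the time of whichever section happens to finish first.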
Looking at the Performance comparison in the Task Manager, we see that the first run (on the left) uses only 50% of the CPU, whereas the OpenMP version (on the right) makes 100% use of the CPU.

Identifying bottlenecks using Intel VTune
Now let’s look at a larger program and see how we can identify where the hotspots are for parallelization. We’ve taken the Intel QuickWin sample program SCIGRAPH supplied with the V9Samples download, inserted the Taylor code above as a subroutine, and called it from one of the routines in the demo. Build and run it (again with the Process OpenMP Directives switch off) to make sure all is OK.
So, how do we use Intel’s VTune analyzer to find out what is going on? As it’s integrated into Visual Studio, we can do everything within the same development environment. Click on Tuning – Create New Activity and select Sampling Wizard, then browse for the .exe file that has just been created and run the application. When it has finished, click on Tuning – Get Tuning Advice. This launches the Tuning Assistant, which gives lots of information relating to processor and program activity.

The Hotspots Insights button is where we want to go to see a list of processor-intensive activity. Top of the list is our newly inserted module SCIGRAPHDEMO_mp_TAYLOR, showing the total number of clockticks, parallel activity and processor utilization. If we click on the module name, it takes us straight to the source code (provided the project was built with Full Debug), so that we can start looking at the code that is taking the most time. In our case, we have already identified the code by putting in some OpenMP directives, so now let’s build again with the Process OpenMP Directives switch back on. Run a new Activity and look at the Tuning Advice again.
The other useful tuning activity is the Call Graph Wizard. This shows a pictorial representation of program flow by showing which modules call which, and again gives useful information about process time, how many times each module is called and which thread it has been running in. Again, the view can be drilled down right through to the source to check the actual code of the module being analyzed. (To be able to view the source code, you will need to build the application with the switch Project – Properties – Linker – Advanced – Fixed Base Address set to Generate a relocation section; otherwise you will receive an error about no Base relocations available when building.)
This is just a small flavour of using OpenMP to parallelize an existing program. Check out the links below for further references and code examples.
Links:
Official OpenMP site http://www.openmp.org/
Start with OpenMP http://www.devx.com/go-parallel/Article/33633
Intel Multi-core portal http://www.intel.com/cd/ids/developer/asmo-na/eng/328626.htm