OpenMP

Getting Started with OpenMP and the Intel Win32 compiler

OpenMP consists of a set of compiler directives, plus a small run-time library, that extend a C++ or Fortran compiler to take advantage of multi-core processors with shared memory. These directives are added as pragmas in C++ or as special comments in Fortran. This means that a compiler without OpenMP support simply ignores the directives, and a program that uses only directives still compiles and runs serially as before.

 Hello World

 Let’s create an OpenMP Hello World example in Visual Studio as follows:

  1. Start a new project with File > Open Project > Intel Fortran projects > Console Application and give it the name HelloWorld.
     
  2. Enter the following code into the HelloWorld.f90 source:
     
    program hello90
    use omp_lib          ! declares omp_get_thread_num and omp_get_num_threads
    implicit none
    integer :: id, nthreads
    !$omp parallel private(id)
      id = omp_get_thread_num()
      write (*,*) 'Hello World from thread', id
    !$omp barrier
      if ( id .eq. 0 ) then
        nthreads = omp_get_num_threads()
        write (*,*) 'There are', nthreads, 'threads'
      end if
    !$omp end parallel
    end program hello90

 

  3. Go to Project > HelloWorld Properties > Fortran > Language and, for the option Process OpenMP directives, select Generate Parallel Code (this adds the /Qopenmp switch to the project).
     
  4. Build and run the project.

    If running on a dual-core processor, the output will be similar to the following (the order of the two Hello World lines can vary from run to run):
     
    Hello World from thread                      0
    Hello World from thread                      1
    There are                         2 threads



In the above example, we are using the following directives:

!$omp parallel private(id)    Defines the beginning of a parallel region and keeps the variable id private to each thread

!$omp barrier                 Synchronizes all threads; each thread waits until all threads have reached this point

!$omp end parallel            Ends the parallel region
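
Because the Hello World program calls OpenMP library routines directly, it will typically fail to link once OpenMP processing is switched off. The following is a minimal sketch of one way to keep such a program buildable both ways, using the OpenMP conditional-compilation sentinel !$ (a line starting with !$ followed by a space is compiled only when OpenMP is enabled and is an ordinary comment otherwise). The program name portable_hello is hypothetical and is not part of the original example.

```fortran
program portable_hello
  ! Compiled only when OpenMP processing is enabled; a plain comment otherwise
  !$ use omp_lib
  implicit none
  integer :: id

  !$omp parallel private(id)
  ! Serial fallback value; also used when the directives are ignored
  id = 0
  ! Overwrite with the real thread number only in an OpenMP build
  !$ id = omp_get_thread_num()
  write (*,*) 'Hello World from thread', id
  !$omp end parallel
end program portable_hello
```

Built without OpenMP, this should print a single line for thread 0; built with the Generate Parallel Code option, it should print one line per thread.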

DO Loops

One of the first places to look when parallelizing an existing program is the DO loop:

!$OMP PARALLEL DO PRIVATE(NUM) SHARED(X,A,B,C)
! Specifies a parallel region that
! implicitly contains a single DO directive
DO I = 1, 1000
  NUM = FOO(B(I), C(I))
  X(I) = BAR(A(I), NUM)
! Assume FOO and BAR have no other effect
ENDDO
!$OMP END PARALLEL DO

 

In this case, the PRIVATE clause makes NUM private to each thread, because it holds the intermediate result of each iteration. The SHARED clause lets all threads share X, A, B and C: A, B and C are only read, and each iteration writes a different element of X, which holds the final results. The loop index I is made private to each thread automatically.
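
To try the loop out, it needs some surrounding code. The following is a minimal, self-contained sketch, assuming simple stand-in definitions for FOO and BAR (the module demo_funcs, the function bodies and the array values are all hypothetical; the original does not say what FOO and BAR compute).

```fortran
module demo_funcs
  implicit none
contains
  ! Hypothetical stand-in for FOO: no side effects, as the loop requires
  pure function foo(b, c) result(r)
    real, intent(in) :: b, c
    real :: r
    r = b + c
  end function foo

  ! Hypothetical stand-in for BAR: no side effects either
  pure function bar(a, num) result(r)
    real, intent(in) :: a, num
    real :: r
    r = a * num
  end function bar
end module demo_funcs

program do_loop_demo
  use demo_funcs
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), b(n), c(n), x(n), num
  integer :: i

  a = 2.0
  b = 1.0
  c = 3.0

  ! num is a per-iteration scratch value, so each thread needs its own copy;
  ! a, b and c are only read, and each iteration writes its own x(i)
  !$omp parallel do private(num) shared(x, a, b, c)
  do i = 1, n
    num = foo(b(i), c(i))
    x(i) = bar(a(i), num)
  end do
  !$omp end parallel do

  ! With the stand-in functions, every element should be 2.0 * (1.0 + 3.0)
  print *, 'all elements equal 8.0: ', all(x == 8.0)
end program do_loop_demo
```

The result is the same whether or not OpenMP processing is switched on, which is a quick check that the PRIVATE and SHARED clauses have been chosen correctly.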

SECTIONS

Now let’s look at a program that parallelizes two sections of code. This is an F90 translation of a C program found at http://www.kallipolis.com/openmp/.  Again, note that very few extra lines of code are required to split the program into sections and this example really shows the time-saving of using OpenMP.

program taylor
   use ifport
   implicit none
!
! This program calculates the value of e*pi by first calculating e
! and pi by their taylor expansions and then multiplying them together.
 
  integer, parameter :: num_steps = 20000000
 
  double precision starttime, stoptime ! timer variables
  double precision e, pi, factorial, product
  integer i
 
! start the timer
  call clockx(starttime)
!$omp parallel sections shared(e,pi)
!$omp section
! First we calculate e from its taylor expansion
  print *,'e started'
  e = 1
  factorial = 1
  do i=1,num_steps
    factorial = factorial * i
    e = e + 1.0/factorial
  enddo
  print *,'e done'
!$omp section 
! Then we calculate pi from its taylor expansion
  print *,'pi started'
 
  pi = 0
  do i = 0,num_steps*10
    pi = pi + 1.0/(i*4.0 + 1.0)
    pi = pi - 1.0/(i*4.0 + 3.0)
  enddo
  pi = pi * 4.0
  print *,'pi done'
!$omp end parallel sections
! Both sections have finished at this point, so e and pi are safe to combine
  product = e * pi
 
  call clockx(stoptime)
 
  print *,'Reached result ',product,' in ',(stoptime-starttime)/1000000, ' seconds'
end program taylor

  

Try loading this program into an Intel Visual Fortran 9.1/10.0 project in Visual Studio, but leave the Process OpenMP directives switch off in Project > Properties > Fortran > Language.

The results on a Pentium D 820 in a Dell 9100 with 4 GB of RAM were:

e started
e done
pi started
pi done
Reached result    8.53973421586912       in    22.7030000000000       seconds

Now turn the Process OpenMP directives switch back to Generate Parallel Code, rebuild and run again. This gives the following results:

e started
e done
pi started
pi done
Reached result    8.53973421586912       in    8.98400000000000       seconds

Bringing the elapsed time down from 22.7 seconds to 9.0 seconds, a speed-up of about 2.5 times on a dual-core processor, seems too good to be true. The explanation from Intel is that the Windows scheduler can be inefficient when running a heavy task on a single CPU and actually gives it less CPU time than is available.
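
The timings above rely on clockx from Intel's ifport module. For comparing serial and parallel builds with other compilers, the following is a minimal sketch of the same measurement pattern using the standard system_clock intrinsic. The program name time_sections and the two sums it computes are hypothetical stand-ins for real work, and the loop index is listed in a private clause simply to make its per-thread status explicit.

```fortran
program time_sections
  implicit none
  integer, parameter :: n = 20000000
  integer :: t_start, t_stop, ticks_per_second
  double precision :: sum_a, sum_b
  integer :: i

  ! Read the clock and its resolution before the parallel work starts
  call system_clock(t_start, ticks_per_second)

  !$omp parallel sections shared(sum_a, sum_b) private(i)
  !$omp section
  ! First independent piece of work
  sum_a = 0.0d0
  do i = 1, n
    sum_a = sum_a + 1.0d0 / (dble(i) * dble(i))
  end do
  !$omp section
  ! Second independent piece of work
  sum_b = 0.0d0
  do i = 1, n
    sum_b = sum_b + 1.0d0 / (dble(i) * dble(i) * dble(i))
  end do
  !$omp end parallel sections

  ! Stop the clock only after both sections have completed
  call system_clock(t_stop)

  print *, 'sums: ', sum_a, sum_b
  print *, 'elapsed seconds: ', dble(t_stop - t_start) / dble(ticks_per_second)
end program time_sections
```

Building this once with OpenMP processing off and once with it on gives two elapsed times that can be compared in the same way as the figures above. The sketch ignores the possibility of the clock count wrapping around during the run.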

Looking at the Performance comparison in the Task Manager, we see that the first run (on the left) uses only 50% of the CPU, whereas the OpenMP version (on the right) makes 100% use of the CPU.

Viewing OpenMP performance in Task Manager

Identifying bottlenecks using Intel VTune

Now let’s look at a larger program and see how we can identify the hotspots that are candidates for parallelization. We’ve taken the Intel QuickWin sample program SCIGRAPH supplied with the V9Samples download, inserted the Taylor code above as a subroutine, and called it from one of the routines in the demo. Build and run it (again with the Process OpenMP directives switch off) to make sure all is OK.

 

So, how do we use Intel’s VTune analyzer to find out what is going on? As it’s integrated into Visual Studio, we can do everything within the same development environment. Click on Tuning > Create New Activity and select Sampling Wizard, then browse for the .exe file that has just been created and run the application. When it has finished, click on Tuning > Get Tuning Advice. This launches the Tuning Assistant, which gives lots of information relating to processor and program activity.


The Hotspots Insights button is where we want to go to see a list of processor-intensive activity. Top of the list is our newly inserted module SCIGRAPHDEMO_mp_TAYLOR, showing the total number of clockticks, parallel activity and processor utilization. If we click on the module name, it takes us straight to the source code (provided the project was built with Full Debug), so that we can start looking at the code that is taking the most time. In our case, we have already identified the code by putting in some OpenMP directives, so now let’s build again with the Process OpenMP directives switch back on. Run a new Activity and look at the Tuning Advice again.

 

The other useful tuning activity is the Call Graph Wizard. This gives a pictorial representation of program flow by showing which modules call which, and again gives useful information about processing time, how many times each module is called and which thread it has been running in. Again, the view can be drilled down right through to the source to check the actual code of the module being analyzed. (To be able to view the source code, you will need to build the application with the switch Project > Properties > Linker > Advanced > Fixed Base Address set to Generate a relocation section, otherwise you will receive an error about no Base relocations available when building.)

This is just a small flavour of using OpenMP to parallelize an existing program. Check out the links below for further references and code examples.

Links:

Official OpenMP site      http://www.openmp.org/

Start with OpenMP         http://www.devx.com/go-parallel/Article/33633

Intel Multi-core portal      http://www.intel.com/cd/ids/developer/asmo-na/eng/328626.htm