Programs for Programmers

VTUNE Optimization

 

 

Optimizing Fortran programs using Intel VTUNE Performance Analyzer

My favourite programming quotation comes from Michael Jackson (really!). This Michael Jackson doesn't have a "Greatest Hits" album, but if he did, I'm sure that his two rules of optimization(1) would be in there:

  • Rule 1. Don't do it.
  • Rule 2. (for experts only) Don't do it yet.

It's not that optimization is a bad thing - in fact, Chapter 12 of Jackson's book is devoted to - you guessed it - optimization. The problem is that it's too often done wrongly. We dive in, and rewrite perfectly good code in a way which is less clear but, we suppose, more efficent. As often as not, the supposition of greater speed is wrong, and, in any case, the code we have mangled doesn't contribute significantly to the overall execution time. Things get worse when the code has to be changed. Most bugs are not designed in; they are introduced when working code is changed. Even if the original "optimizations" are bug free, subsequent changes will be more error-prone if the code is optimized for speed rather than clarity.

Jackson's rules can be restated - with less style - as follows:

  • Rule 1. Don't optimize without a good reason. There should be a real and measurable benefit in view.
  • Rule 2. Only complete working programs should be optimized. Don't fuss about speed until the program works.

When we do, eventually, get around to optimzation, we should tackle it in a pragmatic, hard-nosed way. Tools like Intel's VTUNE Analyzer should be used to identify "hot-spots" and "critical paths". Often, the majority of CPU time is spent in half a dozen lines of code, and tiny changes can have a huge impact on execution times.

VTUNE Analyzer is an immensely capable tool, but the wealth of features can make it daunting for first time users.  The easiest way to get started is to use the "Quick Performance Analysis Wizard". Just enter the name of the executable, the working directory, and any command line parameters, and click "Go". VTune Analyzer runs your application a couple of times, and presents its findings. There is a huge amount of information, accessible in tabular and graphical form with endless possibilities for exploration. VTune Analyzer uses 3 different methods to analyse performance:

  1. Counter monitoring.

    VTune Analyzer can monitor various processor and system counters as your program runs. These include Memory Available, Processor Queue Length, Context Switches per Second, Redirector Network Errors per Second and many others. You can highlight an area of concern on one of the graphs, and call up the "Tuning Assistant" which gives tips on how to improve performance. For example, VTune Analyzer may advise you that your program is short of memory, or could benefit from a second processor.

    Intel VTune Analyzer - Counter Monitor
  2. Sampling.

    VTune Analyzer notes what instruction was being executed at regular intervals (typically every ms) and compiles charts showing what tasks used most CPU time. VTune Analyzer allows you to drill down from a chart showing CPU usage by process, to threads (for multi-threaded applications), on to subprograms and eventually to particular lines of source code. This facility depends on the code having been compiled for debugging so that instruction addresses can be mapped back to lines of source code. If there is no debug information, VTune Analyzer displays disassembled machine code - for experts only!

    Intel VTune Analyzer Hotspots
  3. Call Graph Analysis.

    Unlike the first two analysis tools, Call Graph Analysis is intrusive. VTune Analyzer automatically instruments your executable so that the calling structure is analysed at run-time. Using Call Graph Analysis, VTune Analyzer goes beyond hotspots (as identified by Sampling) to identification of critical execution paths. For example, you may know that a low level linear algebra routine used by your solution engine is using most CPU, and have devoted lots of effort to speeding it up. You may also know that the routine is also used elsewhere in the program, and assume that those uses do not contribute significantly to the total CPU time; Call Graph Analysis will allow you to separate those different uses, and check your assumptions.

    Intel VTune Analyzer - call graph

You might think that, being an Intel. product, VTUNE Analyzer would baulk at running on an Athlon based computer. To test this theory, I did most of my testing on an Athlon, and have to report that there were no problems at all. Apparently the main difference is that some Pentium 4 hardware counters are not present in an Athlon. Also tuning advice is sometimes specific to Intel processors. Clearly, we can't expect Intel. to go out of their way to support AMD specific extensions, but they do support properly compatible features in non-Intel. processors.

Case Study - A large fluid dynamics program compiled using Compaq Visual Fortran

This program has a long history. In many routines, the change-logs go back 20 years and are bigger than the code itself. Counter monitoring showed little of interest, but Sampling immediately indicated that nearly half the CPU time was being spent in a routine which was not generally considered to be a bottleneck. Within that routine almost all the time was spent copying data between two arrays. The data was copied one way before a call to a subroutine called LOWER, and the other was after return.

First, I tried to reduce the time taken by these loops. Using array notation appeared to make the restore step (after return from LOWER) much faster. So

rv(1:m1) = ru(1:m1) rl(1:m1) = ru(m1+1:2*m1)<br />

was much faster than

DO k = 1,m1 rv(k) = ru(k) rl(k) = ru(k+m1) END DO<br />

This small change reduced the total execution time by 10%. However using array notation actually slowed the other copy (into ru). VTune Analyzer allows you to view the disassembled machine code with the Fortran source code, so it's quite possible to follow up little mysteries like this. However, I decided to look first at how ru was used.

Examination of LOWER revealed that only the first part of ru was actually used, so assignments involving rl were redundant. Removing these two assignments immediately reduced the execution time of the program by a gratifying 30%.

The next clue came from a review of the change log. It turned out that the time-consuming copy/restore operation resulted from an ill-conceived attempt to improve the program structure by removing EQUIVALENCE statements. To this end, the programmer decided to make a local copy of some data, but did not notice that the program took nearly twice as long to run as a result. Over the years, the local copy had been made global, and used elsewhere in the program, making it quite difficult to analyse.

Removing references to the "local copy" array made the program run 40% faster than when we started. Sampling of the revised code showed that no routine accounted for more than a few percent of the total CPU time, and that diminishing returns had set in very quickly. Any further gains would have to be hard won.

(1) "Principles of Program Design" by M.A. Jackson, Associated Press 1975 ISBN 0 12 379050 6

(C) Polyhedron Software Ltd. (2002)
By John Appleyard