CrayPAT


Recipes for Profiling with CrayPAT 4.2

Simple Profiling

Application Instrumentation with pat_build

  • No source code or makefile modification required
    • Automatic instrumentation at group (function) level
      • Groups: mpi, io, heap, math SW, user functions …
    • Performs link-time instrumentation
      • Requires object files
      • Instruments optimized code
      • Generates stand-alone instrumented program
      • Preserves original binary
      • Supports sample-based and event-based instrumentation

      Use the following steps to profile your code:

      1. Remove all object files and any user libraries that you want profiled (a make clean will usually do this).

      2. Issue the command module load xt-craypat.

      3. Compile your codes with the options usually used.

      4. Rebuild your code, probably via make.

        • You must rebuild so that the symbols needed for profiling are included in the code.

      5. Run the pat_build command to build an instrumented executable. The command has the form pat_build [-w -u -g <group>] -O apa a.out [a.out+pat].

        • Tracing is currently the only option.

        • -u : Indicates to trace all user functions.

        • -g : Possible groups are mpi, io, heap (blas and lapack coming).

        • Original .o files must still be around.

      6. Run the instrumented executable, for example aprun -n 160 a.out+pat.

      7. Run pat_report on the resulting data file, for example pat_report <datafile>.xf.

        • We highly recommend running pat_report -f ap2 <datafile>.xf to create a compressed .ap2 file that can be used as input to pat_report or Cray Apprentice2.

          • The .ap2 file is portable: it can be moved to any other machine with Cray Apprentice2 and used to generate reports there. The .xf files cannot be used this way.

        • See man pat_report for details.

          See “Performance Measurement and Visualization on the Cray XT” (pdf) for a more complete description of how to use CrayPAT.
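The recipe above can be strung together as a short shell script. The following is a minimal sketch, not a site-supported script. The `run` helper and the `DRY_RUN`, `EXE`, `NPES`, and `XF_FILE` variables are hypothetical names introduced for illustration, `-w -u -g mpi` is just one combination of the options listed in step 5, and a makefile with a `clean` target is assumed. By default the script only prints the commands; set `DRY_RUN=0` on a system where the CrayPAT module is available to execute them.

```shell
#!/bin/bash
# Dry-run sketch of the profiling recipe. All variable names are illustrative.
DRY_RUN="${DRY_RUN:-1}"              # 1 = print commands only, 0 = execute them
EXE="${EXE:-a.out}"                  # executable produced by your normal build
INSTRUMENTED="${EXE}+pat"            # instrumented copy written by pat_build
NPES="${NPES:-160}"                  # process count used in step 6
XF_FILE="${XF_FILE:-<datafile>.xf}"  # placeholder: use the .xf file your run produced

run() {
    # Echo the command, then execute it unless this is a dry run.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

run make clean                                     # step 1: remove old object files
run module load xt-craypat                         # step 2: load the CrayPAT module
run make                                           # steps 3-4: rebuild with the usual options
run pat_build -w -u -g mpi "$EXE" "$INSTRUMENTED"  # step 5: trace user functions and MPI
run aprun -n "$NPES" "$INSTRUMENTED"               # step 6: run the instrumented binary
run pat_report "$XF_FILE"                          # step 7: text report
run pat_report -f ap2 "$XF_FILE"                   # optional: portable .ap2 file
```

On a real system, `module` is normally a shell function, so the script may need to be sourced or run from a login shell for step 2 to take effect.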

      Simple Hardware Performance Counter Data

      1. If you do not yet have an instrumented executable, complete steps 1–5 above; otherwise, continue with the next step.

      2. Set the PAT_RT_HWPC environment variable to a value from 1 to 9.

        • 1 : FP, LS, L1 Misses & TLB Misses

        • 2 : L1 & L2 Data Accesses and Misses

        • 3 : L1 Accesses, Misses, and Bandwidth

        • 4 : Floating Point Mix

      3. Run the instrumented code again.

      4. Run pat_report.
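Steps 2 to 4 amount to setting one environment variable before rerunning. A minimal sketch, assuming a POSIX-style shell: `hwpc_describe` and `HWPC_GROUP` are hypothetical names, and the descriptions are only those given above for groups 1 to 4. Groups 5 to 9 are not described in this document, so the helper says so rather than guessing.

```shell
# Describe the counter groups listed above (groups 5-9 are not covered here).
hwpc_describe() {
    case "$1" in
        1) echo "FP, LS, L1 Misses & TLB Misses" ;;
        2) echo "L1 & L2 Data Accesses and Misses" ;;
        3) echo "L1 Accesses, Misses, and Bandwidth" ;;
        4) echo "Floating Point Mix" ;;
        [5-9]) echo "not described in this document" ;;
        *) echo "invalid: PAT_RT_HWPC must be 1 to 9"; return 1 ;;
    esac
}

HWPC_GROUP="${HWPC_GROUP:-2}"          # counter group to collect
if hwpc_describe "$HWPC_GROUP" >/dev/null; then
    export PAT_RT_HWPC="$HWPC_GROUP"   # read by the instrumented executable at run time
    echo "PAT_RT_HWPC=$PAT_RT_HWPC ($(hwpc_describe "$HWPC_GROUP"))"
fi
# Next: rerun the instrumented code (e.g. aprun -n 160 a.out+pat), then run pat_report.
```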


      Overview of CrayPAT Tools

      Profiling

      Profiling with the Cray tools requires multiple steps. Unlike on the X1E, you must recompile your code. First, load the CrayPAT module (for example, module load craypat). Then recompile your code with ftn or cc (the Cray wrappers) to link in the appropriate Cray performance tools and libraries. If you compile directly with pgf95 or pgcc, the Cray performance libraries are not linked in automatically. Furthermore, if you use Fortran modules, you must compile and link your code with -Mprof=func to get a proper profile.

      Two important man pages to check out are pat and pat_build. Note that the Fortran application programming interface (API) is similar to the C API, except that every Fortran routine accepts an additional argument for the status of the call (which in C is provided as the return value).

      pat_build

      Builds an instrumented version of an executable code.

      > pat_build [options] <executable> <instrumented executable>

      Supports

      • Fortran, C, C++

      • MPI, SHMEM

      Performance measurements

      • Trace based

        • User functions

        • API for fine-grain instrumentation

        • Predefined function groups (mpi, shmem, io, etc.)

      Source code mapping

      • Call stack

      • Line numbers

      pat_run

      User interface to simplify CrayPAT usage. Runs an instrumented executable and generates a report, all in one step.

      The following executes a.out+pat and produces a report measuring the number of floating point operations, calculating the mflop rate, and determining the average number of results produced per vector operation for the traced functions.

      pat_run -O flops,mflops,vl yod -sz 1 a.out+pat

      The following produces a load-balance report showing average versus maximum time per processor (based on wall-clock time) for an MPI program:

      pat_run -O balance yod -sz 4 a.out+pat

      The “-O” option takes a comma-separated list of keywords that specify the following:

      • How to record it: trace

      • Show callers: callers, calltree

      • Show source/line number: source, line

      • Show load balance: balance[.$data][.$by]

        • $data can be samples or time (default), cycles, etc.

        • $by can be pe (default), thread, or ssp
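Because -O takes a comma-separated keyword list and the balance keyword has optional .$data and .$by suffixes, the argument can be assembled mechanically. The following is a small sketch under that reading of the syntax. `join_keywords` and `balance_keyword` are hypothetical helpers, the particular keyword combination is only an illustration, and the command is printed rather than executed.

```shell
# Join arguments with commas: join_keywords a b c -> a,b,c
join_keywords() (
    IFS=,
    echo "$*"
)

# Build balance[.$data][.$by]; both suffixes are optional.
balance_keyword() (
    kw="balance"
    [ -n "$1" ] && kw="$kw.$1"
    [ -n "$2" ] && kw="$kw.$2"
    echo "$kw"
)

OPTS="$(join_keywords trace callers "$(balance_keyword time pe)")"
echo "pat_run -O $OPTS yod -sz 4 a.out+pat"
```

Both helpers run in a subshell, so the temporary `IFS` and `kw` values do not leak into the calling shell.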

      Examples

      To get a basic profile, run the following:

      pat_run -b [pe,]function:source,line [-s percent=relative] yod -sz 1 <instrumented executable>

      The output file contains entries like the following:

       100.0% |    100.0% |  965 |Total
      |-------------------------------------
      |  88.2% |     88.2% |  851 |kron_matmull@module_kron_
      ||------------------------------------
      ||  40.4% |     40.4% |  344 |line.307
      ||  37.0% |     77.4% |  315 |line.297

      To get a call tree, run the following:

      pat_run -b [pe,]function:source,calltree [-s percent=relative] yod -sz 1 <instrumented executable>

      pat_report

      You can directly run an instrumented executable with yod, which will produce a performance-data file (ending in .xf). This file can then be processed into a human-readable text profile using the pat_report command.

      Experiment Types

      There is only one type of performance experiment that you can run: trace.

      See the pat man page for more information.

      Run-Time Library

      Use the PAT run-time library to get statistics on a specific region of code.

      Example

      program test_module_kron
      use pat_api
      integer ierr
      …
      ! Begin region of interest
      call PAT_region_begin ( 1, 'kron_matmul_kernel', ierr ) ! # and name must be unique to each region
      call kron_matmulL(…)
      ! End region of interest
      call PAT_region_end   ( 1, ierr )
      end program

      Compile

      ftn *.f  -o test.exe

      Relink

      pat_build -w test.exe test.exe+pat

      Run and produce a report

      pat_run -g normal [-b function,ssp=HIDE] yod -sz 1 test.exe+pat

      Apprentice2 Visualizer

      Apprentice2 is targeted to help identify and correct:

      • Excessive communication

      • Network contention

      • Load imbalance

      • Excessive serialization

      Supports

      • Call graph profile

      • Communication statistics

      • Timeline view (Must have PAT_RT_SUMMARY set to 0 before running instrumented code.)

        • Communication

        • I/O

      • Activity view

      • Pairwise communication statistics

      • Text reports

      • Source code mapping

      Apprentice2 (invoked with app2) takes as input an XML file. The input file is generated as follows:

      module load apprentice2
      pat_report -c records -f ap2 <perf.file>.xf > <perf.file>.ap2
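When several data files need converting, the output name can be derived from the input name. A sketch, assuming the redirect form of the pat_report command shown above: `xf_to_ap2` is a hypothetical helper, `perf.xf` is a made-up file name, and the command is only printed, not executed.

```shell
# Map a .xf data file name to the matching .ap2 name: perf.xf -> perf.ap2
xf_to_ap2() {
    echo "${1%.xf}.ap2"
}

XF="perf.xf"                 # hypothetical data file name
AP2="$(xf_to_ap2 "$XF")"     # same base name with the .ap2 suffix
echo "pat_report -c records -f ap2 $XF > $AP2"
```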

      Visualization is possible with both profiles and trace files, but Apprentice2 has less functionality with profiles. The following features are supported for profiles (run-time summaries):

      • Call graph view

      • Function statistics overview

      • Function report

      • Processing element (PE) breakdown

      • General information


      Hardware Performance Counters

      pat_hwpc

      pat_hwpc collects hardware performance counter information for an application. No instrumentation is required. Usage is as follows:

        pat_hwpc [options] yod <executable>

      pat_hwpc accepts various hardware counter groups and produces a report with raw counts and derived metrics for the whole execution. The hardware counters are summed across all threads in each process. See the pat_hwpc man page for details.