FAQ

General Frequently Asked Questions

Table of Contents

Compiling/Linking

Running Jobs

Lustre File System

Runtime Messages/Errors

Miscellaneous

Compiling/Linking

Why does my compile fail with “/usr/bin/ld: cannot find -lsma”?

This error message occurs when using the mpi* compiler wrappers (mpicc, mpif90, etc.). These are intermediate wrappers that should not be called directly by users. Instead, users should compile with either ftn, cc, or CC. The ftn, cc, and CC scripts will do the necessary setup and then automatically call the appropriate intermediate scripts and ultimately the compilers.

Why does my compile fail with the message “relocation truncated to fit: R_X86_64_PC32”?

The default memory model for the PGI compilers is the “small” model. This requires that the object be smaller than 2 GB in size. The PGI compilers support the “medium” memory model, which allows objects to be larger than 2 GB. Unfortunately, for a code to use the medium memory model, all objects and static libraries must be compiled under the medium memory model. Several system libraries are not, so in general, executables on Jaguar must use the small memory model.

The “relocation truncated” error message occurs when an object file or executable is too large for the memory model. To work around this error, you should reduce the static memory usage for your code. Common ways to do this include the following:

  • Remove (either by deleting or via compiler directives) subroutines that are not used on the XT platform.
  • Remove static variables (especially large arrays) that are not used on the XT platform.
  • Use allocatable arrays instead of static arrays. Because the memory model applies to only static size, allocatable arrays can be larger than 2 GB with the small memory model.

This limitation is typically not a problem for programs that will run in dual-core mode because each core has only 2 GB of memory. However, if you plan to run in single-core mode and use the entire 4 GB of available memory, you will need to ensure the static size of your executable is less than 2 GB.
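
The same idea applies to C codes, where heap allocation plays the role of Fortran allocatable arrays. The following is only an illustrative sketch: the helper names alloc_work and sum_work and the array size in the comment are hypothetical and do not come from any Jaguar code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Before (counts toward the static size limited by the small memory model):
 *     static double work[400000000];      roughly 3.2 GB of static data
 * After: request the same storage from the heap at run time, which does
 * not add to the static size of the executable. */

/* Allocate n zero-initialized doubles on the heap. Returns NULL on failure.
 * alloc_work is a hypothetical helper name used only for illustration. */
double *alloc_work(size_t n)
{
    double *work;

    assert(n > 0);
    work = calloc(n, sizeof *work);
    if (work == NULL)
        fprintf(stderr, "could not allocate %lu doubles\n", (unsigned long)n);
    return work;
}

/* Example consumer of the heap array: sum its elements. */
double sum_work(const double *work, size_t n)
{
    double total = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        total += work[i];
    return total;
}
```

A caller would replace the static declaration with double *work = alloc_work(n), check the result for NULL, and call free(work) when the array is no longer needed.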

How do I link a C program that calls Fortran routines?

Use the pgf90 compiler to link and provide the -Mnomain option.

What does “multiple definition of main” and/or “undefined reference to MAIN_” mean?

This most likely means you have a C program that calls Fortran, and you are linking with the Portland Group Fortran compiler. The Fortran compiler has its own default “main,” and now there is a second main from the C source. You just need to add the -Mnomain flag during link time to fix this.

What do I do with “configure: error: linking to Fortran libraries from C fails”?

That message sometimes comes as a result of using configure on the XT3 with the FC=ftn and CC=cc compilers. The error usually shows up in the configure log with the following output:

checking how to get verbose linking output from ftn... -v
checking for Fortran libraries of ftn...  -L/opt/acml/2.7/pgi64/lib/cray/cnos64 -llapacktimers -L/opt/xt-mpt/1.3.15/mpich2-64/P2/lib -L/opt/acml/2.7/pgi64/lib -L/opt/xt-libsci/1.3.15/pgi/cnos64/lib -L/opt/xt-mpt/1.3.15/sma/lib -L/opt/xt-tools/papi/3.2.1/lib/cnos64 -lpapi -lperfctr -L/opt/xt-lustre-ss/1.3.15/lib64 -L/opt/xt-catamount/1.3.15/lib/cnos64 -L/opt/xt-pe/1.3.15/lib/cnos64 -L/opt/xt-libc/1.3.15/amd64/lib -L/opt/xt-os/1.3.15/lib/cnos64 -L/opt/xt-service/1.3.15/lib/cnos64 -L/opt/pgi/6.1.1/linux86-64/6.1/lib -L/opt/gcc/3.2.3/lib/gcc-lib/x86_64-suse-linux/3.2.3/ -lacml -lmpichf90 -lsci -lmpich -llustre -lpgf90 -lpgf90_rpm1 -lpgf902 -lpgf90rtl -lpgftnrtl -lpgc -lm -lcatamount -lsysio -lportals -lC -lcrtend' -lcrtend
checking for dummy main to link with Fortran libraries... unknown
configure: error: linking to Fortran libraries from C fails
See 'config.log' for more details.

If you look at the end of the Fortran libraries line, you will see “-lcrtend' -lcrtend”. There is an extra apostrophe (') after the first -lcrtend. To get around this, you can usually specify this long line of Fortran libraries in an environment variable like FLIBS or FCLIBS, with the extra apostrophe and the extra “-lcrtend” removed.

My code compiles without any trouble, but fails in the link step.

Internally, the compilers use several variables/macros even if they’re not specified on the command line. These include F90FLAGS, FFLAGS, CFLAGS, and others. If your makefile defines these variables with flags not intended for the link step, the link may fail. For example, if they contain the -c flag, which tells the compiler to skip the link step, the link will fail.

Can I use the 1.5 programming environments on the CNL system?

The 1.5 programming environments are available on the CNL system. However, they will build for Catamount and should not be used on the CNL system. The 2. and greater programming environment versions should be used on the CNL system.

How do I link a C++ object with ftn? It worked on the Catamount system without modification.

Under the 1.5 programming environments used under Catamount, ftn linked in libC.a. Under the 2. programming environments used under CNL, ftn does not link in libC.a. Fortran codes that link in libraries that contain C++ objects will need to add -lC to the link line.

Note that libc.a (the C library, lowercase c) is still added to the link under 2. as it was under 1.5. Adding -lc to the link will result in multiple definition warnings.

Why do I see the message: SEEK_SET is #defined but must not be for the C++ binding of MPI?

The following error message:


#error "SEEK_SET is #defined but must not be for the C++ binding of MPI"


is the result of a name conflict between stdio.h and the MPI C++ binding. To avoid it, place the mpi.h include before the stdio.h and iostream includes.


Users may also see the following error messages as a result of including stdio or iostream before mpi:


#error "SEEK_CUR is #defined but must not be for the C++ binding of MPI"


#error "SEEK_END is #defined but must not be for the C++ binding of MPI"

Running Jobs

How do I find out what nodes I am using?

There are a couple of easy ways to find out what nodes are assigned to your batch job. The easiest is to issue checkjob <jobid>. Part of the output will return a list of nodes like the following:

Allocated Nodes:      

[84:1][85:1][86:1][87:1][88:1][89:1][90:1][91:1]
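
If you need the node numbers inside a program, the bracketed list is easy to parse. The sketch below assumes that every entry has the form [node:tasks], as in the sample output above; parse_allocated_nodes is a hypothetical helper and is not part of any tool installed on the system.

```c
#include <assert.h>
#include <stdio.h>

/* Parse a list such as "[84:1][85:1]" into parallel arrays of node numbers
 * and task counts. Returns the number of entries stored (at most
 * max_entries). Parsing stops at the first entry that does not match. */
int parse_allocated_nodes(const char *list, int *nodes, int *tasks,
                          int max_entries)
{
    int count = 0;

    assert(list != NULL && nodes != NULL && tasks != NULL);

    while (count < max_entries) {
        int node, ntasks;
        int consumed = 0;   /* stays 0 unless the closing ']' was matched */

        if (sscanf(list, " [%d:%d]%n", &node, &ntasks, &consumed) != 2
            || consumed == 0)
            break;

        nodes[count] = node;
        tasks[count] = ntasks;
        count++;
        list += consumed;   /* advance past the entry just parsed */
    }
    return count;
}
```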

Another way to find out what nodes your batch job has is to run the nodeinfo tool that we have installed. This can only be run inside a batch job. Just add the following line to your batch script before the execution step:

/opt/public/bin/nodeinfo-cnl

This tool will return a list of nodes (one per line) as well as statistics about each node, as follows:

PE  Node  Processor    CPU Speed  Rev  Cores  Mem Size  Mem Speed          Seastar  Speed
 0    84  Opteron 285  2.6 GHz    E        2  4096 MB   DDR-400 5980 MB/s  SS1      1109 MB/s
 1    85  Opteron 285  2.6 GHz    E        2  4096 MB   DDR-400 5982 MB/s  SS1      1109 MB/s
 2    86  Opteron 285  2.6 GHz    E        2  4096 MB   DDR-400 5982 MB/s  SS1      1109 MB/s
 3    87  Opteron 285  2.6 GHz    E        2  4096 MB   DDR-400 5983 MB/s  SS1      1109 MB/s
 4    88  Opteron 285  2.6 GHz    E        2  4096 MB   DDR-400 5982 MB/s  SS1      1109 MB/s
 5    89  Opteron 285  2.6 GHz    E        2  4096 MB   DDR-400 5979 MB/s  SS1      1109 MB/s
 6    90  Opteron 285  2.6 GHz    E        2  4096 MB   DDR-400 5982 MB/s  SS1      1109 MB/s
 7    91  Opteron 285  2.6 GHz    E        2  4096 MB   DDR-400 5982 MB/s  SS1      1109 MB/s

The above two methods return the same logical numbering of nodes. A physical numbering of the nodes as well as the pid layout can be obtained by setting the PMI_DEBUG variable to 1.

> setenv PMI_DEBUG 1
> aprun -n4 ./a.out
Detected aprun CNOS interface
MPI rank order: Using default aprun rank ordering
rank 0 is on nid00015 pid 76; originally was on nid00015 pid 76
rank 1 is on nid00015 pid 77; originally was on nid00015 pid 77
rank 2 is on nid00016 pid 69; originally was on nid00016 pid 69
rank 3 is on nid00016 pid 70; originally was on nid00016 pid 70

From within your code, you can call PMI_CNOS_Get_nid to get the physical node number for each process.

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int rank, nproc, nid;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);

  /* Look up the physical node ID (nid) on which this rank is running. */
  PMI_CNOS_Get_nid(rank, &nid);
  printf("  Rank: %10d  NID: %10d  Total: %10d \n", rank, nid, nproc);

  MPI_Finalize();
  return 0;
}

The output with four cores would be as follows:

aprun -n4 ./hello-mpi.x
  Rank:          1  NID:         15  Total:          4
  Rank:          0  NID:         15  Total:          4
  Rank:          2  NID:         16  Total:          4
  Rank:          3  NID:         16  Total:          4
Application 13390 resources: utime 0, stime 0

The aprun -q option can be used to run commands outside of a code as shown below.

> aprun -q -n4 /bin/hostname
nid00015
nid00015
nid00016
nid00016
>

Or

> aprun -q -n4 /bin/cat /proc/cray_xt/nid
15
15
16
16
>

Why do I get the error “qsub: Job exceeds queue resource limits MSG=cannot satisfy server max mem requirement” when submitting a job?

The queuing system on the XT4 does not allow memory requests with the #PBS -lmem= flag. Jobs requesting memory will be rejected with the error message shown above.

Memory on the XT4 is not shared between nodes. When running in virtual node mode (dual-core mode), each task has access to 2 GB of memory. In single node (single-core) mode, each task can access 4 GB of memory. Thus, memory is directly related to the number of processors requested. Because the memory is not shared, it does not make sense to request memory directly via PBS. (It is implicitly requested based on the #PBS -lsize=... request.)

Can I run size=0 jobs?

Yes, size=0 jobs are supported. These jobs are a good way to automate data transfers to HPSS. The hsi command runs on a service node. So, if you use hsi at the conclusion of a production run, all of the compute nodes your job was allocated remain idle. As an alternative, you can submit a production job, and then submit a second ‘data transfer’ job. This second job should be submitted with a dependency on the first job so that it will not start until the first job finishes. Additionally, it should be submitted with a size argument of 0. Since hsi runs on service nodes, it does not require any compute node (thus, size=0).

NOTE: Jobs requesting size=0 should not use the PBS feature option. This creates a dependency condition that the system can’t satisfy (the system can’t allocate 0 compute nodes and compute nodes with a certain feature string). Thus, a job submitted with #PBS -l size=0,feature=800 will remain in a queued state indefinitely.

Lustre File System

How is striping set up in Lustre?

The lfs command can be used to determine the Lustre file system setup. Note that each file and directory can have its own striping pattern. This means that users can set striping patterns for their own files and/or directories. The default stripe width after the July 2006 hardware upgrade is 4.

The following command reports the striping information for a directory or file:

lfs find -v <directory/file>

If the command returns “has no stripe info”, the directory/file is set to not stripe; in other words, the stripe width is 1.

How do I change the striping in Lustre?

A user can change the striping settings for a file or directory in Lustre by using the lfs command. More specifically, one would use lfs setstripe <directory> <options>. Note that if you change the settings for existing files, the file will get the new settings only if it is recreated. If you change the settings for an existing directory, you will need to copy the files elsewhere and then copy them back to inherit the new settings.

We believe that the best setting for a program in which each process writes out its own file(s) is

> lfs setstripe <directory> 0 -1 1

That is, do not use striping. Then we see that

> lfs find -v testdirectory
OBDS:
0: ost1_UUID ACTIVE
1: ost2_UUID ACTIVE
2: ost3_UUID ACTIVE
3: ost4_UUID ACTIVE
4: ost5_UUID ACTIVE
5: ost6_UUID ACTIVE
6: ost7_UUID ACTIVE
7: ost8_UUID ACTIVE
8: ost9_UUID ACTIVE
9: ost10_UUID ACTIVE
10: ost11_UUID ACTIVE
11: ost12_UUID ACTIVE
12: ost13_UUID ACTIVE
13: ost14_UUID ACTIVE
14: ost15_UUID ACTIVE
15: ost16_UUID ACTIVE
testdirectory/
default stripe_count: 1 stripe_size: 0 stripe_offset: -1

This shows we have a stripe count of 1 (no striping), the stripe size is set to 0 (which means use the default), and the stripe offset is set to -1 (which means to round-robin the files across the OSTs). You should always use -1 for stripe_offset.

The stripe count and stripe size are something you can tweak for performance.
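
To see what these two settings control, the sketch below computes which of a file's stripes holds a given byte offset. It assumes the usual round-robin layout, in which consecutive chunks of stripe_size bytes are placed on consecutive stripes; stripe_index is a hypothetical helper for illustration only and is not part of lfs or of any Lustre API.

```c
#include <assert.h>

/* Return the index (0 .. stripe_count-1) of the stripe that holds byte
 * `offset` of a file, assuming chunks of stripe_size bytes are dealt out
 * round-robin across stripe_count stripes. Pass the actual stripe size in
 * bytes here, not the 0 that lfs uses to mean "default". */
unsigned long stripe_index(unsigned long long offset,
                           unsigned long stripe_size,
                           unsigned long stripe_count)
{
    assert(stripe_size > 0 && stripe_count > 0);
    return (unsigned long)((offset / stripe_size) % stripe_count);
}
```

With a stripe count of 1, as recommended above for programs in which each process writes its own file, every offset maps to index 0, so each file lives entirely on a single OST.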

Runtime Messages/Errors

What does “MPIDI_PORTALSU_REQUEST_FDU_OR_AEP: DROPPED EVENT ON UNEXPECTED RECEIVE QUEUE” mean?

A flow control mechanism can be enabled by setting

MPICH_PTL_SEND_CREDITS=-1

See the mpi_intro man page for details.

For best performance, the number of event queue entries for the MPI unexpected receive queue should be set as high as possible.

MPICH_PTL_UNEX_EVENTS=80000

Note that this fix does not address unexpected message buffer exhaustion. Thus, the user may still need to adjust MPICH_MAX_SHORT_MSG_SIZE or MPICH_UNEX_BUFFER_SIZE if this buffering overflows.

I get the runtime error MPI has run out of PER_PROC Message Packets.

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
*** MPI has run out of PER_PROC message packets.
*** The current allocation levels are:
***     MPI_MSGS_PER_PROC = 16384
_pmii_daemon(SIGCHLD): PE 1 exit signal Aborted
[NID 2987]Apid 566145: initiated application termination

Even though the message refers to MPI_MSGS_PER_PROC, the variable you need to increase is MPICH_MSGS_PER_PROC. Set it to a number greater than the number of cores requested by the job. MPICH_MSGS_PER_PROC “specifies the maximum number of internal message headers that can be allocated by MPI.” The default value is 16,384.

Why do I see a no space left on device error?

A no space left on device error will be returned during file I/O if one of the file’s associated OSTs becomes 100% utilized. An OST may become 100% utilized even if there is space available on the filesystem.

You can see a file or directory’s associated OST(s) with “lfs getstripe”. “lfs df” can be used to see the usage on each OST.
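
Inside an application, it can help to report this condition explicitly so that a full OST is not mistaken for a full file system. The sketch below assumes a POSIX system on which a failed write sets errno to ENOSPC; checked_write is a hypothetical helper name, and the hint it prints simply repeats the lfs df advice above.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Write len bytes from buf to stream and flush them.
 * Returns 0 on success and -1 on failure. On ENOSPC it prints a hint,
 * because on Lustre a single full OST is enough to trigger the error. */
int checked_write(FILE *stream, const void *buf, size_t len)
{
    assert(stream != NULL && buf != NULL);

    errno = 0;
    if (fwrite(buf, 1, len, stream) != len || fflush(stream) != 0) {
        if (errno == ENOSPC)
            fprintf(stderr, "write failed: no space left on device; "
                            "an OST may be full, check with lfs df\n");
        else
            fprintf(stderr, "write failed: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}
```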

Miscellaneous

What “endian”ness is the XT3 and XT4? Is there any way to affect it?

The Cray XT3 and XT4 are little-endian. There is a compiler switch, -Mbyteswapio, that makes the default Fortran unformatted I/O big-endian (for both reads and writes).

Note that this little-endian-to-big-endian conversion feature is intended for Fortran unformatted I/O operations. It enables the development and processing of files with big-endian data organization. The feature also enables processing of the files developed on processors that generate big-endian data (such as IBM, Cray X1, Sun).
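
Because -Mbyteswapio is intended for Fortran unformatted I/O, a C code that reads big-endian data has to do the conversion itself. The following is a minimal sketch of that conversion, assuming the data are 32-bit unsigned integers; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 on a little-endian host (such as the XT3/XT4), 0 otherwise. */
int is_little_endian(void)
{
    const uint16_t probe = 1;

    /* On a little-endian host the low-order byte is stored first. */
    return *(const unsigned char *)&probe == 1;
}

/* Reverse the byte order of one 32-bit value. */
uint32_t swap_u32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Convert n big-endian 32-bit values (for example, as read from a file)
 * to host byte order in place. Does nothing on a big-endian host. */
void big_endian_to_host_u32(uint32_t *values, size_t n)
{
    size_t i;

    assert(values != NULL || n == 0);
    if (!is_little_endian())
        return;
    for (i = 0; i < n; i++)
        values[i] = swap_u32(values[i]);
}
```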

How can I check memory usage for my application on the XT3?

If you don’t use allocatable memory, size executable_name is a reliable way to check the memory usage of your application.

Heap usage can be checked with the UNICOS/lc system call heap_info. An example of usage in C would be as follows:

#include <stdio.h>
#include <catamount/catmalloc.h>

void mem_check ()
{
  size_t fragments;
  unsigned long total_free, largest_free, total_used;

  if (heap_info(&fragments, &total_free, &largest_free, &total_used) == 0) {
    printf("heap_info fragments=%lu total_free=%lu largest_free=%lu total_used=%lu\n",
           (unsigned long) fragments, total_free, largest_free, total_used);
  } else {
    printf("non zero return code from heap_info\n");
  }
  return;
}

An example of usage in Fortran would be as follows:

      program heap
      integer i
      integer*4 fragments
      integer*8 total_free, largest_free, total_used
      integer heap_info
      i = heap_info(fragments, total_free, largest_free, total_used)
      write(0,*) 'heap_info fragments =',fragments,' total_free = ',
     1total_free,' largest_free = ',largest_free,' total_used = ',
     2total_used,' i = ',i
      stop
      end

(Both these examples can be found on the man page for heap_info).

Interrogating stack usage is a bit more involved.

#include <stdlib.h>
#include <qk/types.h>
#include <qk/process_pcb_type.h>

PROCESS_PCB_TYPE *_my_pcb;

/* Read the current value of the stack pointer register. */
inline ADDR_LEN get_stack_pointer() {
  ADDR_LEN sp;
  asm("mov %%rsp,%0" : "=m" (sp));
  return sp;
}

/* Returns the free space on the stack after allocating n more bytes.
 * If this overflows, aborts instead of returning. */
unsigned check_stack( int n ) {
#define NN (int)((get_stack_pointer() - _my_pcb->stack_base) - (n+16))
  if ( NN >= 0 ) return NN;
  abort();
}

What profiling tools are available?

At least three profiling tools are available on Jaguar.

  1. CrayPat is provided by Cray.
  2. fpmpi is an unsupported product that can provide a very concise profile of MPI routines in an application. To use it, simply load the fpmpi (or fpmpi_papi) module and relink. Then rerun your application. There are a few environment variables to control profiling output:
    • MPI_PROFILE_DISABLE : Disables statistic collection until fpmpi_enable is called (#include fpmpi.h).
    • MPI_PROFILE_SUMMARY : Setting this variable disables creation of individual MPI process statistics files. Set it when running with thousands of processes.
    • MPI_PROFILE_FILE : Name of process statistic file; default is profile.txt.
    • MPI_HWPC_COUNTERS : List of events or event set number as in libhwpc.
  3. A third tool that is unsupported is TAU. TAU (Tuning and Analysis Utilities) is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, and Python. Basic profiling with TAU can be done in the following steps:
    1. In your makefile or configuration script, set TAUROOTDIR=/apps/TAU/prod/jaguar.
    2. Then add the line include $(TAUROOTDIR)/lib/Makefile.tau-pdt-pgi.
    3. Build your code using the modified makefile or configuration script. Contact help@nccs.gov if there are any error messages at this stage. TAU will revert to the normal build process if the automatic instrumentation is unsuccessful (i.e., you won’t get an instrumented file).
    4. You will get a regular executable. Submit your job as usual.
    5. After execution, there should be a profile.xxx text file.

TAU can also do MPI profiling and collect hardware performance counter data.

How do I get performance counter data for my program?

Use the following process:

  1. Use module load craypat.
  2. Compile code.
    1. If Fortran90 with modules, compile with -Mprof=func.
  3. Run pat_build -u -g mpi a.out.
  4. Run a.out+pat as you would a.out, BUT make sure PAT_RT_HWPC is set to 1 in batch script.
    1. If you want just a regular profile, don’t set PAT_RT_HWPC.
  5. Run pat_report <dir>/*.xf, where <dir> is automatically generated by instrumented code.

The resulting output will have performance counter results for the entire run AND for each subroutine.

Where can I find documentation on MPI environment variables?

You can find current information on MPI environment variables from the mpi_intro man page.

Where can I find more information?

If you haven’t already, please check the other Jaguar resource pages on compiling, file systems, batch jobs, open issues, parallel I/O tips, the CrayPAT overview, and other reports and presentations related to Jaguar.

Another good resource (without Jaguar-specific information) is the documentation that Cray provides at CrayDocs.