Cray XT3™ Systems Software Release Overview S–2425–12




Cray XT3™ Systems Software Release Overview S–2425–12 © 2005 Cray Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc. U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable. Autotasking, Cray, Cray Channels, Cray Y-MP, GigaRing, LibSci, UNICOS and UNICOS/mk are federally registered trademarks and Active Manager, CCI, CCMT, CF77, CF90, CFT, CFT2, CFT77, ConCurrent Maintenance Tools, COS, Cray Ada, Cray Animation Theater, Cray APP, Cray Apprentice2, Cray C++ Compiling System, Cray C90, Cray C90D, Cray CF90, Cray EL, Cray Fortran Compiler, Cray J90, Cray J90se, Cray J916, Cray J932, Cray MTA, Cray MTA-2, Cray MTX, Cray NQS, Cray Research, Cray SeaStar, Cray S-MP, Cray SHMEM, Cray SSD-T90, Cray SuperCluster, Cray SV1, Cray SV1ex, Cray SX-5, Cray SX-6, Cray T3D, Cray T3D MC, Cray T3D MCA, Cray T3D SC, Cray T3E, Cray T90, Cray T916, Cray T932, Cray UNICOS, Cray X1, Cray X1E, Cray XD1, Cray X-MP, Cray XMS, Cray XT3, Cray Y-MP EL, Cray-1, Cray-2, Cray-3, CrayDoc, CrayLink, Cray-MP, CrayPacs, Cray/REELlibrarian, CraySoft, CrayTutor, CRInform, CRI/TurboKiva, CSIM, CVT, Delivering the power..., Dgauss, Docview, EMDS, HEXAR, HSX, IOS, ISP/Superlink, MPP Apprentice, ND Series Network Disk Array, Network Queuing Environment, Network 
Queuing Tools, OLNET, RapidArray, RQS, SEGLDR, SMARTE, SSD, SUPERLINK, System Maintenance and Remote Testing Environment, Trusted UNICOS, TurboKiva, UNICOS MAX, UNICOS/lc, and UNICOS/mp are trademarks of Cray Inc.

AMD is a trademark of Advanced Micro Devices, Inc. Copyrighted works of Sandia National Laboratories include: Catamount/QK, Compute Processor Allocator (CPA), and xtshowmesh. DDN is a trademark of DataDirect Networks. Linux is a trademark of Linus Torvalds. Lustre was developed and is maintained by Cluster File Systems, Inc. under the GNU General Public License. MySQL is a trademark of MySQL AB. PBS Pro is a trademark of Altair Grid Technologies. SUSE is a trademark of SUSE LINUX Products GmbH, a Novell business. The Portland Group and PGI are trademarks of STMicroelectronics. TotalView is a trademark of Etnus, LLC. UNIX, the X device, X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners.

Contents

Introduction [1]
    Emphasis for the Cray XT3 Systems Software Releases
    Configurations Supported
    Supported Upgrade Path
    Conventions
    Reader Comments

Software Enhancements [2]
    Operating System
    Programming Environment
    Input/Output System
    CRMS
    Optional Products
    Added Commands and Functions
    Enhanced Commands and Functions
    Fixed Cray XT3 1.1 Limitations
        Group ID Not Yet Supported From Trusted Location for I/O Libraries Access
        PBS Pro Attempts to Run Old Jobs After a Reboot
        PBS Pro May Not Run Jobs Although Batch Compute Nodes are Available
        xtnid2str Utility Not Included in Release
        Default Linking to ACML and Cray LibSci May Not link to ACML LAPACK Routines
        libsysio Must Include Path to the Executables Directory
        I/O Locking When Catamount Application Terminates Abnormally
        Application Compiled with Lustre Does Not Produce Remaining Application Output
        Cray SHMEM Use Limited to Memory Regions of <= 2 GB
        Workaround for Unexpected EOF in CrayPat Data Files
        Lustre Callback Getting Freed Buffer
    Fixed Critical and Urgent SPRs
    SUSE LINUX Security Advisory Notes

Compatibilities and Differences [3]
    Users Must Recompile Applications
    Object and Module File Incompatibility
    Integer Variable Incompatibility
    Restrictions on Large Data Objects
    Restriction on Portals Message Size
    Fortran Module for Cray SHMEM not Supported
    PGI -Mipa Option May Cause Node to Hang
    Maximum Number of Open Files is Limited
    Lustre Network Protocol Incompatibility
    Using the kill Command to Terminate yod
    xtlcbsnap to be Replaced by xtnetwatch
    Possible Need to Check ext2 File System

Documentation [4]
    CrayDoc Documentation Delivery System
    Accessing Product Documentation
    Books Provided with This Release
    Third-party Books Provided with This Release
    Man Pages Provided with This Release
    Third-party Man Pages Provided with This Release
    Additional Documentation Resources
    TotalView Documentation from Etnus, LLC
    Cray Glossary

Release Package [5]
    Hardware and Software Requirements
    Optional Products Supported
    TotalView from Etnus, LLC
    Contents of the Release Package
    Licensing
    Ordering Software

Customer Services [6]
    Technical Assistance with Software Problems

Index

Tables
    Table 1. Added Commands and Functions
    Table 2. Enhanced Commands and Functions
    Table 3. Fixed Critical and Urgent SPRs
    Table 4. Books Provided with This Release
    Table 5. PBS Pro Books Provided with This Release
    Table 6. Third-party Books Provided with This Release
    Table 7. Additional Documentation Resources
Introduction [1]

This document is intended to give application programmers and system administrators an overview of the Cray XT3 systems 1.2 software releases.

1.1 Emphasis for the Cray XT3 Systems Software Releases

The key reasons for this release are:

• Substantial improvements in Portals latency (see Section 2.3)
• Support of Lustre parallel file systems for service and compute nodes (see Section 2.3)
• Enhanced version checking (see Section 2.1)
• Portals and SeaStar link monitoring (see Section 2.4)
• Etnus support of the TotalView 7.0 debugger at larger scale (up to 360 nodes) (see Section 2.1)
• Added commands and functions (see Section 2.6)
• Enhanced commands and functions (see Section 2.7)
• Fixed SPRs (see Section 2.9)

1.2 Configurations Supported

The Cray XT3 1.2 release supports the following:

• Systems with up to 50 cabinets.
• Applications running on up to 3,000 nodes, with some limitations starting at 256 nodes.
• Applications with a memory size of up to 4 GB minus system memory.
• System memory size of up to 4 GB.
• TotalView running on a maximum of 360 nodes. TotalView requires one suitably configured service node for each 64 compute nodes to be debugged. Please contact your Cray Service Representative for assistance in configuring TotalView.
• Service nodes: 1 login node, 1 boot node, 1 database node, 1 syslog node, and 10 network nodes.
• Systems with 8 OSSs (object storage servers) if Lustre is configured.

1.3 Supported Upgrade Path

The supported upgrade path is from the Cray XT3 1.1 release to the Cray XT3 1.2 release.

1.4 Conventions

These conventions are used throughout Cray documentation:

Convention
    Meaning

command
    This fixed-space font denotes literal items, such as file names, pathnames, man page names, command names, and programming language elements.
variable
    Italic typeface indicates an element that you will replace with a specific value.
    For instance, you may replace filename with the name datafile in your program. It also denotes a word or concept being defined.
user input
    This bold, fixed-space font denotes literal items that the user enters in interactive sessions. Output is shown in nonbold, fixed-space font.
[]
    Brackets enclose optional portions of a syntax representation for a command, library routine, system call, and so on.
...
    Ellipses indicate that a preceding element can be repeated.
name(n)
    Denotes man pages that provide system and programming reference information. Each man page is referred to by its name followed by a section number in parentheses. Enter:

    % man man

    to see the meaning of each section number for your particular system.

1.5 Reader Comments

Contact us with any comments that will help us to improve the accuracy and usability of this document. Be sure to include the title and number of the document with your comments. We value your comments and will respond to them promptly. Contact us in any of the following ways:

E-mail: [email protected]
Telephone (inside U.S., Canada): 1–800–950–2729 (Cray Customer Support Center)
Telephone (outside U.S., Canada): +1–715–726–4993 (Cray Customer Support Center)
Mail: Software Publications, Cray Inc., 1340 Mendota Heights Road, Mendota Heights, MN 55120–1128 USA

Software Enhancements [2]

This chapter describes the enhancements that have been made with this release. For compatibility issues and differences that you should be aware of when installing this release or using these products, see Chapter 3.

The Cray XT3 Systems 1.2 Release Errata describes temporary limitations of this release and changes identified after the documentation for this release was packaged. The Cray XT3 System Overview gives a high-level description of Cray XT3 software and hardware components.
For a list of all documentation provided with the Cray XT3 systems 1.2 release package, see Chapter 4.

2.1 Operating System

The UNICOS/lc operating system supports:

• Enhanced version checking. In UNICOS/lc 1.1 and earlier releases, three components (yod, the process control thread (PCT), and an application) each had a release version string. If the release version strings were incompatible, a user attempting to build or run an application would get a Version does not match message. The solution was to recompile. In UNICOS/lc 1.2, the system has been enhanced to ensure that yod, PCT, and an application are compatible and will interact reliably. A protocol version string is encoded in each component. All protocol version strings will be the same unless the user has compiled an application and a new release subsequently is installed that uses a different protocol version. For further information, see Section 3.1, the Cray XT3 System Overview, and the Cray XT3 Programming Environment User's Guide.
• TotalView debugging at larger scale. The TotalView debugger from Etnus, LLC, provides source-level debugging of applications. TotalView can debug applications running on 1 to 360 compute nodes.

2.2 Programming Environment

The Cray XT3 Programming Environment supports:

• AMD Core Math Library version 2.6. New features in Version 2.6 include:
  – Better Level 3 BLAS performance with small problem sizes.
  – New fast scalar log10(), pow(), powf(), and vector log10() and powf() routines.
• PGI version 6.0-4

2.3 Input/Output System

The Cray XT3 I/O system supports:

• Portals optimization. The Cray XT3 Portals implementation has been optimized in the Cray XT3 1.2 release with enhancements in the network abstraction layer (NAL), enhancements in the driver layers, and a rewrite by Sandia National Laboratories of the SeaStar Portals firmware in C.
• Lustre parallel file systems for service and compute nodes.
  Administrators can set up Lustre as a parallel file system by configuring a metadata server (MDS) and multiple object storage servers (OSSs). A Lustre parallel file system is optimal for jobs requiring large sequential file access. File striping can be used to optimize storage access across the object storage targets (OSTs). For further information, see the Cray XT3 System Management manual and the Cray XT3 System Overview.
• Lustre no longer requires the administrator to start up OSTs and the MDS in sequence, nor does the administrator have to specify --failover in the configuration file. The MDS will wait until all OSTs have started before finishing its startup process.
• Increased default size of the Portals heap. In Cray XT3 1.1 systems, submitting a large number of concurrent jobs could cause Portals to run out of pre-allocated static heap space. If this occurred, Portals issued an Operation not permitted message and aborted the executable. The Cray XT3 1.2 release adds a new method of sizing the Portals private heap and removes the hard-coded default value of 60 MB.

  When a login node is booted, the system uses the following formula for setting the Portals private heap size:

  phs = 100 * (login_mem_gb/2GB)

  where phs is the heap size in megabytes and login_mem_gb is the amount of memory in gigabytes on the login node being booted. If, for example, the login node has 4 GB of memory, the Portals heap size on the node is 200 MB.

2.4 CRMS

The Cray RAS and Management System (CRMS) supports:

• Portals and SeaStar link monitoring. CRMS has been enhanced to automatically and continuously monitor Portals and SeaStar links. Whenever the System Interconnection Network is running, L0-based router daemons send event messages containing link control block (LCB) and router errors to the /opt/craylog/eventlog file. For further information, see the Cray XT3 System Management manual.
• Enhanced system boot and dump analysis. See Section 2.6 and Section 2.7 for further information.
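
The Portals private heap sizing formula given in Section 2.3 can be checked with a few lines of Python. This is only an illustrative sketch: the function name portals_heap_mb is hypothetical, and it assumes the result is expressed in megabytes with no rounding applied, which the text implies through its 4 GB example but does not state outright.

```python
def portals_heap_mb(login_mem_gb: float) -> float:
    """Return the Portals private heap size in MB for a login node.

    Implements phs = 100 * (login_mem_gb / 2 GB) from Section 2.3.
    Assumption: the result is in megabytes and is not rounded.
    """
    if login_mem_gb <= 0:
        raise ValueError("login node memory must be positive")
    return 100 * (login_mem_gb / 2)


# A login node with 4 GB of memory, as in the example in the text.
print(portals_heap_mb(4))  # prints 200.0
```

Under the same assumptions, a login node with 2 GB of memory would get a 100 MB heap, which is still larger than the 60 MB value that was hard-coded in the 1.1 release.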
2.5 Optional Products

The Cray XT3 system supports the following optional products:

• CrayPat 1.1.0 performance analysis tool (available from Cray)
• Cray Apprentice2 2.3.1 performance data visualization tool (available from Cray)
• Etnus TotalView (available from Etnus, LLC)
• PBS Pro 5.3.2xt (available from Cray)

2.6 Added Commands and Functions

The following commands and functions were added with this release.

Table 1. Added Commands and Functions

Command or Function
    Description

shmem_put_nb(3)
    The shmem_put_nb(3) nonblocking put functions return immediately, possibly before the data has been copied out of the source array on the local processing element.
xtchecklink(8)
    The xtchecklink(8) command checks symbolic links for nonexistent files.
xtdumpsys(8)
    The xtdumpsys(8) command gathers administrator-specified types of information when a system is hung or crashes and writes the data to a time-stamped directory in /opt/craydump. Inside that directory are all the files associated with that xtdumpsys invocation. Administrators can create plug-ins for analysis of this data.
xtmemwatch(8)
    The xtmemwatch(8) command watches memory locations change. Options enable the operator to specify the address in memory to watch and the ID of a node or SeaStar chip whose memory is to be watched.
xtnetwatch(8)
    The xtnetwatch(8) command tells the router daemons to sample for LCB and router errors. xtnetwatch then listens for events sent by the daemons and displays the information in an operator-specified format.
xtnidname(8)
    The xtnidname(8) command converts between node names and node IDs (NIDs).
xtopteronmca(8)
    The xtopteronmca(8) command decodes Opteron machine check architecture (MCA) error messages sent to the System Management Workstation (SMW) console.
xtrcaproxy(8)
    The xtrcaproxy(8) command allows the administrator access to L0 RCA channels from the SMW via a TCP socket. The command displays the number of the port it is listening on. The administrator can use this port number for connecting to the RCA channels with local clients such as telnet or gdb.
xtshowcabs(1)
    The xtshowcabs(1) command shows status information about compute and service nodes, organized by chassis (cage) and cabinet. The xtshowcabs and xtshowmesh commands are related. Use xtshowmesh on systems with topology class 0 or 4 and xtshowcabs on systems with topology class 1, 2, or 3.

2.7 Enhanced Commands and Functions

The following commands and functions were enhanced with this release.

Table 2. Enhanced Commands and Functions

Command or Function
    Description

e2fsck(8)
    The e2fsck(8) command checks Lustre ext3 backend file systems.
e2image(8)
    The e2image(8) command saves critical Lustre ext3 backend file system data.
pat_build(1)
    Added the -S file option, which enables or disables tracing symbols defined in file.
pdsh(1)
    Added the following options:
    • -w namelist runs the command on the specified nodes.
    • -x namelist excludes the specified nodes.
    Enhanced the following options:
    • -P uses the nodes that succeeded or failed during the previous command as the nodelist.
    • -R filename stores return codes in filename.
    • -L doesn't prepend nodename to stdout.
xtbootsys(8)
    Added the following options:
    • -r 'reason' specifies a reason for booting the system.
    • -s session-id specifies a session identifier from a previously started boot session. This allows the operator to keep all the data from a single boot session together.
    • -V displays version information.
xtcli(8)
    Added the -f failmode option, which the administrator can use to set the failure mode to one of the following values:
    • HALT: Halt on failure. Only the first error occurrence will be reported; execution will stop on error. This is the default.
    • CONT: Continue on failure. All error occurrences will be reported; execution will not stop on error.
    • LOOP: Loop on failure. On error occurrence, the failing behavior will be repeated in a loop. At a minimum, the failing test will loop. The -f LOOP option applies only to those errors that have this behavior defined and is intended for system debugging.
xtgenacct(8)
    The administrator can now generate accounting reports using the xtgenacct command from any node. Options enable the administrator to specify start and end times.
xtinitmodule(8)
    Added the following options:
    • The -b option initializes hardware for use with the xtibmbist command.
    • The -r routesfile option specifies the name of the generated routing information file.
    • The [svcid,svcid,...] option specifies a comma-separated list of service IDs. If service IDs are specified, then a node list file is generated; if not, then an administrator-created node list file is expected to already exist. Service IDs can designate entire systems, cabinets, chassis, or blades.
xtfwlog(8)
    Output of the Cray SeaStar firmware log has been enhanced. Fewer messages are displayed and the messages are more descriptive.

2.8 Fixed Cray XT3 1.1 Limitations

The following limitations and problems noted in the Cray XT3 1.1 release errata have been fixed with the 1.2 release.

2.8.1 Group ID Not Yet Supported From Trusted Location for I/O Libraries Access

I/O commands need to check the user identifier (UID) and group identifier (GID) of the caller to ensure the caller has permission to do the operation. The UID and GID need to come from a trusted location, such as the Portals header.
On Cray XT3 1.1 systems, only the UID was supported. On Cray XT3 1.2 systems, both the UID and GID are supported; the Lustre file system now acknowledges a user's secondary groups when determining file permissions.

Associated SPR: 732016

2.8.2 PBS Pro Attempts to Run Old Jobs After a Reboot

This limitation applies to Cray XT3 systems running the optional PBS Pro product. On Cray XT3 1.1 systems, when PBS Pro is not shut down properly and the system is rebooted, PBS Pro treats jobs that were running prior to the system being rebooted as if they were still running and does not let new jobs run. This limitation has been fixed for Cray XT3 1.2 systems. No workaround is required.

Associated SPR: 731827

2.8.3 PBS Pro May Not Run Jobs Although Batch Compute Nodes are Available

This limitation applies to Cray XT3 systems running the optional PBS Pro product. On Cray XT3 1.1 systems, PBS Pro may not run jobs when a login node hangs even though the xtshowmesh command shows available PBS Pro batch compute nodes. When this happens, the queue manager (qmgr) displays a negative count of resources_assigned.size, and the PBS Pro scheduler logs a message such as the following:

06/17/2005 14:54:05;0080; pbs_sched;Svr;WARNING;server's assigned nodes -2 is less than total running 16.

This limitation has been fixed for Cray XT3 1.2 systems. No workaround is required.

Associated SPR: 732257

2.8.4 xtnid2str Utility Not Included in Release

The xtnid2str utility, which converts NIDs to physical IDs, was not packaged with the Cray XT3 1.1 release; this utility is packaged with the Cray XT3 1.2 release.

2.8.5 Default Linking to ACML and Cray LibSci May Not link to ACML LAPACK Routines

Using the Cray XT3 1.1 release, users who call LAPACK routines may not be linking to the expected routines in ACML.
Some LAPACK routines were inadvertently duplicated in the Cray LibSci library and, because of the default ordering of libraries in the compiler scripts, these LibSci LAPACK routines are linked first. The ACML LAPACK routines are the preferred routines, because they have been tuned and tested for the Opteron processors. This limitation has been fixed with the Cray XT3 1.2 release.

Associated SPR: 732153

2.8.6 libsysio Must Include Path to the Executables Directory

Using the Cray XT3 1.1 release, at startup, optimizations in libsysio can cause an abort if the directory that the job is launched from is not included in the compute node's file system namespace. The abort can occur if /etc/sysio_init does not give access to the directory from which the job is launched, or if the directory it is launched from is specified in sysio_init to be Lustre hosted and the application is not linked with the Lustre libraries. This limitation has been fixed with the Cray XT3 1.2 release.

Associated SPR: 731570

2.8.7 I/O Locking When Catamount Application Terminates Abnormally

Using the Cray XT3 1.1 release, if a Catamount application terminates abnormally while doing I/O to files previously opened by the application, the next time you access those files (ls or otherwise), it will take 6 seconds per file per OST to time out the associated lock. While this lock freeing is occurring (each lock takes 6 seconds), no other user logged on to the system can use the Lustre file system, and it appears to users as if the Lustre file system has hung. For example, if you launched a program on 1000 nodes to write to the same file and the program aborted, it will take 6000 seconds (about 1.7 hours) to remove this one file, since the locks on this file held by all clients must time out. This limitation has been fixed with the Cray XT3 1.2 release.

Associated SPR: 732273

2.8.8 Application Compiled with Lustre Does Not Produce Remaining Application Output

Using the Cray XT3 1.1 release, when the Lustre file system is used, the file descriptors are shut down prior to fflush() being called in the application's exit() call. This results in final output in buffers not being written to their final locations. This limitation has been fixed with the Cray XT3 1.2 release.

Associated SPR: 732278

2.8.9 Cray SHMEM Use Limited to Memory Regions of <= 2 GB

Cray SHMEM can be used only on programs requesting <= 2 GB symmetric heap, <= 2 GB private heap, and <= 2 GB stack. Most applications are not expected to see the limitation.

Associated SPR: 732099

2.8.10 Workaround for Unexpected EOF in CrayPat Data Files

For programmers who use CrayPat, the CrayPat component pat_report issues the message Unexpected eof after N records for a data (.xf) file produced by a program that was linked to use the Lustre file system and then instrumented with pat_build. In the resulting report, the performance data will be missing or incomplete. The impact is that incomplete or no performance data is recorded by an instrumented program built from a program linked to use the Lustre file system. This limitation has been fixed with the Cray XT3 1.2 release.

Associated SPR: 732332

2.8.11 Lustre Callback Getting Freed Buffer

Using the Cray XT3 1.1 release, there is a known problem that can cause a Lustre file system to become inoperable.
The symptom is that a Lustre node runs into an assertion after some file activities, and the following message is displayed in syslog and on the system's console monitor:

LustreError: 590:0:(events.c:341:ptlrpc_master_callback()) ASSERTION(cbid->cbid_arg != LP_POISON) failed
LustreError: 590:0:(module.c:46:kportal_assertion_failed()) LBUG

Soon after the above message displays, the node will crash, and the Lustre file system involved with this node will no longer be functional. This limitation has been fixed with the Cray XT3 1.2 release.

Associated SPR: 732188

2.9 Fixed Critical and Urgent SPRs

The following customer-filed critical and urgent SPRs are closed with this release. For a list of all SPRs closed with this release, access the Cray SPR database.

Table 3. Fixed Critical and Urgent SPRs

SPR Number    Description
725116        I/O LIBRARIES MUST GET UID/GID FROM SECURE PLACE/PORTALS HEADER
730398        THE 3RD ARGUMENT TO SIGNAL HANDLER WHEN SA_SIGINFO IS SET DOESN'T CONFORM TO PSX
732045        "IOR" AND "IOW" FIELDS IN YOD_ACCOUNTING TABLE ARE ALWAYS HUGE
732320        XTGENACCT DOES NOT DISPLAY CORRECT SYSTEM AVAILABILITY NUMBERS
732463        L1SYD DOES NOT ALLOW FAN SPEED CHANGE, AND MODE CHANGE
732483        "REGISTER IOCTL TO USER-KERNEL CDEV FAILED:" IF MORE THAN 70 SINGLE PROC JOB RUN
732670        L1SYSD COMMAND DOESN'T SET FAN SPEED.
732751        HPCC FAILS UNDER UNICOS/LC 1.2.05
732832        "REGISTER IOCTL TO USER-KERNEL CDEV FAILED:" IF MORE THAN 70 SINGLE PROC JOB RUN
732936        PLEASE LOOK AT PSC DUMP 0509071010

2.10 SUSE LINUX Security Advisory Notes

There are no SUSE LINUX security advisories included with the base Cray XT3 1.2 release. Advisories will be addressed with Cray XT3 1.2 updates.

Compatibilities and Differences [3]

This chapter describes compatibility issues and functionality differences for this release.
For a description of Cray XT3 1.2 release temporary limitations and known software problems, see the Cray XT3 Systems 1.2 Release Errata.

3.1 Users Must Recompile Applications

Because of required changes made in the Cray XT3 1.2 release, users must recompile applications when moving from the Cray XT3 1.1 release to the Cray XT3 1.2 release or from any Cray XT3 1.2 prerelease version to the official Cray XT3 1.2 release.

Caution: Not recompiling an application will result in undefined behavior and may cause the application to fail or hang or may cause a Cray XT3 node to fail.

3.2 Object and Module File Incompatibility

Object and module files created using PGI 6.0 compilers are incompatible with object files from previous Cray XT3 software that uses earlier versions of the PGI compilers.

3.3 Integer Variable Incompatibility

The -i8 compiler option can make programs incompatible with Cray MPICH2 and ACML functions. Typically, the use of any INTEGER*8 array size argument can cause failures with these libraries.

3.4 Restrictions on Large Data Objects

The PGI compilers support data objects larger than 2 GB. However, the Cray XT3 1.2 Programming Environment has restrictions in this area. To operate on large data sets, an application must:

• Be compiled for the small memory model (this is the default)
• Limit static data (.text + .bss) sections to less than 2 GB
• Allocate data objects that are larger than 2 GB dynamically
• Be compiled with the -Mlarge_arrays option
• Restrict library accesses to objects less than 2 GB (that is, MPI, SHMEM, and Cray XT3 LibSci library calls must be on data objects less than 2 GB in size)

The Cray XT3 1.2 user level libraries are compiled in the small memory model format. For more information about memory models, see the PGI Server 6.0 and Workstation 6.0 Installation and Release Notes and the PGI User's Guide.
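
One common way to satisfy the last restriction in Section 3.4 is to process a large object in pieces that each stay under 2 GB. The following Python sketch shows only the bookkeeping; it is not taken from Cray documentation. The names TWO_GB and chunk_ranges are hypothetical, "2 GB" is assumed to mean 2**31 bytes, and in a real application each piece would be handed to an MPI, SHMEM, or LibSci call in C or Fortran rather than handled in Python.

```python
TWO_GB = 2**31  # assumption: "2 GB" in the restriction means 2**31 bytes


def chunk_ranges(total_bytes, max_chunk=TWO_GB - 1):
    """Yield (offset, length) pairs covering total_bytes.

    Every length is at most max_chunk, so with the default value each
    piece is strictly smaller than 2 GB.
    """
    if total_bytes < 0 or max_chunk <= 0:
        raise ValueError("sizes must be non-negative and max_chunk positive")
    offset = 0
    while offset < total_bytes:
        length = min(max_chunk, total_bytes - offset)
        yield offset, length
        offset += length


# Split a 5 GiB dynamically allocated object into library-sized pieces.
pieces = list(chunk_ranges(5 * 2**30))
for offset, length in pieces:
    print(offset, length)
```

The pieces are contiguous and non-overlapping, so a caller can issue one library call per (offset, length) pair and still cover the whole object.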
3.5 Restriction on Portals Message Size

A single Portals message cannot be longer than 2 GB.

3.6 Fortran Module for Cray SHMEM not Supported

The Fortran module for Cray SHMEM is not supported. Use the INCLUDE 'mpp/shmem.fh' statement instead. This requirement is also documented in the Cray XT3 Programming Environment User's Guide.

3.7 PGI -Mipa Option May Cause Node to Hang

In some cases, compiling very large programs with the -Mipa option may cause a PGI process to use all the available memory on the login node, creating an out-of-memory situation and causing the node to hang. The workaround is to remove the -Mipa option and recompile.

Associated SPR: 732505. This problem has been submitted to PGI; they are currently investigating the problem.

3.8 Maximum Number of Open Files is Limited

The Cray XT3 1.2 system limits the number of open unique files in any job that uses yod for application I/O. The number of unique files that yod can open is 1024, the Linux per-process open file descriptor limit.

3.9 Lustre Network Protocol Incompatibility

The network protocol for Lustre 1.2 is incompatible with previous versions of Lustre. See the Cray XT3 System Management manual for details on updating servers and clients to run the new version and updating the configuration logs.

3.10 Using the kill Command to Terminate yod

To end a job abnormally, do not use kill -9 yod_pid to terminate yod. The command kills yod, but yod does not finish cleanup activities, leaving orphan compute nodes. Use the following process for terminating a job:

1. If you are working in PBS Pro, use the qdel(1) command.
2. If you are working in interactive mode:
   a. Enter the kill -TERM command to allow yod and PCT to clean up after job termination (kill -TERM is equivalent to kill -15), or
   b. Enter a single Ctrl-c (equivalent to kill -15). Do not enter a second Ctrl-c; that is equivalent to kill -9.

See the kill(1) man page for details.
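
The difference between the two signals discussed in Section 3.10 can be illustrated with a short Python sketch. It is a hedged illustration rather than a Cray-provided procedure: the child process below is only a stand-in for yod (an assumption made so the example runs anywhere on a POSIX system), and on a real system the signal would be sent to the actual yod process ID.

```python
import signal
import subprocess
import sys

# Stand-in for a running yod process (assumption: any long-running child
# process is enough to show the signal handling; this is not yod itself).
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Send SIGTERM (signal 15), not SIGKILL (signal 9). SIGTERM can be caught,
# which is what gives a process such as yod the chance to clean up.
child.send_signal(signal.SIGTERM)
child.wait(timeout=10)

# On POSIX systems a negative return code is the number of the signal
# that terminated the child.
print("terminated by signal", -child.returncode)
```

SIGKILL cannot be caught or handled by the receiving process, which is why kill -9 leaves yod no opportunity to release its compute nodes.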
3.11 xtlcbsnap to be Replaced by xtnetwatch

The xtlcbsnap command takes a snapshot of the current LCB error logs and all router port error memory-mapped registers (MMRs). xtlcbsnap is supported in the Cray XT3 1.2 release but will be replaced by xtnetwatch in the Cray XT3 1.3 release. xtnetwatch watches the Cray XT3 system interconnection network for LCB and router errors.

3.12 Possible Need to Check ext2 File System

There is a suspected bug in the SUSE LINUX 2.4.21 ext3 file system that may require you to use the e2fsck command to check the file system. If the file system is not checked, Lustre may enter recovery mode, resulting in slow performance. Cluster File Systems reports that they have not seen this problem, so we expect the problem to be rare. It is suspected that the problem occurs when the system is repeatedly rebooted without following proper shutdown procedures. See the Cray XT3 System Management manual for a description of shutdown procedures.

Documentation [4]

This chapter describes the documentation that supports the Cray XT3 systems software releases.

4.1 CrayDoc Documentation Delivery System

The CrayDoc documentation delivery system, along with product documentation, is provided with each Cray software release. The CrayDoc software runs on any UNIX-based or UNIX-like operating system, including Mac OS X, Linux, and BSD, and anywhere else that Perl and Apache can be compiled from source code with freely available (GNU) tools. The installation and administration of the CrayDoc server software and Cray documentation are described in the CrayDoc Installation and Administration Guide.

4.2 Accessing Product Documentation

With this release, Cray provides books, man pages, and third-party documentation.
These documents are provided in the following ways:
• CrayDoc, the Cray documentation delivery system that allows you to quickly access and search Cray books, man pages, and in some cases, third-party documentation. Access this HTML and PDF documentation via CrayDoc at the following URLs:
  – The local network location defined by your system administrator
  – The CrayDoc public website:
• Man pages: Access man pages by entering the man command followed by the name of the man page. For more information about man pages, see the man(1) man page by entering:
  % man man
• Third-party documentation not provided through CrayDoc. Access this documentation, if any, according to the information provided with that product.

4.3 Books Provided with This Release

The books provided with this release are listed in Table 4, which also notes whether each book was updated. Many books are provided in HTML and all are provided in PDF.

Note: The Cray XT3 Systems 1.2 Release Errata includes a description of temporary limitations of this release and changes identified after the documentation for this release was packaged. You should also contact your Cray representative for other possible late problems published in Field Notices (FNs).

Table 4. Books Provided with This Release

Book Title                                                    Number       Updated
Cray XT3 Systems Software Release Overview (this document)    S–2425–12    Yes
Cray XT3 Software Installation and Configuration Guide        S–2444–12    Yes
Cray XT3 System Overview                                      S–2423–12    Yes
Cray XT3 System Management                                    S–2393–12    Yes
Cray XT3 Programming Environment User's Guide                 S–2396–12    Yes
CrayDoc Installation and Administration Guide                 S–2340–40    No

If your site has ordered the PBS Pro product for your Cray XT3 system, the following books are also provided. All PBS Pro books are provided in PDF. The PBS Pro Release Overview, Installation Guide, and Administration Addendum for Cray XT3 Systems is provided in PDF and HTML format.

Table 5. PBS Pro Books Provided with This Release

Book Title                                                    Number          Updated
PBS Pro Release Overview, Installation Guide, and
  Administration Addendum for Cray XT3 Systems                S–2438–532xt    Yes
PBS Pro 5.3 User Guide, PBS-3BU01                             S–6500–53       No
PBS Pro 5.3 External Reference Specification, PBS-3BE01       S–6501–53       No
PBS Pro 5.3 Administrator Guide, PBS-3BA01                    S–6502–53       No
PBS Pro 5.3 Quick Start Guide, PBS-3BQ01                      S–6510–53       No

4.4 Man Pages Provided with This Release

• Compiler commands: cc(1), CC(1), ftn(1), f77(1)
• Application launch commands: yod(1), xtshowmesh(1), xtshowcabs(1)
• Cray-specific MPI man page: intro_mpi(1)
• Cray SHMEM man pages: start with intro_shmem(1)
• Single-system view (SSV) man pages: xthostname(1), xtkill(1), xtps(1), xtwho(1)
• UNICOS/lc man pages: start with intro_xt3(1)
• Cray Linux man pages
• Modules software package man pages: module(1), modulefile(4)

If your site ordered CrayPat, man pages are provided: start with craypat(1). If your site ordered Cray Apprentice2, the app2(1) man page is provided. In addition, the Cray Apprentice2 GUI provides online help. If your site ordered PBS Pro, man pages are provided: start with pbs(1B).

4.5 Third-party Books Provided with This Release

Table 6. Third-party Books Provided with This Release

Book Title                                                          Number        Updated
PGI User's Guide                                                    S–6516–60     No
PGI Fortran Reference                                               S–6518–60     No
PGI Tools Guide                                                     S–6517–60     No
PGI Server 6.0 and Workstation 6.0 Installation and Release Notes   S–6539–60     No
AMD Core Math Library (ACML)                                        S–6511–26     Yes
PAPI User's Guide                                                   S–6515–306    No
PAPI Programmer's Reference                                         S–6514–307    No
PAPI Software Specification                                         S–6531–30     No
SuperLU Users' Guide                                                S–6532–10     No
FLEXlm End Users Guide                                              S–6508–95     No

4.6 Third-party Man Pages Provided with This Release

Man pages are provided for the following third-party products:
• MPICH2
• LAPACK
• ScaLAPACK
• BLACS
• PAPI
• SUSE LINUX
• Lustre

4.7 Additional Documentation Resources

Table 7 lists the resources for obtaining documentation not included in this release or for documentation in addition to that included in the release.

Table 7. Additional Documentation Resources

Product      Documentation Source
MPICH2       Additional documentation is available in HTML and PDF formats from the Argonne National Laboratory website at
             Additional information about the MPI-2 standard is available at
ScaLAPACK    The ScaLAPACK Users' Guide and ScaLAPACK tutorial are available in HTML format at
SuperLU      Additional SuperLU documentation is available at
Lustre       Additional Lustre documentation is available at
PAPI         Additional PAPI documentation is available at
MySQL        MySQL documentation is available at
DHCP         DHCP documentation is available at
FLEXlm       Additional FLEXlm documentation is available at
glibc        glibc documentation is available at
GNET         GNET documentation is available at
GLIB         GLIB documentation is available at
RPM          RPM documentation is available at

4.8 TotalView Documentation from Etnus, LLC

TotalView books and man pages for Cray XT3 systems are available from Etnus, LLC. For information about TotalView publications, see

4.9 Cray Glossary

A Cray Glossary of terms specific to the Cray XT3 system is included with CrayDoc.
The entire Cray Glossary is available on the CrayDoc public website:

Release Package [5]

This chapter contains the following information about the Cray XT3 1.2 software releases:
• Hardware and software requirements (Section 5.1)
• Optional products supported (Section 5.2)
• TotalView from Etnus, LLC (Section 5.3)
• Contents of the release package (Section 5.4)
• Licensing (Section 5.5)
• Ordering software (Section 5.6)

5.1 Hardware and Software Requirements

The supported upgrade path is from the Cray XT3 1.1 release to the Cray XT3 1.2 release. The following products run on Cray XT3 systems:
• UNICOS/lc 1.2
• Cray XT3 Programming Environment 1.2
• System Management Workstation 1.2
• CRMS 1.2

5.2 Optional Products Supported

The Cray XT3 1.2 software releases support the following optional products offered directly from Cray Inc.:
• PBS Pro 5.3.2xt
• CrayPat 1.1.0
• Cray Apprentice2 2.3.1

5.3 TotalView from Etnus, LLC

You can order a special implementation of the TotalView debugger for Cray XT3 systems from Etnus, LLC. You cannot order TotalView directly from Cray Inc. TotalView provides source-level debugging of MPI applications and is compatible with the PGI Fortran, C, and C++ compilers.
For information about ordering, installing, using, and maintaining TotalView, see

5.4 Contents of the Release Package

The Cray XT3 systems 1.2 release package includes:
• UNICOS/lc 1.2, which includes:
  – Linux kernel 2.4 and SUSE LINUX 8.2 beta
  – Catamount 1.2 microkernel
  Also included with the UNICOS/lc release package are these related products:
  – CRMS 1.2
  – Lustre 1.2
  – GNet 2.0.5 network library
  – Modules 3.1.6 user environment management utility
  – RPM Package Manager 4.1.1
  – MySQL 4.0 database manager
  – FLEXlm 9.5 license manager
  – Dynamic Host Configuration Protocol (DHCP) 3.0
• Programming Environment 1.2, which includes:
  – PGI 6.0 Fortran, C, and C++ compilers and tools (PGI 6.0 requires the FLEXlm license manager, which controls the number of simultaneous users)
  – Cray MPICH2 0.97 library of MPI-2 routines
  – Cray SHMEM 1.0 library of distributed-memory access routines
  – ACML 2.6 library of BLAS, LAPACK, and FFT routines
  – Cray XT3 LibSci 1.2 library of ScaLAPACK, BLACS, and SuperLU routines
  – Performance API (PAPI) 3.0.8
  – GNU glibc 2.4.2
• CrayDoc software suite and the documentation, described in Chapter 4
• A printed copy of this release overview
• A printed copy of the Cray XT3 Software Installation and Configuration Guide
• A printed copy of the Cray XT3 Systems 1.2 Release Errata
• CrayPat 1.1.0 (if ordered by your site)
• Cray Apprentice2 2.3.1 (if ordered by your site)
• PBS Pro batch subsystem 5.3.2xt release, which is based on PBS Pro version 5.3.1 from Altair Grid Technologies (if ordered by your site)

5.5 Licensing

Cray licenses the following as separate products for Cray XT3 systems under a Cray license agreement:
• Cray XT3 OS (which provides rights to UNICOS/lc and its components)
• Cray XT3 Programming Environment (licensed by number of simultaneous users)
• PBS Pro Batch Subsystem (optional product)
• CrayPat Performance Collector (optional product)
• Cray Apprentice2 Performance Analyzer (optional product licensed by number of simultaneous users)

The PAPIlicnotices(7) and superlulicnotices(7) man pages list the license notices for the software that Cray supplies for the Cray XT3 Programming Environment in conjunction with the software and documentation copyright distribution requirements. The gnulicnotices(7) man page lists the public license notice for the GNU Free Documentation used in the UNICOS/lc release.

For more information about licensing and pricing, contact your Cray sales representative or send e-mail to [email protected]. Customers outside the United States and Canada must sign a Letter of Assurance before software can be shipped to them. For questions about whether you have signed this agreement, or questions about which software requires this letter, send e-mail to [email protected].

5.6 Ordering Software

This release package is distributed by order only to customers who have signed a license agreement for the Cray software that includes this product. The most current revision of the release package is supplied. To receive any upgrades to a given Cray product, the customer must also have a signed support agreement for this Cray software.

You can order the release package from the Cray Software Distribution Center in any of the following ways:

E-mail: [email protected]
CRInform (for subscribers): Click on the Order Cray Software link.
Telephone (inside U.S., Canada): 1–800–284–2729 (BUG CRAY), then 605–9100
Telephone (outside U.S., Canada): +1–651–605–9100
Fax: +1–651–605–9001
Mail: Software Distribution Center
      Cray Inc.
      1340 Mendota Heights Road
      Mendota Heights, MN 55120–1128
      USA

Software will be shipped by ground service or 5-day international service.

Customer Services [6]

This chapter describes the customer services that support this release.
6.1 Technical Assistance with Software Problems

If you experience problems with Cray software, contact your Cray service representative. Your service representative will work with you to resolve the problem. If you choose to have full- or part-time support on site, your on-site personnel are your primary contacts for service. If you have elected not to have on-site support, please call or send e-mail to the Cray Customer Support Center:

E-mail: [email protected]
Telephone (inside U.S., Canada): 1–800–950–2729 (CRAY)
Telephone (outside U.S., Canada): +1–715–726–4993
Fax: +1–651–605–9001