
Managing Lustre for the Cray Linux Environment™ (CLE) S–0010–40

© 2008–2011 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE

The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable.

Cray, LibSci, and PathScale are federally registered trademarks and Active Manager, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray CX, Cray CX1, Cray CX1-iWS, Cray CX1-LC, Cray CX1000, Cray CX1000-C, Cray CX1000-G, Cray CX1000-S, Cray CX1000-SC, Cray CX1000-SM, Cray CX1000-HN, Cray Fortran Compiler, Cray Linux Environment, Cray SHMEM, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XE, Cray XEm, Cray XE5, Cray XE5m, Cray XE6, Cray XE6m, Cray XK6, Cray XMT, Cray XR1, Cray XT, Cray XTm, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray XT5m, Cray XT6, Cray XT6m, CrayDoc, CrayPort, CRInform, ECOphlex, Gemini, Libsci, NodeKARE, RapidArray, SeaStar, SeaStar2, SeaStar2+, The Way to Better Science, Threadstorm, and UNICOS/lc are trademarks of Cray Inc. Engenio and LSI are trademarks of LSI Corporation. InfiniBand is a trademark of the InfiniBand Trade Association. Linux is a trademark of Linus Torvalds in the U.S. and other countries. RSA is a trademark of RSA Security Inc. Lustre is a trademark of Oracle and/or its affiliates. All other trademarks are the property of their respective owners.

RECORD OF REVISION

S–0010–40  Published June 2011. Supports the Cray Linux Environment (CLE) 4.0 release and the System Management Workstation (SMW) 6.0 release.

3.1.03  Published March 2011. Supports the Cray Linux Environment (CLE) 3.1.UP03 update release and the System Management Workstation (SMW) 5.1.UP03 release.

3.1  Published June 2010. Supports the Cray Linux Environment (CLE) 3.1 release and the System Management Workstation (SMW) 5.1 release.

3.0  Published March 2010. Supports the Cray Linux Environment (CLE) 3.0 release and the System Management Workstation (SMW) 5.0 release.

2.2  Published July 2009. Supports general availability versions of the Cray Linux Environment (CLE) 2.2 release running on Cray XT systems.

1.0  Published November 2008. Supports the Cray Linux Environment (CLE) 2.1 release running on Cray XT systems.

Changes to this Document

S–0010–40

Added information
• The lustre_control.sh -c option, introduced in CLE 3.1.UP03, no longer requires passwordless ssh to operate correctly; this lifted requirement makes the command more available, and therefore more useful, to administrators. For more information, see Mounting Lustre Clients on page 20 or the lustre_control.sh(8) man page.
• Disk quotas provide administrators with the ability to set the amount of disk space available to users and groups.
To enable quotas in Lustre file systems, see Lustre User and Group Quotas on page 33.

Revised information
• Recovering From a Failed OST on page 30 contains revised procedures for handling a failed OST and subsequent recovery.

S–0010–3103

Added information
• The lustre_control.sh command has several new options. The -c option allows the mounting and unmounting of compute node Lustre clients during system operation. The -n option allows you to specify a list of nodes to receive the mount_clients and umount_clients actions, and the -m option allows you to specify a mount point. For more information, see Mounting Lustre Clients on page 20.
• New section: Mounting and Unmounting Lustre Clients Manually on page 21.

Revised information
• Recovering From a Failed OST on page 30 contains revised procedures for handling a failed OST and subsequent recovery.

S–0010–31

Added information
• The NETTYPE parameter in filesystem.fs_defs supports Cray XE systems with the Gemini based system interconnection network; see Lustre Control Utilities and File System Definition Parameters on page 14.
• Lustre 1.8 uses the Linux page cache to provide read-only caching of data on object storage servers (OSS). You can disable this functionality; see Disabling OSS Read Cache and Writethrough Cache on page 32.
• Procedures for using Lustre control utilities to update a file system definition file when hostnames, device names, or node identifiers change; see Ensuring a File System Definition is Consistent with Cray System Configurations on page 34.
• Procedures for using Lustre imperative recovery to facilitate faster recovery in the event of a Lustre failover; see Imperative Recovery on page 59.
• New section: Recovering From a Failed OST on page 30.
• New section: Configuring Lustre Failover for Multiple File Systems on page 58.

Contents

Introduction to Lustre on a Cray System [1]
1.1 Lustre File System Documentation
1.2 Lustre Software Components
1.3 Lustre Framework

Lustre File System Configuration [2]
2.1 Configuring Lustre File Systems on a Cray System
2.2 Lustre Control Utilities and File System Definition Parameters
2.3 Creating Lustre File Systems
2.4 Mounting Lustre Clients
2.5 Mounting and Unmounting Lustre Clients Manually
2.6 Lustre Option for Panic on LBUG for CNL and Service Nodes
2.7 Updating the Bootimage

Lustre File System Management [3]
3.1 Storage and Network Information
3.1.1 Storage Devices for Lustre
3.1.2 Lustre Networking
3.2 Configuring Striping on Lustre File Systems
3.2.1 Configuration and Performance Trade-off for Striping
3.2.2 Overriding File System Striping Defaults
3.3 Lustre System Administration
3.3.1 Lustre Commands for System Administrators
3.3.2 Identifying Metadata Server (MDS) and Object Storage Targets (OST)
3.3.3 Starting and Stopping Lustre
3.3.4 Adding an OST
3.3.5 Recovering From a Failed OST
3.3.6 Disabling OSS Read Cache and Writethrough Cache
3.3.7 Checking Lustre Disk Usage
3.3.8 Lustre User and Group Quotas
3.3.9 Checking the Lustre File System
3.4 Ensuring a File System Definition is Consistent with Cray System Configurations
3.4.1 lustre_control.sh Options Used to Verify File System Configuration
3.4.2 Updating the File System Definition File to Use Persistent Device Names
3.4.3 Examples Using the verify_config, update_config, and dump_target_devnames Options of the lustre_control.sh Script
3.5 Troubleshooting
3.5.1 Dumping Lustre Log Files
3.5.2 Lustre Users Report ENOSPC Errors
3.5.3 File System Error Messages

Lustre Failover [4]
4.1 Lustre Failover on Cray Systems
4.1.1 Node Types for Failover
4.2 Lustre Manual Failover
4.2.1 Configuring Manual Lustre Failover
4.2.2 Performing Manual Failover
4.2.3 Monitoring Manual Failover
4.3 Lustre Automatic Failover
4.3.1 Lustre Automatic Failover Database Tables
4.3.1.1 The filesystem Database Table
4.3.1.2 The lustre_service Database Table
4.3.1.3 The lustre_failover Database Table
4.3.2 Backing Up SDB Table Content
4.3.3 Using the xtlusfoadmin Command
4.3.4 System Startup and Shutdown when Using Automatic Lustre Failover
4.3.5 Configuring Lustre Failover for Multiple File Systems
4.3.6 Imperative Recovery

Lustre Failback [5]
5.1 Lustre Failback
5.1.1 Failback in Manual and Automatic Failover

Glossary

Procedures
Procedure 1. Creating, formatting and starting Lustre file systems
Procedure 2. Creating Lustre clients for compute nodes
Procedure 3. Unmounting Lustre from compute node clients
Procedure 4. Starting Lustre
Procedure 5. Stopping Lustre
Procedure 6. Adding an Object Storage Server and Targets
Procedure 7. Deactivating a failed OST and removing striped files
Procedure 8. Reformatting a single OST
Procedure 9. Enabling Lustre user and group quotas
Procedure 10. Updating the file system definition as part of a hardware or software upgrade
Procedure 11. Configuring the MDS for manual failover
Procedure 12. Configuring secondary OSTs for failover
Procedure 13. Disabling automatic failover
Procedure 14. Performing Lustre manual failover
Procedure 15. Lustre startup procedures for automatic failover
Procedure 16. Lustre shutdown procedures for automatic failover
Procedure 17. Combining failover tables for multiple file systems
Procedure 18. Shutting down a single file system in a multiple file system configuration
Procedure 19. Performing failback

Examples
Example 1. File striping
Example 2. Identifying MDS and OSTs
Example 3. Checking the status of individual nodes
Example 4. Checking Lustre disk usage
Example 5. MDT device name differs from the name specified in filesystem.fs_defs
Example 6. Hostname/NID for OSS differs from that listed in filesystem.fs_defs
Example 7. Using dump_target_devnames to list the host and device names for the current file system configuration
Example 8. OST recovery after failover

Tables
Table 1. Lustre Administrative Commands Provided with CLE
Table 2. filesystem SDB Table Fields
Table 3. lustre_service SDB Table Fields
Table 4. lustre_failover SDB Table Fields
Table 5. Lustre Automatic Failover SDB Table Dump Utilities

Figures
Figure 1. Layout of Lustre File System
Figure 2. Lustre Active/Active Failover Example

Introduction to Lustre on a Cray System [1]

The Lustre file system is optional on a Cray system with the Cray Linux Environment (CLE); your storage RAID may be configured with other file systems as your site requirements dictate. Lustre is a scalable, high-performance, POSIX-compliant file system that consists of software subsystems, storage, and an associated network. Lustre uses the ldiskfs file system from Oracle for back-end storage. The ldiskfs file system is an extension to the Linux ext4 file system with Oracle enhancements for Lustre. The packages for the Lustre file system are installed during the CLE software installation.

1.1 Lustre File System Documentation

This manual documents the differences in Cray's Lustre implementation and utilities specific to the Cray Linux Environment (CLE). The Lustre Operations Manual from Oracle is included as a PDF file in the release package. Additional information about Lustre is found at the following websites:

http://wiki.lustre.org
http://wiki.lustre.org/index.php/Lustre_Documentation
http://www.oracle.com/us/products/servers-storage/storage/storage-software/031855.htm

Note: The Lustre information presented in this guide is based, in part, on documentation from Oracle.
Lustre information contained in Cray publications supersedes information found in Oracle publications.

1.2 Lustre Software Components

The following software components of Lustre can be implemented on selected nodes of the Cray system.

• Client
Clients are services or programs that access the file system. On Cray systems, clients are typically associated with login or compute nodes.

• Object storage target (OST)
An object storage target (OST) is the software interface to back-end storage volumes. There may be one or more OSTs. The OSTs handle file data and enforce security for client access. The client performs parallel I/O operations across multiple OSTs. You configure the characteristics of the OSTs during the Lustre setup.

• Object storage server (OSS)
An object storage server (OSS) is a node that hosts the OSTs. Each OSS node, referenced by node ID (NID), has Fibre Channel or InfiniBand connections to a RAID controller. The OST is a logical device; the OSS is the physical node.

• Metadata server (MDS)
The metadata server (MDS) owns and manages information about the files in the Lustre file system. It handles namespace operations such as file creation, but it does not contain any file data. It stores information about which file is located on which OSTs, how the blocks of files are striped across the OSTs, the date and time the file was modified, and so on. The MDS is consulted whenever a file is opened or closed. Because file namespace operations are done by the MDS, they do not impact operations that manipulate file data. You configure the characteristics of the MDS during the Lustre setup.

• Metadata target (MDT)
The metadata target (MDT) is the software interface to back-end storage volumes for the MDS and stores metadata for the Lustre file system.

• Management server (MGS)
The management server (MGS) controls the configuration information for all Lustre file systems running at a site. Clients and servers contact the MGS to retrieve or change configuration information. Cray installation and upgrade utilities automatically create a default Lustre configuration in which the MGS and the metadata server (MDS) are co-located on a service node and share the same physical device for data storage.

1.3 Lustre Framework

The system processes (Lustre components) that run on the nodes described in the previous section are referred to as Lustre services throughout this chapter. The interactions of these services make up the framework for the Lustre file system as follows:

The metadata server (MDS) transforms client requests into journaled, batched metadata updates on persistent storage. The MDS can batch large numbers of requests from a single client, or it can batch large numbers of requests generated by different clients, such as when many clients are updating a single object. Once objects have been pre-created by the MDS, the object storage server (OSS) handles remote procedure calls from clients and relays the transactions to the appropriate objects. The OSS read cache uses the Linux page cache to store data on a server until it can be written. Site administrators and analysts should take this into consideration as it may impact a service node's memory requirements. For more information, see Disabling OSS Read Cache and Writethrough Cache on page 32.

You configure the characteristics of the MDS, such as how files are stored across OSTs, as part of the Lustre setup.
Each pair of subsystems acts according to protocol:

• MDS-Client: The MDS interacts with the client for metadata handling such as the acquisition and updates of inodes, directory information, and security handling.
• OST-Client: The object storage target interacts with the client for file data I/O, including the allocation of blocks, striping, and security enforcement.
• MDS-OST: The MDS and OST interact to preallocate resources and perform recovery.

Lustre layout is shown in Figure 1.

Note: You can create multiple Lustre file systems. A given MDS, OSTs, and associated clients constitute one Lustre file system. If you want to create another instance of Lustre, you must configure it with separate disk devices.

Figure 1. Layout of Lustre File System. (The figure shows a compute node running a user application and I/O library routines, connected through LNET/LND and the system interconnection network to the MDS, which holds the metadata, and to an OSS backed by RAID storage.)

The Lustre framework enables you to structure your file system installation to match your data transfer requirements. One MDS plus one or more OSTs make up a single instance of Lustre and are managed together. Client nodes mount the Lustre file system over the network and access files with POSIX file system semantics. Each client mounts Lustre, uses the MDS to access metadata, and performs file I/O directly through the OSTs.

Lustre File System Configuration [2]

Follow the procedures in this chapter to configure Lustre file systems on a Cray system. If you are upgrading your system from CLE 3.1 to CLE 4.0, see Installing and Configuring Cray Linux Environment (CLE) Software for upgrade instructions.

2.1 Configuring Lustre File Systems on a Cray System

The CLE software includes Lustre control utilities from Cray. These utilities provide a layer of abstraction to the standard Lustre configuration and mount utilities by implementing a centralized configuration file and describing each Lustre file system in a site-specific file system definition file. The Cray Lustre control scripts use the information in the file system definition file to interface with Lustre's MountConf system and Management Server (MGS). When using the Lustre control configuration utilities, system administrators do not need to access the MGS directly.

The file system definition file, filesystem.fs_defs, describes the characteristics of a file system: MDS, OST, clients, network, and storage specifications. The first task in setting up a Lustre file system on a Cray system is to create the file system definition file with values appropriate for your site. Each file system definition file represents one file system. If you have more than one Lustre file system, you must create more than one file system definition file.

The generate_config.sh utility uses the information in the file system definition file to generate a Comma Separated Value (CSV) configuration file for the Lustre file system (multiple CSV files are generated if Lustre failover is enabled). The name and location of the CSV file is specified in the filesystem.fs_defs file. The lustre_control.sh utility generates the appropriate commands to manage Lustre file system operations on a CLE system. The xt.lustre.config configuration file provides additional configuration information specific to Lustre on Cray systems. By convention, the Lustre control utilities are located in /etc/opt/cray/lustre-utils on the boot node.
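For reference, listing this directory on the boot node shows the sample definition file and the utilities described in this chapter. The listing below is illustrative only; the exact contents vary by release and by the definition files your site has already created:

boot:~ # ls /etc/opt/cray/lustre-utils
generate_config.sh  lustre_control.sh  sample.fs_defs  xt.lustre.config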
To specify the characteristics of the Lustre file system, follow the procedure described in Procedure 1 on page 18. For example, you can define the networking at each node, the location, name and size of the Lustre components, and the mount point of the file system.

Service node and compute node clients reference Lustre like a local file system. References to Lustre are handled transparently through the virtual file system (VFS) switch in the kernel. Lustre file systems can be mounted and unmounted with the mount_clients and umount_clients options of the Lustre control utilities. The Lustre file systems are mounted on compute node clients automatically during startup if they are included in the corresponding compute node image /etc/fstab file. Lustre file systems can also be manually mounted using the mount command.

2.2 Lustre Control Utilities and File System Definition Parameters

When you use the Lustre control utilities (see Creating Lustre File Systems on page 18), the first step is to create a Lustre file system definition file (filesystem.fs_defs) for each Lustre file system on your Cray system. A sample file system definition file is provided in /etc/opt/cray/lustre-utils/sample.fs_defs on the boot node. This section describes the parameters that define your Lustre file system. The descriptions use the following conventions for node and device naming:

• nodename is a host or node name using the format nidxxxxx; for example, nid00008.
• device is a device path using the format /dev/disk/by-id/ID-partN, where ID is the volume identifier and partN is the partition number (if applicable); for example:
/dev/disk/by-id/scsi-3600a0b800026e1400000192e4b66eb97-part2

Note: If you change any of the parameters in your filesystem.fs_defs file, you will need to run the lustre_control.sh write_conf command to regenerate the Lustre configuration and apply your changes.

FSNAME
  Unique name for the Lustre file system defined by this filesystem.fs_defs file. Limited to 8 characters. Used internally by Lustre.

LUSTRE_CSV
  The path of the *.csv configuration file that is created by the generate_config.sh script. The default path is /etc/opt/cray/lustre-utils/${FSNAME}.config.csv.

LUSNID[NID]
  The hostname to NID table. Not required, but if it is not set, xtprocadmin is used to generate these values, which requires the Service Database (SDB) to be up when Lustre configuration scripts are run. Setting LUSNID will improve Lustre start and stop times on larger systems. Format: nodename. The index must be the node ID of the specified node; for example, LUSNID[18]="nid00018".

MDSHOST
  The hostname of the MDS node. Format: nodename.

MDSDEV
  MDS physical device. Format: ${MDSHOST}:device.
  Note: If you are configuring Lustre failover, use the failover format example in the comments. For more information, see Lustre Failover on Cray Systems on page 45.

MGSDEV
  MGS physical device. In a default configuration, this parameter is commented out and the MGS and MDS are co-located on a service node and share the same physical device for data storage; use this parameter to designate a separate MGS device. Specify the node and physical device to start the MGS with this file system; specify the node only if another file system will start the MGS. Format: nodename[:device].
  Note: The file system associated with the MGS must be started first and stopped last.

OSTDEV[n]
  Table of OST devices.
  The index [n] is the OST number. The index can start at either 0 or 1. Format: nodename:device, where nodename can be ${LUSNID[n]} if LUSNID is defined.
  Note: If you are configuring Lustre failover, use the failover format example in the comments. For more information, see Lustre Failover on Cray Systems on page 45.

MOUNTERS
  Hostnames for the service nodes that mount this file system, usually login nodes. Format: "host1 host2" or pdsh syntax, for example host[1-2],host5.

MOUNT_POINT
  Path for the client mount point on service nodes specified in the MOUNTERS parameter.

NETTYPE
  Type of underlying interconnect. Set to gni for systems with the Gemini based system interconnection network.

STRIPE_SIZE
  Stripe size in bytes. Cray recommends a default value of 1048576 bytes (1 MB). For more information, see Configuring Striping on Lustre File Systems on page 24.

STRIPE_COUNT
  Integer count of the default number of OSTs used for a file. Valid range is 1 to the number of OSTs. A value of -1 specifies striping across all OSTs. Cray recommends a stripe count of 2 to 4 OSTs. For more information, see Configuring Striping on Lustre File Systems on page 24.

QUOTAOPTS
  Set to quotaon=ug to enable user and group quotas for the file system, or to quotaon=g to enable only group quotas. For information on Lustre quotas, see the Lustre Operations Manual.

AUTO_FAILOVER
  Set to yes to enable automatic failover when failover is configured; set to no to select manual failover. The default setting is yes.

ENABLE_IMP_RECOVERY
  Set to yes to enable imperative recovery (explicit client notification during the failover process). The default setting is no.

RECOVERY_TIME_HARD
  Specifies a hard recovery window timeout for failover. The server will incrementally extend its timeout up to a hard maximum of RECOVERY_TIME_HARD seconds. The default hard recovery timeout is set to 900 seconds (15 minutes).

RECOVERY_TIME_SOFT
  Specifies a rolling recovery window timeout for failover. This value should be less than or equal to RECOVERY_TIME_HARD. Allows RECOVERY_TIME_SOFT seconds for clients to reconnect for recovery after a server crash. This timeout will incrementally extend if it is about to expire and the server is still handling new connections from recoverable clients. The default soft recovery timeout is set to 300 seconds (5 minutes).

You can modify the following parameters; however, in most cases the default value is preferred. Only experienced Lustre administrators should change these options.

EXT3_JRNL_SIZE
  Journal size, in megabytes, on underlying ldiskfs file systems. The default value is 400.

OST_MOUNTFSOPTIONS
  Mount options for the OSTs. The default value is extents,mballoc,errors=remount-ro.

MDS_MOUNTFSOPTIONS
  Mount options for the MDS. This parameter is not required for the MDS and is commented out by default.

CLIENT_MOUNTOPTIONS
  Mount options for clients such as service nodes. The flock option is required and included by default.

MDS_MKFSOPTIONS
  Options used when creating an MDS file system. These options are passed as --mkfsoptions to the mkfs.lustre utility when the file system is created. This parameter is commented out by default.

OST_MKFSOPTIONS
  Options used when creating an OST file system. These options are passed as --mkfsoptions to the mkfs.lustre utility when the file system is created. This parameter is commented out by default.
LUSTRE_FILESYS_DATA
  The path of the file that is used to load data into the filesystem SDB table. The default is $FSNAME.filesys.csv in the directory specified for LUSTRE_CSV. The generate_config.sh utility creates or overwrites this file with the proper failover information.

LUSTRE_FAILOVER_DATA
  The path of the file that is used to load data into the lustre_failover SDB table. The default is ${FSNAME}.lustre_failover.csv in the directory specified for LUSTRE_CSV. The generate_config.sh utility creates or overwrites this file with the proper failover information.

LUSTRE_SERVICE_DATA
  The path of the file that is used to load data into the lustre_service SDB table. The default is ${FSNAME}.lustre_serv.csv in the directory specified for LUSTRE_CSV. The generate_config.sh utility creates or overwrites this file with the proper failover information.

SERVERMNT
  Directory prefix for Lustre servers to use when mounting the OSTs. The default path is /tmp/lustre.

TIMEOUT
  Lustre timeout in seconds. The default value is 300.

FSTYPE
  Lustre file system type. The default value is ldiskfs.

PDSH
  Command syntax for pdsh. The default value is "pdsh -f 256 -S".

VERBOSE
  Verbose output flag. The default setting is yes.

The lustre.fs_defs(5) man page also includes this information about file system definition parameters.

2.3 Creating Lustre File Systems

Use the Cray Lustre control utilities to configure your system to use Lustre file systems. Follow Procedure 1 on page 18 to create file system definition files, start and stop Lustre servers, and format and mount Lustre file systems.

Caution: You must use persistent device names in the Lustre file system definition file. Non-persistent device names (for example, /dev/sdc) can change when the system is rebooted. If non-persistent names are specified in the filesystem.fs_defs file, Lustre may try to mount the wrong devices and fail to start when you reboot the system. If you are upgrading from a release that did not require persistent device names, you must convert to persistent device names to avoid this problem. Use the verify_config, update_config, and dump_target_devnames options with lustre_control.sh to update existing filesystem.fs_defs files to use persistent names.

For more information about Lustre control utilities, see the lustre_control.sh(8), generate_config.sh(8) and lustre.fs_defs(5) man pages.

Procedure 1. Creating, formatting and starting Lustre file systems

Follow these steps to configure, create, and start Lustre file systems. This example uses the following Lustre configuration, where IDn represents the volume identifier on the disk, for example, scsi-3600a0b800026e1400000192e4b66eb97:

MDS is on nid00012, /dev/disk/by-id/IDa
OST0 is on nid00018, /dev/disk/by-id/IDb
OST1 is on nid00026, /dev/disk/by-id/IDc
OST2 is on nid00018, /dev/disk/by-id/IDd
OST3 is on nid00026, /dev/disk/by-id/IDe
Login nodes are nid00008 and nid00030

1. Create a filesystem.fs_defs file for each Lustre file system you want to configure.

Note: Lustre control utilities require that you name this file by using your file system name followed by .fs_defs. For example, if your file system is called filesystem, you must name the Lustre configuration definition file filesystem.fs_defs. The file system name used here must match the name used in step 3. This name does not need to match the FSNAME parameter set in the file itself. FSNAME is used internally by Lustre to uniquely identify each file system.
For example, to create a file system definition file for a file system called filesystem, type the following commands:

boot:~ # cd /etc/opt/cray/lustre-utils
boot:/etc/opt/cray/lustre-utils # cp -p sample.fs_defs filesystem.fs_defs

2. Edit each newly created filesystem.fs_defs file and modify the configuration parameters to define your Lustre configuration. For example:

boot:/etc/opt/cray/lustre-utils # vi filesystem.fs_defs
FSNAME="lus0"
MDSHOST="nid00012"
MDSDEV="${MDSHOST}:/dev/disk/by-id/IDa"
OSTDEV[0]="nid00018:/dev/disk/by-id/IDb"
OSTDEV[1]="nid00026:/dev/disk/by-id/IDc"
OSTDEV[2]="nid00018:/dev/disk/by-id/IDd"
OSTDEV[3]="nid00026:/dev/disk/by-id/IDe"
MOUNTERS="nid00008,nid00030"
MOUNT_POINT="/mnt/filesystem"
STRIPE_SIZE=1048576
STRIPE_COUNT=2

Note: You must include quotes around values that contain spaces. For additional information about Lustre configuration parameters, see the lustre.fs_defs(5) man page.

3. Edit xt.lustre.config to enable the /etc/init.d/lustre startup script to start the Lustre file system at boot time. For every filesystem.fs_defs file, add the file system name to the FILESYSTEMS= line in xt.lustre.config. For example:

boot:/etc/opt/cray/lustre-utils # vi xt.lustre.config
FILESYSTEMS="filesystem"

If you have more than one Lustre file system, include all configured file system names, separated by a space. For example:

FILESYSTEMS="filesystem filesystem2"

4. Generate CSV files for your file systems. Type this command for each filesystem.fs_defs file you created in step 1.

boot:/etc/opt/cray/lustre-utils # ./generate_config.sh filesystem.fs_defs
no LUSNID defined in .fs_defs, gathering nids from xtprocadmin...
Created Lustre config at /etc/opt/cray/lustre-utils/filesystem.config.csv
No failover configuration specified.

5. Create the file system mount point on the shared root file system in the default view. Type these commands for each Lustre file system you have configured. For example, if MOUNT_POINT is set to /mnt/filesystem, you would type:

boot:/etc/opt/cray/lustre-utils # xtopview
default/:/ # mkdir -p /mnt/filesystem
default/:/ # exit
boot:/etc/opt/cray/lustre-utils #

6. Format the file systems and start the Lustre servers. Type this command for each Lustre file system you have configured.

boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs reformat

Note: This process takes a while, possibly an hour or more.

7. Mount the service node clients. Type this command for each Lustre file system you have configured.

boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs mount_clients

8. Verify that the Lustre clients have the Lustre file systems mounted.

boot:/etc/opt/cray/lustre-utils # ssh login mount | grep lustre
12@gni:/lus0 on /mnt/filesystem type lustre (rw,flock)

9. Stop Lustre clients and servers. Type these commands for each Lustre file system you have configured.

boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs umount_clients
boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs stop

10. Test the /etc/init.d/lustre script. Start the Lustre servers and mount the Lustre clients.

boot:/etc/opt/cray/lustre-utils # /etc/init.d/lustre start

11. Verify that the Lustre clients have the Lustre file systems mounted.

boot:/etc/opt/cray/lustre-utils # ssh login mount | grep lustre
12@gni:/lus0 on /mnt/filesystem type lustre (rw,flock)

12. Exit from the boot node.
boot:/etc/opt/cray/lustre-utils # exit

13. Update the boot automation scripts on the SMW. Any site boot automation scripts need to be changed to reflect the new method of starting Lustre servers and mounting Lustre file systems on the clients. Type these commands, adding the line as shown.

Note: You must start Lustre on the service nodes before you boot the compute nodes.

smw:~# vi /opt/cray/etc/auto.xthostname
lappend actions { crms_exec_on_bootnode "root" "/etc/init.d/lustre start"}

2.4 Mounting Lustre Clients

Service node and compute node clients reference Lustre as a local file system. Service nodes mount Lustre by using the Lustre startup scripts before the compute nodes boot. The Lustre file systems are mounted on compute nodes automatically during startup if they are included in the corresponding /etc/fstab file. Follow Procedure 2 on page 21 to mount Lustre services on compute nodes.

Procedure 2. Creating Lustre clients for compute nodes

Note: To make these changes for a system partition, rather than for the entire system, replace /opt/xt-images/templates with /opt/xt-images/templates-pN, where N is the partition number.

1. Edit the /etc/fstab file in the default CNL boot image template. Add an entry for each file system you have configured. For example, if FSNAME=lus0, MOUNT_POINT=/mnt/filesystem, NETTYPE=gni and MDSHOST=nid00012:

smw:~# vi /opt/xt-images/templates/default/etc/fstab
12@gni:/lus0 /mnt/filesystem lustre rw,flock 0 0

2. For each Lustre file system you have configured, create the file system mount point in the default boot image template. For example, if MOUNT_POINT=/mnt/filesystem:

smw:~# mkdir -p /opt/xt-images/templates/default/mnt/filesystem

3. Update the boot image to include these changes.

Note: You can defer this step and update the boot image once before you finish booting the system.

2.5 Mounting and Unmounting Lustre Clients Manually

While boot time mounting is handled automatically, occasionally you will find the need to mount or unmount Lustre clients while the system is running. The mount_clients and umount_clients actions of the lustre_control.sh command allow you to do this. By adding the -c option, you can mount or unmount Lustre from the compute node clients, which can prevent them from flooding the system with connection RPCs (Remote Procedure Calls) when Lustre services on an MDS or OSS node are stopped. For more flexibility, the -m and -n options allow you to specify a mount point and a list of nodes to receive the mount or unmount commands. For more information, see the lustre_control.sh(8) man page.

Procedure 3. Unmounting Lustre from compute node clients

• To unmount Lustre from all compute node clients, enter the following command:

boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs umount_clients -c

2.6 Lustre Option for Panic on LBUG for CNL and Service Nodes

A Lustre configuration option, panic_on_lbug, is available to control Lustre behavior when processing a fatal file system error.

When a Lustre file system hits an unexpected data condition internally, it produces an LBUG error to guarantee overall file system data integrity. This renders the file system on the node inoperable. In some cases, an administrator wants the node to remain functional; for example, when there are dependencies such as a login node that has several other mounted file systems.
However, there are also cases where the desired effect is for the LBUG to cause a node to panic. Compute nodes are good examples, because when this state is triggered by a Lustre or system problem, a compute node is essentially useless.

By default, the panic_on_lbug functionality is active on both service and compute nodes. To disable it on service nodes, add the following line to /etc/modprobe.conf.local in the shared root by using xtopview on the boot node; for compute nodes, add the following line to /opt/xt-images/templates/default/etc/modprobe.conf on the SMW.

options libcfs libcfs_panic_on_lbug=0

After making these changes, update the boot image by following the steps in Updating the Bootimage on page 22.

2.7 Updating the Bootimage

To incorporate compute node changes to /etc/fstab and /etc/modprobe.conf, a new boot image must be created. Update the boot image using the procedures located in either Installing and Configuring Cray Linux Environment (CLE) Software or Managing System Software for Cray XE and Cray XT Systems.

Lustre File System Management [3]

Topics covered in this chapter include Lustre storage device information, Lustre networking, basic commands, optional features, system administration topics, and troubleshooting. The Lustre file system is optional, and your storage RAID may be configured with other file systems as your site requirements dictate. Applications running on CNL compute nodes can also perform I/O operations via Cray Data Virtualization Service (Cray DVS). For more information on Cray DVS, see Introduction to Cray Data Virtualization Service.

3.1 Storage and Network Information

3.1.1 Storage Devices for Lustre

Direct-attached Lustre file systems access the logical units (LUNs) of storage on external RAIDs through either Fibre Channel or InfiniBand connections on the file system nodes of the Cray system. This release supports a LUN upper limit size of 8 TB (although your storage RAID may have a smaller LUN-size restriction).

Note: Devices formerly known by the Engenio name are now referred to as LSI or LSI Engenio devices. Documentation for supported storage devices is shipped with those devices. If you need documentation for your RAID array(s), please contact your Cray service representative.

3.1.2 Lustre Networking

Lustre contains a networking model known as LNET. LNET provides support for multiple networks and physical interconnects, and for routing between multiple networks, with simple configuration. Lustre Network Drivers (LNDs) for the supported network fabrics have been created for this LNET infrastructure. The Generic LND (gnilnd) is used for Lustre on Cray XE systems with the Gemini based system interconnection network. There are also LNDs for use with the InfiniBand protocol (o2iblnd) and TCP/IP (socklnd).

Note: socklnd does not support RDMA.

On Gemini systems, Lustre processes events using the kernel-level Generic Network Interface (kGNI) and the generic hardware abstraction layer (gHAL) that you specify with the NETTYPE=gni parameter of the Lustre file system definition file (filesystem.fs_defs). For more information, see Managing System Software for Cray XE and Cray XT Systems and the Lustre Operations Manual.

3.2 Configuring Striping on Lustre File Systems

Striping is the process of distributing data from a single file across more than one device. To improve file system performance for a few very large files, you can stripe files across several or all OSTs.
The file system default striping pattern is determined by the STRIPE_SIZE and STRIPE_COUNT parameters in the Lustre file system definition file. These parameters are defined as follows:

STRIPE_COUNT
  The number of OSTs that each file is striped across. You can stripe across any number of OSTs, from a single OST to all available OSTs.

STRIPE_SIZE
  The number of bytes in each stripe. This much data is written to each stripe before starting to write in the next stripe. The default is 1048576.

You can also choose to override striping for individual files. See Example 1.

Warning: Striping can increase the rate at which data files can be read or written. However, reliability decreases as the number of stripes increases. Damage to a single OST can cause loss of data in many files.

When configuring striping for Lustre file systems, Cray recommends:

• Striping files across one to four OSTs. A stripe count greater than 2 gives good performance for many types of jobs. For larger file systems, a larger stripe width may improve performance.
• Choosing the default stripe size of 1 MB (1048576 bytes). You can increase stripe size by powers of two, but there is rarely a need to configure a stripe size greater than 2 MB.

Note: Do not choose a smaller stripe size, even for files with writes that are smaller than the stripe size. This may result in degraded I/O bandwidth.

3.2.1 Configuration and Performance Trade-off for Striping

For maximum aggregate performance, it is important to keep all OSTs occupied. Consider the following circumstances when striping your file system:

• When many clients in a parallel application are each creating their own files, and the number of clients is significantly larger than the number of OSTs, the best aggregate performance is achieved when each object is put on only a single OST. See Installing and Configuring Cray Linux Environment (CLE) Software for more information about configuring OSTs at install time.
• At the other extreme, for applications where multiple processes are all writing to one large (sparse) file, it is better to stripe that single file over all of the available OSTs. Similarly, if a few processes write large files in large chunks, it is a good idea to stripe over enough OSTs to keep the OSTs busy on both the write and the read path.

3.2.2 Overriding File System Striping Defaults

Each Lustre file system is built with a default stripe pattern that is specified in filesystem.fs_defs. However, users may select alternative stripe patterns for specific files or directories with the lfs setstripe command, as shown in Example 1. For more information, see the lfs(1) man page.

Example 1. File striping

The lfs setstripe command has the following syntax:

lfs setstripe filename -s stripe_size -c stripe_count -i stripe_start

This example creates the file npf with a 2 MB (2097152 byte) stripe that starts on OST0 (0) and stripes over two OSTs (2).

$ lfs setstripe npf -s 2097152 -c 2 -i 0

Here -s specifies the stripe size, -c specifies the stripe count, and -i specifies the index of the starting OST. The first two megabytes of npf, bytes 0 through 2097151, are placed on OST0, then the third and fourth megabytes, 2097152-4194303, are placed on OST1. The fifth and sixth megabytes are again placed on OST0, and so on.
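To confirm the layout that was actually applied to npf, you can read it back with the lfs getstripe command, which is also used later in Procedure 6. The exact output format depends on your Lustre version, but for a file created as shown above it lists two OST objects, beginning at OST index 0:

$ lfs getstripe npf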
The following special values are defined for the lfs setstripe options:

stripe_size=0  Uses the file system default for stripe size.
stripe_start=-1  Uses the default behavior for setting OST values.
stripe_count=0  Uses the file system default for stripe count.
stripe_count=-1  Uses all OSTs.

3.3 Lustre System Administration

This section provides descriptions for some Lustre system administration procedures. For complete Lustre administrative information, see the Lustre documentation referenced at the beginning of this chapter.

3.3.1 Lustre Commands for System Administrators

Cray provides administrative commands that configure and maintain Lustre file systems as shown in Table 1; man pages are accessible by using the man command on your Cray system. For more information about standard Lustre system administration, see the following man pages: Lustre(7), mount(8), mkfs.lustre(8), tunefs.lustre(8), mount.lustre(8), lctl(8), and lfs(1).

Table 1. Lustre Administrative Commands Provided with CLE

lustre_control.sh
  Manages Lustre file systems using standard Lustre commands and a customized Lustre file system definition file.
generate_config.sh
  Generates a file system configuration file based on information provided in the filesystem.fs_defs Lustre file system definition file.
xtlusfoadmin
  Displays the contents of the lustre_failover, lustre_service, and filesystem database tables in the Service Database (SDB). It is also used by the system administrator to update database fields to enable or disable automatic Lustre service failover handled by the xt-lustre-proxy daemon.
xtlusfoevntsndr
  Sends Lustre failover imperative recovery events to start the Lustre client connection switch utility on login and compute nodes.

3.3.2 Identifying Metadata Server (MDS) and Object Storage Targets (OST)

Example 2. Identifying MDS and OSTs

Use the lfs check servers command to identify the OSTs and MDS for the file system. You must be the root user to use this command.

login: # lfs check servers

If there is more than one Lustre file system, the lfs check servers command does not necessarily sort the OSTs and MDSs by file system.

Example 3. Checking the status of individual nodes

You can check the status of individual OSTs with the lfs osts command:

login: # lfs osts

You can write a recursive script to check the status of all of the nodes.

3.3.3 Starting and Stopping Lustre

Procedure 4. Starting Lustre

Lustre file systems are started by CLE boot automation files. You can manually start Lustre file systems using the lustre_control.sh script. Start the Lustre file system on the service nodes:

1. Start the file system using lustre_control.sh:

boot:~ # cd /etc/opt/cray/lustre-utils/
boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs start

2. Mount the service node clients:

boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs mount_clients

3. Mount the compute node clients. If the appropriate /etc/fstab entries for your Lustre file system are present in the CNL boot image, then the compute nodes will mount Lustre automatically at boot. To manually mount Lustre on compute nodes that are already booted, use the following command:

boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs mount_clients -c

Procedure 5.
Stopping Lustre Lustre file systems are stopped by CLE system boot automation files. You can manually stop Lustre file systems using the lustre_control.sh script. 1. Unmount Lustre from the compute node clients: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs umount_clients -c 2. Unmount Lustre from the service node clients: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs umount_clients 3. Stop Lustre services: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs stop For more information, see the lustre_control(8) man page. 3.3.4 Adding an OST Procedure 6. Adding an Object Storage Server and Targets You can add new Object Storage Servers and Targets or add new targets to existing servers by performing the steps in the following procedure. 1. Unmount Lustre from the compute node clients: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs umount_clients -c 2. Unmount Lustre from the service node clients: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs umount_clients 3. Stop Lustre services: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs stop 4. Update the Lustre file system definition file, /etc/opt/cray/lustre-utils/filesystem.fs_defs on the boot node. Under Lustre NIDs add the Object Storage Server: LUSNID[26]="nid00026" 28 S–0010–40 Lustre File System Management [3] 5. Add Object Storage Targets: OSTDEV[3]="nid00026:/dev/disk/by-id/IDa" OSTDEV[4]="nid00026:/dev/disk/by-id/IDb" The new numbers in this sequence must follow the pre-existing target sequence numbers. 6. Generate a new Comma Separated Value (CSV) file using the changes you have put into filesystem.fs_defs. boot:/etc/opt/cray/lustre-utils # ./generate_config.sh filesystem.fs_defs 7. Format the new targets on the OSS. Enter the following: nid00026:~ # mkfs.lustre --fsname=filesystem --ost --mgsnode=12@gni /dev/disk/by-id/IDa Note: You may wish to add other file system options to this command, such as --mkfsoptions, --index, or --failnode. For more information, see the mkfs.lustre(8) man page. 8. Write and tune the Lustre configuration using the lustre_control.sh script. boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs write_conf 9. Start the Lustre file system: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs start 10. Mount Lustre on the service node clients: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs mount_clients 11. Check that the OST is among the active targets in the file system on the login node from the Lustre mount point: login:/lustre # lfs check servers filesystem-MDT0000-mdc-ffff8100f14d7c00 filesystem-OST0001-osc-ffff8100f14d7c00 filesystem-OST0002-osc-ffff8100f14d7c00 filesystem-OST0003-osc-ffff8100f14d7c00 filesystem-OST0004-osc-ffff8100f14d7c00 active. active. active. active. active. 12. You can now test that your new target is receiving I/O by writing a file to the Lustre file system from the login node. 
login:/lustre/mydirectory # lfs setstripe testfile -s 0 -c -1 -i -1 login:/lustre/mydirectory # dd if=/dev/zero of=testfile bs=10485760 count=1 1+0 records in 1+0 records out 10485760 bytes (10 MB) copied, 0.026317 seconds, 398 MB/s S–0010–40 29 Managing Lustre for the Cray Linux Environment™ (CLE) Then check that the file object is stored on your new target using lfs: login:/lustre/mydirectory # lfs getstripe testfile OBDS: 0: ost0_UUID ACTIVE 1: ost1_UUID ACTIVE 2: ost2_UUID ACTIVE 3: ost3_UUID ACTIVE 4: ost4_UUID ACTIVE testfile obdidx objid objid 4 1237766 0x12e306 3 564292 0x89c44 1 437047 0x6ab37 0 720254 0xafd7e 2 487517 0x7705d group 0 0 0 0 0 13. Mount Lustre on the compute node clients: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs mount_clients -c 3.3.5 Recovering From a Failed OST These procedures are helpful if you have an OST that has failed and is not recoverable by e2fsck. In this case, the individual OST can be reformatted and brought back into the file system. Before reformatting, you must deactivate the OST and identify and remove any striped files residing on it. Procedure 7. Deactivating a failed OST and removing striped files 1. Log in to the MDS and deactivate the failed OST in order to prevent further I/O operations on the failed device. nid00012:~ # lctl --device ostidx deactivate Note: The ostdix is displayed in the left column of the output generated by the lctl dl command. 2. Regenerate the list of Lustre devices and verify that the state for the deactivated OST is IN (inactive) and not UP. nid00008:~ # lctl dl 3. Identify the ostname for your OST by running the following command: login:~> lfs df UUID 1K-blocks Used Available Use% Mounted on lustre-MDT0000_UUID 358373232 1809780 336083452 0% /lus/nid00012[MDT:0] lustre-OST0000_UUID 2306956012 1471416476 718352736 63% /lus/nid00018[OST:0] lustre-OST0001_UUID 2306956012 1315772068 873988520 57% /lus/nid00018[OST:1] The ostname will be similar to fsname-OSTxxxx_UUID. 30 S–0010–40 Lustre File System Management [3] 4. Log in to a Lustre client, such as a login node, and search for files on the failed OST. login:~> lfs find /mnt/filesystem --print --obd ostname 5. Remove (unlink or rm) any striped files on the OST before reformatting. Procedure 8. Reformatting a single OST Refer to this procedure if you have a failed OST on a Lustre file system, for example, the OST is damaged and cannot be repaired by e2fsck. You can use this procedure for an OST that is available and accessible, but you should complete Procedure 7 on page 30 to generate list of affected files and unlink or remove them prior to completing the remaining steps. 1. Unmount Lustre from the compute node clients: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs umount_clients -c 2. Unmount Lustre from the service node clients: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs umount_clients 3. Stop Lustre services: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs stop 4. Reformat the OST from the OSS node that serves it. Use values from your file system definition file for the following options: filesystem is FSNAME, nid is the NID for MDSHOST, gni is NETTYPE, ostidx is the OSTDEV index for this OST, and ostdevice is the OSTDEV device name for this OST. Note: If you have any additional OST_MKFSOPTIONS in your filesystem.fs_defs file, append them to the -J size=400 value of --mkfsoptions in the following command. 
nid00018:~ # mkfs.lustre --reformat --ost --fsname=filesystem --mgsnode=nid@gni --index=ostidx \ --param sys.timeout=300 --mkfsoptions="-J size=400" ostdevice 5. Regenerate the Lustre configuration logs on the servers by invoking the following command from the boot node. boot:~ # /etc/opt/cray/lustre-utils/lustre_control.sh filesystem.fs_defs write_conf 6. On the MDS node, mount the MDT device as ldiskfs, and rename the lov_objid file. nid00012:~ # mount -t ldiskfs mdtdevice /mnt nid00012:~ # mv /mnt/lov_objid /mnt/lov_objid.old nid00012:~ # umount /mnt 7. Start Lustre on the servers. S–0010–40 31 Managing Lustre for the Cray Linux Environment™ (CLE) 8. Activate the newly reformatted OST on the MDS device. a. Generate a list of all the Lustre devices with the lctl dl command. Note the device index for the OST that was reformatted in the far left column. nid00012:~ # lctl dl b. Activate the OST using the index from the previous step as ostidx. nid00012:~ # lctl --device ostidx activate c. Regenerate the list of Lustre devices and verify that the state for the activated OST is UP and not IN. nid00012:~ # lctl dl 9. Mount Lustre on the clients. 3.3.6 Disabling OSS Read Cache and Writethrough Cache This section describes several commands you can use to disable and tune OSS server cache settings; if you choose to modify these settings on your system, the modification commands must be run on each of the OSS servers. Lustre uses the Linux page cache to provide read-only caching of data on object storage servers (OSS). This strategy reduces disk access time caused by repeated reads from an OST. OSS read cache is enabled by default, but you can disable it by setting /proc parameters. For example, invoke the following on the OSS: nid00018:~ # lctl set_param obdfilter.*.read_cache_enable 0 Writethrough cache can also be disabled. This prevents file writes from ending up in the read cache. To disable writethrough cache, invoke the following on the OSS: nid00018:~ # lctl set_param obdfilter.*.writethrough_cache_enable 0 As an alternative to disabling read cache and writethrough cache, there is a new parameter, readcache_max_filesize, that can be used to specify the maximum size file that will be cached on the OSS servers. To adjust this parameter from the default value (which is 0xffffffff), run the following command: nid00018:~ # lctl set_param obdfilter.*.readcache_max_filesize=10M This command sets the value to 10 MB, so that files larger than 10MB will not be cached on the OSS servers. 32 S–0010–40 Lustre File System Management [3] 3.3.7 Checking Lustre Disk Usage Example 4. Checking Lustre disk usage From a login node, type the following command: login:~> df -t lustre Filesystem 1K-blocks 12@gni:/lus0 2302839200 Used Available Use% Mounted on 1928332 2183932948 1% /lustre You can use the lfs df command to view free space on a per-OST basis. 
For example: login:~> lfs df UUID mds1_UUID ost0_UUID ost1_UUID ost2_UUID ost3_UUID ost4_UUID ost5_UUID ost6_UUID ost7_UUID ost8_UUID ost9_UUID ost10_UUID ost11_UUID ost12_UUID ost13_UUID ost14_UUID 1K-blocks 958719056 1061446736 1061446736 1061446736 1061446736 1061446736 1061446736 1061446736 1061446736 1061446736 1061446736 1061446736 1061446736 1061446736 1061446736 1061446736 filesystem summary: 15921701040 8738192028 7183509012 Used Available Use% Mounted on 57816156 900902900 6% /scratch[MDT:0] 570794228 490652508 53% /scratch[OST:0] 571656852 489789884 53% /scratch[OST:1] 604100184 457346552 56% /scratch[OST:2] 604444248 457002488 56% /scratch[OST:3] 588747532 472699204 55% /scratch[OST:4] 597193036 464253700 56% /scratch[OST:5] 575854840 485591896 54% /scratch[OST:6] 576749764 484696972 54% /scratch[OST:7] 582282984 479163752 54% /scratch[OST:8] 577588324 483858412 54% /scratch[OST:9] 571413316 490033420 53% /scratch[OST:10] 574388200 487058536 54% /scratch[OST:11] 593370792 468075944 55% /scratch[OST:12] 585151932 476294804 55% /scratch[OST:13] 564455796 496990940 53% /scratch[OST:14] 54% /scratch 3.3.8 Lustre User and Group Quotas Disk quotas provide system administrators with the ability to set the amount of disk space available to users and groups. Cray Lustre utilities allow you to easily enable user and group quotas on your system. For more information on administering quotas, see the Lustre Operations Manual. Procedure 9. Enabling Lustre user and group quotas Perform the following steps to enable quotas: 1. Unmount Lustre from the clients and stop all Lustre services. Refer to Procedure 5 on page 28. 2. Edit your filesystem.fs_defs file to add or change QUOTAOPTS values. boot:/etc/opt/cray/lustre-utils # vi filesystem.fs_defs S–0010–40 33 Managing Lustre for the Cray Linux Environment™ (CLE) 3. Set the value for QUOTAOPTS to "quotaon=u" for user quotas, "quotaon=g" for group quotas, or "quotaon=ug" for user and group quotas, as in the following example: QUOTAOPTS="quotaon=ug" 4. Issue the write_conf action of lustre_control.sh to regenerate your Lustre configuration. boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs write_conf Note: The write_conf action will reset any configuration parameters that were manually set with lctl. You will need to reapply any such parameters following a write_conf. 5. Start Lustre services and mount the clients. Refer to Procedure 4 on page 27. 3.3.9 Checking the Lustre File System Lustre makes use of a journaling file system for its underlying storage (OSTs and MDT). This journaling feature allows for automatic recovery of file system data following a system crash. While not normally required for recovery, the e2fsck command can be run on individual OSTs and the MDT to check integrity. An administrator should always run an e2fsck on a target associated with any file system failures before Lustre is restarted. In the case of a catastrophic failure, a special Lustre file system check utility, lfsck, is also provided. The lfsck command can be used to check coherency of the full Lustre file system. For more information, see the e2fsck(8) and lfsck(8) man pages. 3.4 Ensuring a File System Definition is Consistent with Cray System Configurations ! Caution: It is possible for device names to change upon reboot of your Cray Linux Environment (CLE) system. Host names and node identifiers (NIDs) are dynamically allocated in Cray systems running CLE and will not change otherwise. 
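A quick way to see how the kernel-assigned /dev/sd* names on a Lustre server map to stable identifiers is to list the udev-maintained links under /dev/disk/by-id on that server; the node, device, and ID shown here are illustrative only:

nid00018:~ # ls -l /dev/disk/by-id/ | grep sdc
lrwxrwxrwx 1 root root 9 May 31 19:10 scsi-3600a0b800026e1407000192e4b66eb97 -> ../../sdc

Whichever by-id name points at a given LUN is the name to record in the MDSDEV, MGSDEV, and OSTDEV entries of the file system definition file, as described below.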
Several options within lustre_control.sh allow the administrator to prepare for hardware and software upgrades, link failover, and other dynamics one may encounter that can render the Lustre file system unusable. It is possible that host names, NIDs, and/or device names of either Lustre servers or their storage targets will reflect a configuration different than what is found in the file system definition file. 34 S–0010–40 Lustre File System Management [3] SCSI device names (/dev/sd*) are not guaranteed to be numbered the same from boot to boot. This inconsistency can cause serious problems following a reboot; the Lustre configuration specified in the Lustre file system definition file may differ from actual device names, resulting in a failure to start the file system. Because of this behavior, Cray strongly recommends that you configure persistent device names for Lustre. Cray supports and tests the /dev/disk/by-id persistent device naming conventions. The by-id names typically include a portion of the device serial number in the name; for example /dev/disk/by-id/scsi-3600a0b800026e1407000192e4b66eb97. You can use a separate udev rule to create aliases for these devices. Lustre control utilities will update a file system definition file, ensuring that the device naming remains consistent. This is also useful in the event that you upgrade hardware or software where it may affect your system configuration. There are three lustre_control.sh options that you use to do this: verify_config, update_config, and dump_target_devnames. These options provide the infrastructure to update the Lustre configuration to ensure the file system retains integrity. 3.4.1 lustre_control.sh Options Used to Verify File System Configuration The lustre_control.sh command supports these file system definition options. verify_config Compares the MDSDEV, MGSDEV, and OSTDEV definitions in the file system definition file (filesystem.fs_defs) to the configured Lustre file system and reports any differences. If failover is configured, the contents of the filesystem.fs_defs file will also be verified to match the contents of the failover tables in the SDB. The failover configuration check will be skipped if AUTO_FAILOVER is set equal to no in the filesystem.fs_defs file. You should use the verify_config option each time you boot the system. update_config Creates a new version of the filesystem.fs_defs file with host and device names updated to match the current configuration but does nothing to the original filesystem.fs_defs file. The administrator uses update_config to match a file system definition with the actual names of the Lustre target objects (MDT and OSTs) when host and device names change due to changes in software and/or S–0010–40 35 Managing Lustre for the Cray Linux Environment™ (CLE) hardware configurations. update_config generates a new filesystem.fs_defs.xxxxx file. The updates are bracketed between lines beginning with the phrases: exit *** begin and exit *** end These lines must be removed before using the new filesystem.fs_defs file as a parameter to any lustre_control.sh operations. At a minimum, the administrator should review the changes. In some cases, the administrator must choose one of a list of possible changes. Once you decide which changes should be made, you should then run generate_config.sh using the new filesystem.fs_defs file to update the Lustre CSV file. 
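Once you have reviewed the generated file and resolved any choices by hand, the marker lines themselves can be removed with an editor or with a one-line sed command; the generated file name shown here is a placeholder:

boot:/etc/opt/cray/lustre-utils # sed -i '/^exit \*\*\* begin/d; /^exit \*\*\* end/d' filesystem.fs_defs.xxxxx

The added comment lines that record the original definitions should be removed at the same time.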
dump_target_devnames Prints the valid host/device name pairs for each Lustre target object that belongs to the file system identified by the FSNAME parameter in the filesystem.fs_defs file. For information about other lustre_control.sh options, see the lustre_control.sh(8) man page. 3.4.2 Updating the File System Definition File to Use Persistent Device Names Refer to this procedure if you are performing a hardware or software upgrade that may cause Lustre device name, host name, or NID changes. 36 S–0010–40 Lustre File System Management [3] Procedure 10. Updating the file system definition as part of a hardware or software upgrade 1. Stop the Lustre file system: boot:~ # cd /etc/opt/cray/lustre-utils/ boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs stop 2. Before upgrading any hardware or software, verify the configuration and correct any errors. Repeat these steps until there are no errors. a. Invoke the following command to verify the configuration: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs verify_config b. Edit filesystem.fs_defs and correct any errors identified in the previous step. 3. Perform the CLE software upgrade by following the instructions in Installing and Configuring Cray Linux Environment (CLE) Software. 4. Execute lustre_control.sh filesystem.fs_defs update_config to write a new file system definition file reflecting the changes in configuration. If the configuration has changed, the new file system definition file created by lustre_control.sh will show the changes by enclosing them with: exit *** begin and exit *** end The old configuration will be commented out. The following is an example of a new file system definition file with device name changes: exit *** begin 1 - REVIEW RECOMMENDED # Original definition: # OSTDEV[1]="nid00018:/dev/SDNEW nid00026:/dev/SDNEW" # Device name(s) updated. OSTDEV[1]="nid00018:/dev/disk/by-id/scsi-360001ff020021101061b48be11170500-part1 \ nid00026:/dev/disk/by-id/scsi-360001ff020021101061b48be11170500-part1" exit *** end 1 There are two options that you can place within the original filesystem.fs_defs file: DEVNAME_FORMAT and UPDATE_ALL_DEVNAMES. The DEVNAME_FORMAT option specifies the format of the device used in substitutions. The default is by-id but other supported values are name, by-label, by-uuid, and by-path. If UPDATE_ALL_DEVNAMES is set to false in the original file system definition file then update_config only substitutes the default device name for those devices whose names no longer match the ones in the original file. UPDATE_ALL_DEVNAMES is set to true by default. In this case, all device names will be replaced by the style specified with DEVNAME_FORMAT. S–0010–40 37 Managing Lustre for the Cray Linux Environment™ (CLE) 5. Review the new file system definition file for the changes made by the script. If they are correct, then remove the exit *** begin and exit *** end lines and any added comments. 6. Replace the original filesystem.fs_defs file with the new one generated in step 3. boot:/etc/opt/cray/lustre-utils # mv filesystem.fs_defs.xxxxx filesystem.fs_defs 7. (Optional) If you have configured your system for automatic Lustre failover, temporarily disable failover by setting AUTO_FAILOVER=off in the newly generated file system definition file. boot:/etc/opt/cray/lustre-utils # vi filesystem.fs_defs AUTO_FAILOVER=off 8. Verify the changes are correct by using the verify_config option. 
boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs verify_config 9. (Optional) If you have configured your system for automatic Lustre failover, re-enable failover. boot:/etc/opt/cray/lustre-utils # vi filesystem.fs_defs AUTO_FAILOVER=on 10. Update Lustre configuration using the new file system definition file: boot:/etc/opt/cray/lustre-utils # ./generate_config.sh filesystem.fs_defs boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs write_conf 11. Start the Lustre file system: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs start 38 S–0010–40 Lustre File System Management [3] 3.4.3 Examples Using the verify_config, update_config, and dump_target_devnames Options of the lustre_control.sh Script Example 5. MDT device name differs from the name specified in filesystem.fs_defs In this example, the administrator is using verify_config to check the current system configuration against the Lustre file system definition file for the file system named filesystem. The verify_config script returns an error for MDSDEV since the failover device specified in filesystem.fs_defs is not the device found for this host/device pair. boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs verify_config NID: 1 HOST: boot USER: root HOME: /root DATE: Tue May 31 19:19:08 CDT 2011 -------running pdsh -f 256 -S -R /tmp/tmp.cPNyH1Z8tE -w nid00004,nid00005,nid00026,nid00136, \ nid00137,nid00150 /opt/cray/lustre-utils/default/bin/lustre_service_utils.sh filesystem Verification starting. ERROR: MDSDEV device definition does not match configuration. Device name in \ 'nid00136:/dev/SDNEW' does not match the device labeled for this object \ (/dev/disk/by-id/scsi-3600a0b800050d88000001e3f4db80f89). ERROR: Bad failover configuration. No entry found in lustre_service table for \ MDSDEV: lst_prnid=4, lst_prdev='/dev/disk/by-id/scsi-3600a0b800050d88000001e3f4db80f89', \ lst_bknid=136, lst_bkdev='/dev/SDNEW' Verification FAILED. 2 configuration errors found in filesystem.fs_defs. [fail] output logged to /tmp/lustre_control.A0menq S–0010–40 39 Managing Lustre for the Cray Linux Environment™ (CLE) Example 6. Hostname/NID for OSS differs from that listed in filesystem.fs_defs In the following screen capture, OSTDEV[0] listed in filesystem.fs_defs is attached to a different node than originally listed. The administrator is using the update_config option of lustre_control.sh to generate a new file system definition file that resolves this disparity with the current system configuration. update_config cannot determine which node in the failover pair is used for primary or secondary services. Thus, the system administrator must make a choice between the new devices listed in the updated file system definition file. boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs update_config NID: 1 HOST: boot USER: root HOME: /root DATE: Tue May 31 19:41:30 CDT 2011 -------running pdsh -f 256 -S -R /tmp/tmp.sYxFnKYGSL -w c0-0c0s0n0,c0-0c0s2n0,c0-0c0s2n1, \ c0-0c0s2n2,c0-0c0s2n3,c0-0c0s0n2,c0-0c0s0n3,c1-0c1s0n0,c1-0c1s2n0,c1-0c1s2n1, \ c1-0c1s4n0,c1-0c1s4n1,c1-0c1s6n0,c1-0c1s6n1,c1-0c1s6n2,c1-0c1s6n3,c1-0c1s4n2, \ c1-0c1s4n3,c1-0c1s2n2,c1-0c1s2n3,c1-0c1s1n2,c1-0c1s1n3,c1-0c1s0n2 \ /opt/cray/lustre-utils/default/bin/lustre_service_utils.sh filesystem Update configuration starting. Configuration update completed. New fs_defs file: filesystem.fs_defs.Fq1Uw *** 2 changes in filesystem.fs_defs.Fq1Uw require editing or review. 
[pass] boot:/etc/opt/cray/lustre-utils # cat filesystem.fs_defs.Fq1Uw . . exit *** begin 2 - CHOICE REQUIRED # Original definition: # OSTDEV[0]="nid00154:/dev/sdc3 nid00154:/dev/sdc3" # Select one of the following to replace the original definition: # A) OSTDEV[0]="nid00005:/dev/disk/by-id/scsi-360001ff02002110106240e9e01860000 \ nid00150:/dev/disk/by-id/scsi-360001ff02002110106240e9e01860000" # B) OSTDEV[0]="nid00150:/dev/disk/by-id/scsi-360001ff02002110106240e9e01860000 \ nid00005:/dev/disk/by-id/scsi-360001ff02002110106240e9e01860000" exit *** end 2 Example 7. Using dump_target_devnames to list the host and device names for the current file system configuration dump_target_devnames will print a table of host and device name pairs for a given Lustre file system. boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs dump_target_devnames NID: 1 HOST: boot USER: root HOME: /root DATE: Tue May 31 19:51:11 CDT 2011 -------running pdsh -f 256 -S -R /tmp/tmp.7V7Va5uqEU -w c0-0c0s0n0,c0-0c0s2n0,c0-0c0s2n1, \ c0-0c0s2n2,c0-0c0s2n3,c0-0c0s0n2,c0-0c0s0n3,c1-0c1s0n0,c1-0c1s2n0,c1-0c1s2n1,c1-0c1s4n0, \ c1-0c1s4n1,c1-0c1s6n0,c1-0c1s6n1,c1-0c1s6n2,c1-0c1s6n3,c1-0c1s4n2,c1-0c1s4n3,c1-0c1s2n2, \ c1-0c1s2n3,c1-0c1s1n2,c1-0c1s1n3,c1-0c1s0n2 \ /opt/cray/lustre-utils/default/bin/lustre_service_utils.sh filesystem Configuration generated from device labels for file system filesystem MDSDEV Options: A) "nid00004:/dev/sdx" "nid00004:/dev/disk/by-label/filesystem-MDT0000" 40 S–0010–40 Lustre File System Management [3] "nid00004:/dev/disk/by-uuid/441a46a8-1045-433a-8783-35db103cd47c" "nid00004:/dev/disk/by-id/scsi-3600a0b800050d88000001e3f4db80f89" "nid00004:/dev/disk/by-path/pci-0000:02:00.0-fc-0x200500a0b850d8d5:0x0020000000000000" B) "nid00136:/dev/sdx" "nid00136:/dev/disk/by-label/filesystem-MDT0000" "nid00136:/dev/disk/by-uuid/441a46a8-1045-433a-8783-35db103cd47c" "nid00136:/dev/disk/by-id/scsi-3600a0b800050d88000001e3f4db80f89" "nid00136:/dev/disk/by-path/pci-0000:02:00.0-fc-0x200500a0b850d8d5:0x0020000000000000" MGSDEV Options: None configured. 
OSTDEV[0] Options: A) "nid00005:/dev/sda" "nid00005:/dev/disk/by-label/filesystem-OST0000" "nid00005:/dev/disk/by-uuid/faa462ea-292b-4344-9f7e-2bdf200ecf1f" "nid00005:/dev/disk/by-id/scsi-360001ff02002110106240e9e01860000" "nid00005:/dev/disk/by-path/pci-0000:02:00.0-fc-0x21000001ff030624:0x0000000000000000" B) "nid00150:/dev/sdf" "nid00150:/dev/disk/by-label/filesystem-OST0000" "nid00150:/dev/disk/by-uuid/faa462ea-292b-4344-9f7e-2bdf200ecf1f" "nid00150:/dev/disk/by-id/scsi-360001ff02002110106240e9e01860000" "nid00150:/dev/disk/by-path/pci-0000:02:00.1-fc-0x24000001ff030969:0x0001000000000000" OSTDEV[1] Options: A) "nid00137:/dev/sdf" "nid00137:/dev/disk/by-label/filesystem-OST0001" "nid00137:/dev/disk/by-uuid/d189ae7b-009a-4641-9e7a-323eeb419bba" "nid00137:/dev/disk/by-id/scsi-360001ff02002110106240f0901860100" "nid00137:/dev/disk/by-path/pci-0000:02:00.1-fc-0x23000001ff030969:0x0001000000000000" B) "nid00026:/dev/sda" "nid00026:/dev/disk/by-label/filesystem-OST0001" "nid00026:/dev/disk/by-uuid/d189ae7b-009a-4641-9e7a-323eeb419bba" "nid00026:/dev/disk/by-id/scsi-360001ff02002110106240f0901860100" "nid00026:/dev/disk/by-path/pci-0000:02:00.0-fc-0x22000001ff030624:0x0000000000000000" OSTDEV[2] Options: A) "nid00137:/dev/sdb" "nid00137:/dev/disk/by-label/filesystem-OST0002" "nid00137:/dev/disk/by-uuid/26893b14-7764-4df8-bcd7-13c85eeb29dc" "nid00137:/dev/disk/by-id/scsi-360001ff0200211010969116e01860500" "nid00137:/dev/disk/by-path/pci-0000:02:00.0-fc-0x23000001ff030624:0x0001000000000000" B) "nid00026:/dev/sde" "nid00026:/dev/disk/by-label/filesystem-OST0002" "nid00026:/dev/disk/by-uuid/26893b14-7764-4df8-bcd7-13c85eeb29dc" "nid00026:/dev/disk/by-id/scsi-360001ff0200211010969116e01860500" "nid00026:/dev/disk/by-path/pci-0000:02:00.1-fc-0x22000001ff030969:0x0000000000000000" OSTDEV[3] Options: A) "nid00005:/dev/sdf" "nid00005:/dev/disk/by-label/filesystem-OST0003" "nid00005:/dev/disk/by-uuid/b6ffd6f2-225d-4801-a5ff-4acf74d04d93" "nid00005:/dev/disk/by-id/scsi-360001ff020021101062410bb01860300" "nid00005:/dev/disk/by-path/pci-0000:02:00.1-fc-0x21000001ff030969:0x0001000000000000" S–0010–40 41 Managing Lustre for the Cray Linux Environment™ (CLE) B) "nid00150:/dev/sda" "nid00150:/dev/disk/by-label/filesystem-OST0003" "nid00150:/dev/disk/by-uuid/b6ffd6f2-225d-4801-a5ff-4acf74d04d93" "nid00150:/dev/disk/by-id/scsi-360001ff020021101062410bb01860300" "nid00150:/dev/disk/by-path/pci-0000:02:00.0-fc-0x24000001ff030624:0x0000000000000000" . . . [pass] boot:/etc/opt/cray/lustre-utils # 3.5 Troubleshooting The following sections help you troubleshoot some of the problems affecting your file system. 3.5.1 Dumping Lustre Log Files When Lustre encounters a problem, internal Lustre debug logs are generated on the MDS and OSS nodes in log files in the /tmp directory. These files can be dumped on both server and client nodes. Log files are named by a timestamp and PID, for example: /tmp/lustre-log-nid00135.1122323203.645. The xtdumpsys command does not collect this data automatically, and since the files reside in /tmp, they disappear on reboot. You should create a script to retrieve the dump files from all MDS and OST nodes and store them in the dump directory. The files can be collected by invoking the script at the end of the xtdumpsys_mid function in an xtdumpsys plugin file. You can also enable Lustre debug logging on compute node clients. 
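A minimal sketch of such a collection script is shown below; the node list, destination directory, and the assumption of root ssh access from the collecting node are all site-specific and would need to be adapted:

#!/bin/bash
# Hypothetical helper, suitable for calling at the end of the xtdumpsys_mid
# function in an xtdumpsys plugin; NODES and DEST are example values.
NODES="nid00012 nid00018 nid00026"          # MDS and OSS nodes
DEST="${DUMPDIR:-/tmp/lustre-debug-logs}"   # where to store the retrieved dumps
mkdir -p "$DEST"
for node in $NODES; do
    # Quoting the glob lets the remote shell expand it; nodes without dumps are skipped.
    scp "${node}:/tmp/lustre-log-*" "$DEST/" 2>/dev/null || true
done

Eviction-triggered debug dumps on the compute node clients themselves are controlled through a /proc setting.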
To do this you would execute the following command on the compute nodes: # echo 1 > /proc/sys/lustre/dump_on_eviction You have to collect these logs before shutdown as they will also disappear on reboot. 3.5.2 Lustre Users Report ENOSPC Errors When any of the OSTs that make up a Lustre file system become filled, subsequent writes to the OST may fail with an ENOSPC error (errno 28). Lustre reports that the file system is full, even though there is space on other OSTs. This can be confusing to users, as the df command may report that the file system has free space available. Although new files will not be created on a full OST, write requests to existing files will fail if the write would extend the file on the full OST. Use the lfs setstripe command to place files on a specific range of OSTs to avoid this problem. You can also check the disk usage on individual OSTs by using the lfs df command. 42 S–0010–40 Lustre File System Management [3] 3.5.3 File System Error Messages Lustre errors are normally reported in both the syslog messages file and in the Cray system console log. If you see errors such as: Found inode with zero generation or link Free block count wrong Free inode count wrong Run e2fsck to ensure that the ldiskfs file structure is intact. S–0010–40 43 Managing Lustre for the Cray Linux Environment™ (CLE) 44 S–0010–40 Lustre Failover [4] 4.1 Lustre Failover on Cray Systems Failover is generally defined as a service that switches to a standby or secondary server when the primary system fails or the service is temporarily shutdown for maintenance. You can configure Lustre to automatically failover (see Lustre Automatic Failover on page 50), or you can set up failover to be done manually as described in Lustre Manual Failover on page 46. Note: To support Lustre failover, each LUN (Logical Unit) must be visible to two Lustre service nodes. For more information about setting up the hardware storage configuration for Lustre failover, contact your Cray service representative. 4.1.1 Node Types for Failover The Lustre failover configuration requires two nodes (a failover pair) that must be connected to a shared storage device. You can configure the nodes in two ways—active/active or active/passive. An active node actively serves data and a passive node idly stands by to take over in the event of a failure. Active/passive: In this configuration, only one node is actively serving data all the time. The other node takes over in the case of failure. Failover for the Lustre metadata server (MDS) is configured this way on a Cray system. For example, the active node is configured with the primary MDS, and the service is started when the node is up. The passive node is configured with the backup or secondary MDS service, but it is not started when the node is up so that the node is idling. If an active MDS fails, the secondary service on the passive node is started and takes over the service. S–0010–40 45 Managing Lustre for the Cray Linux Environment™ (CLE) Active/active: In this configuration, both nodes actively serve data all the time on different storage devices. In the case of a failure, one node takes over for the other, serving both storage devices. Failover for the Lustre object storage target (OST) is configured this way on a Cray system. For example, one active node is configured with primary OST services ost1 and ost3 and secondary services ost2 and ost4. The other active node is configured with primary OST services ost2 and ost4 and secondary services ost1 and ost3. 
In this case, both nodes are only serving their primary targets and failover services are not active. If an active node fails, the secondary services on the other active node are started and start serving the affected targets. Note: A separate management server (MGS) is not supported with automatic fail over. ! Caution: The physical storage must be protected from simultaneous access by two active nodes across the same partition to prevent severe data corruption. For proper protection, the failed server must be completely powered off or disconnected from the shared storage device. 4.2 Lustre Manual Failover This section summarizes how to configure Lustre for manual failover and how to failover Lustre nodes manually in the event of a failure. 4.2.1 Configuring Manual Lustre Failover Configure Lustre for manual failover using Cray Lustre control utilities to interface with the standard Lustre configuration system. For additional information about configuring Lustre for Cray systems, see Installing and Configuring Cray Linux Environment (CLE) Software. Procedure 11. Configuring the MDS for manual failover MDS failover is configured as active/passive type. The examples in this procedure configure nid00012 as the primary MDS and nid00134 as the passive failover MDS and use the following site specific parameters from the filesystem.fs_defs file: FSNAME=lus0, NETTYPE=gni and MOUNT_POINT=/mylusmnt/filesystem. For more information on the Lustre file system definition file, see the lustre.fs_defs(5) man page. 1. Edit the Lustre file system definition file (filesystem.fs_defs) for each Lustre file system to define MDSDEV for failover. # The MDS MDSHOST="nid00012" #MDSDEV="${MDSHOST}:/dev/sda1" # Failover example MDSDEV="${MDSHOST}:/dev/disk/by-id/IDa nid00134:/dev/disk/by-id/IDa" 46 S–0010–40 Lustre Failover [4] 2. After making these changes to the Lustre file system definition file, regenerate the Lustre configuration files using the generate_config.sh command. boot:/etc/opt/cray/lustre-utils # ./generate_config.sh filesystem.fs_defs 3. Save the configuration information using the write_conf action of lustre_control.sh: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs write_conf 4. Restart the Lustre file system: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs start Note: After system startup, nid00012 is the active node and has the primary MDS service up to service Lustre metadata. nid00134 is the passive node. 5. Use the mount command with the following syntax to specify both primary and secondary MDS servers: mdsnid1:mdsnid2:/FSNAME where mdsnid1:mdsnid2 are the NIDs for the active and passive MDSs and /FSNAME is the name of the Lustre file system. For example, type the following command for every node listed under MOUNTERS: login:~# mount -t lustre 12@gni:134@gni:/lus0 /mylusmnt/filesystem 6. Edit the compute node fstab entry to use the failover syntax to specify both primary and secondary MDS servers. smw:~# vi /opt/xt-images/templates/default/etc/fstab For each file system you have configured, include a line similar to the following example. 12@gni:134@gni:/lus0 /mylusmnt/filesystem lustre rw,flock 0 0 7. After making these changes, update the boot image. For more information about creating boot images, see Managing System Software for Cray XE and Cray XT Systems. S–0010–40 47 Managing Lustre for the Cray Linux Environment™ (CLE) Procedure 12. 
Configuring secondary OSTs for failover Figure 2 below shows a typical OST failover setup with an active/active node configuration. ost0 and ost2 are the primary and active OSTs on OSS 0; ost1 and ost3 are the primary and active OSTs on OSS 1. ost1 and ost3 are the secondary OSTs on nid00018; ost0 and ost2 are the secondary OSTs on nid00026. For example, if nid00018 were to fail, ost0 and ost2 would be served by nid00026. Figure 2. Lustre Active/Active Failover Example ost0 Primary Backup /dev/disk/by-id/IDa ost1 /dev/disk/by-id/IDb OSS 0 OSS 1 ost2 /dev/disk/by-id/IDc nid00018 ost3 nid00026 /dev/disk/by-id/IDd • Edit the filesystem.fs_defs for each Lustre file system and define the OSTDEV table to create two OSSs with two primary and two backup or secondary OSTs on each one. For example: # Table of OSTS # the "index" of the array is the OST number # the value is "hostname:device" where hostname can be ${LUSNID[IDX]} if LUSNID is defined # for failover, specify multiple host:device for an OST OSTDEV[0]="nid00018:/dev/disk/by-id/IDa nid00026:/dev/disk/by-id/IDa" OSTDEV[1]="nid00026:/dev/disk/by-id/IDb nid00018:/dev/disk/by-id/IDb" OSTDEV[2]="nid00018:/dev/disk/by-id/IDc nid00026:/dev/disk/by-id/IDc" OSTDEV[3]="nid00026:/dev/disk/by-id/IDd nid00018:/dev/disk/by-id/IDd" 48 S–0010–40 Lustre Failover [4] Procedure 13. Disabling automatic failover To enable manual failover, you must also disable automatic failover in the Lustre file system definition file. Note: You should unmount the clients and stop the Lustre file system before changing the file system definition file. 1. Set AUTO_FAILOVER equal to no in filesystem.fs_defs. 2. After making these changes to the Lustre file system definition file, regenerate the Lustre configuration files using the generate_config.sh command: boot:/etc/opt/cray/lustre-utils # ./generate_config.sh filesystem.fs_defs 3. Save the configuration information using the write_conf option of lustre_control.sh: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs write_conf 4. Start the Lustre file system: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs start 5. For more information on using these commands, see the lustre.fs_defs(5), generate_config.sh(8) and lustre_control.sh(8) man pages. 4.2.2 Performing Manual Failover If a node fails or an OST is not functional, and you have set up your system for failover, perform the following steps to initiate Lustre manual failover. Procedure 14. Performing Lustre manual failover 1. Power off the failing node. Issue the xtcli command on the SMW to power off the node. For example: smw:~ # xtcli power down -f c0-0c0s6n2 The -f option is required to make sure that the node is really powered off. Warning: Prior to starting secondary OST services you must ensure that the primary node is down, i.e., not running or without power. 2. Start the secondary OST services on the failover node. For example, create the Lustre target directory and start the secondary OST services on nid00026: nid00026:~ nid00026:~ nid00026:~ nid00026:~ S–0010–40 # # # # mkdir mkdir mount mount -p -p -t -t /tmp/lustre/lus0/ost0 /tmp/lustre/lus0/ost2 lustre /dev/disk/by-id/IDa /tmp/lustre/lus0/ost0 lustre /dev/disk/by-id/IDc /tmp/lustre/lus0/ost2 49 Managing Lustre for the Cray Linux Environment™ (CLE) The recovery process may take several minutes, depending on the number of clients. Attempting warm boots during Lustre recovery is not advisable. 
The recovery status is recorded in the following /proc entries: for OSS: /proc/fs/lustre/obdfilter/lus0-OST0000/recovery_status /proc/fs/lustre/obdfilter/lus0-OST0002/recovery_status for MDS: /proc/fs/lustre/mds/lus0-MDT0000/recovery_status To monitor the status, see Monitoring Manual Failover on page 50. 3. After recovery, type df or lfs df on a login node. Applications that use Lustre on login nodes should be able to continue to check if all services are working properly. If there are a large number of clients doing Lustre I/O at the time the failure occurs, the recovery time may become very long. In this situation, it might be practical to reboot the system to start over. 4.2.3 Monitoring Manual Failover On the secondary OSS, there is a directory containing a file named recovery_status for each managed OST in /proc/fs/lustre/obdfilter. Use cat or grep to view the file; the display shows if an OST is being recovered and counts down the remaining seconds. Upon completion, the status changes from RECOVERING to COMPLETE. Example 8. OST recovery after failover nid00026:/etc/opt/cray/lustre-utils # egrep 'status|time' \ /proc/fs/lustre/obdfilter/lus0-OST000[02]/recovery_status /proc/fs/lustre/obdfilter/lus0-OST0000/recovery_status:status: RECOVERING /proc/fs/lustre/obdfilter/lus0-OST0000/recovery_status:time remaining: 108 /proc/fs/lustre/obdfilter/lus0-OST0002/recovery_status:status: RECOVERING /proc/fs/lustre/obdfilter/lus0-OST0002/recovery_status:time remaining: 187 nid00026:/etc/opt/cray/lustre-utils # 4.3 Lustre Automatic Failover This section describes the framework and utilities that enable Lustre services to failover automatically in the event that the primary Lustre services fail. 50 S–0010–40 Lustre Failover [4] The automatic Lustre failover framework includes the xt-lustre-proxy process, the service database, a set of database utilities and the lustre_control.sh script. The Lustre configuration and failover states are kept in the service database (SDB). Lustre database utilities and the xt-lustre-proxy process are used in conjunction with lustre_control.sh for Lustre startup and shutdown and for failover management. The xt-lustre-proxy process is responsible for automatic Lustre failover in the event of a Lustre service failure. To enable automatic failover for a Lustre file system, set AUTO_FAILOVER=yes in the file system definition file. If automatic failover is enabled, lustre_control.sh starts an xt-lustre-proxy process on each MDS and OSS. It then monitors the health of MDS and OSS services through the Hardware Supervisory System (HSS) system. If there is a node-failure or service-failure event, HSS notifies the xt-lustre-proxy process on the secondary node to start up the backup services. The primary and secondary configuration is specified in the filesystem.fs_defs. The failover configuration is stored in the SDB for use by xt-lustre-proxy. To avoid both primary and secondary services running at the same time, the xt-lustre-proxy service on the secondary node issues a node reset command to shut down the primary node before starting the secondary services. The proxy also marks the primary node as dead in the SDB so that if the primary node is rebooted while the secondary system is still running, xt-lustre-proxy will not start on the primary node. However, this state can be queried to prevent re-starting services on the primary node and is used by the lustre_control.sh script to determine when to start Lustre services. 
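For example, the failover state table can be inspected at any time from a service node with the query option of xtlusfoadmin, which is described in more detail later in this chapter; the node IDs and timestamps shown here are illustrative:

nid00026:~ # xtlusfoadmin -q o
sdb lustre_failover table
PRNID  BKNID  STATE           INIT_STATE      TIME
12     134    lus_state_up    lus_state_down  2011-05-31 19:19:08
18     26     lus_state_up    lus_state_down  2011-05-31 19:19:08
26     18     lus_state_up    lus_state_down  2011-05-31 19:19:08

A primary node whose services have failed over to its backup is reported with a STATE of lus_state_dead until an administrator resets it.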
When Lustre automatic failover is configured, the lustre_control.sh utility starts and stops the xt-lustre-proxy daemon each time Lustre services are started and stopped. lustre_control.sh uses the configuration information in the filesystem.fs_defs file to start the xt-lustre-proxy daemon with options appropriate for your configuration. Typically, xt-lustre-proxy is not used directly by system administrators. Important: If you have multiple Lustre file systems, you must follow additional procedures in Configuring Lustre Failover for Multiple File Systems on page 58. You can disable services to prevent some MDS or OST services from participating in automatic failover. See the xtlusfoadmin(8) man page and Using the xtlusfoadmin Command on page 55 for more information on enabling and disabling Lustre services. The status of Lustre automatic failover is recorded in syslog messages. S–0010–40 51 Managing Lustre for the Cray Linux Environment™ (CLE) 4.3.1 Lustre Automatic Failover Database Tables Three SDB tables are used by the Lustre failover framework to determine failover processes. generate_config.sh and lustre_control.sh create and populate the filesystem, lustre_service, and lustre_failover database tables as described in the following sections. generate_config.sh produces three Comma Separated Value (CSV) files containing failover configuration in a format for use by lustre_control.sh to populate the SDB tables. 4.3.1.1 The filesystem Database Table The filesystem table stores information about file systems. The fields in the filesystem table are shown in Table 2. For more information, see the xtfilesys2db(8) and xtdb2filesys(8) man pages. Table 2. filesystem SDB Table Fields Database Table Field Description fs_fsidx File system index. Each file system is given a unique index number based on the time of day. fs_name Character string of the internal file system name; should match the value of FSNAME as defined in the filesystem.fs_defs file. fs_type File system type. Valid values are fs_lustre or fs_other. The fs_other value is currently not used. fs_active File system status snapshot. Valid values are fs_active or fs_inactive. fs_time Timestamp when any field gets updated. Format is 'yyyy-mm-dd hh:mm:ss'. fs_conf Reserved for future use. Specify as a null value using ''. 4.3.1.2 The lustre_service Database Table The lustre_service table stores the information about Lustre services. The fields in the lustre_service table are shown in Table 3. For more information, see the xtlustreserv2db(8) and xtdb2lustreserv(8) man pages. 52 S–0010–40 Lustre Failover [4] Table 3. lustre_service SDB Table Fields Database Table Field Description lst_srvnam Name of the Lustre MDS or OST services; generated by generate_config.sh. lst_srvidx Service index. For an MDS, use a value of 0. For OSTs, use the index of the table OSTDEV[index] as defined in the filesystem.fs_defs file. Format is an integer number. lst_fsidx File system index. Format is a character string. lst_prnid Node ID (NID) of the primary node for this service. Format is an integer value. lst_prdev Primary device name, such as /dev/disk/by-id/IDa, for MDT or OST. Format is a character string. lst_bknid NID of the backup or secondary node for this service. Format is an integer value. lst_bkdev Backup or secondary device name, such as /dev/disk/by-id/IDa, for MDT or OST. Format is a character string. lst_ltype Lustre service type. Valid values are lus_type_mds or lus_type_ost. lst_failover Enables failover. 
Valid values are lus_fo_enable to enable the failover process and lus_fo_disable to disable the failover process. lst_time Timestamp when any field gets updated. 4.3.1.3 The lustre_failover Database Table The lustre_failover table maintains the Lustre failover states. The fields in the lustre_failover table are shown in Table 4. The generate_config.sh script creates the filesystem.lustre_failover.csv data file. This file is used by lustre_control.sh to populate the lustre_failover database table. You can change this name in the Lustre file system definition file, filesystem.fs_defs, by modifying LUSTRE_FAILOVER_DATA. For more information, see the xtlustrefailover2db(8), xtdb2lustrefailover(8) and lustre_control.sh(5) man pages. S–0010–40 53 Managing Lustre for the Cray Linux Environment™ (CLE) Table 4. lustre_failover SDB Table Fields Database Table Field Description lft_prnid NID for primary node. lft_bknid NID for backup or secondary node. A value of -1 (displays 4294967295) indicates there is no backup node for the primary node. lft_state Current state for the primary node. Valid states are lus_state_down, lus_state_up or lus_state_dead. The lus_state_dead state indicates that Lustre services on the node have failed and the services are now running on the secondary node. The services on this node should not be started until the state is changed to lus_state_down by a system administrator. lft_init_state Initial primary node state at the system boot. The state here will be copied to lft_state during system boot. Valid states are lus_state_down or lus_state_dead. For normal operations, set the state to lus_state_down. If Lustre services on this node should not be brought up, set the state to lus_state_dead. lft_time Timestamp when any field gets updated. 4.3.2 Backing Up SDB Table Content The following set of utilities can be used to dump the database entries to a data file. ! Caution: By default, these utilities will create database-formatted files named lustre_failover, lustre_serv, and filesys in the current working directory. Use the -f option to override default names. Table 5. Lustre Automatic Failover SDB Table Dump Utilities 54 Command Description xtdb2lustrefailover Dumps the lustre_failover table in the SDB to the lustre_failover data file. xtdb2lustreserv Dumps the lustre_service table in the SDB to the lustre_serv data file. xtdb2filesys Dumps the filesystem table in the SDB to the filesys data file. S–0010–40 Lustre Failover [4] 4.3.3 Using the xtlusfoadmin Command The xtlusfoadmin command can be used to modify or display fields of a given automatic Lustre failover database table. When it is used to make changes to database fields, failover operation is impacted accordingly. For example, xtlusfoadmin is used to set file systems active or inactive or to enable or disable the Lustre failover process for Lustre services. For more information, see the xtlusfoadmin(8) man page. Use the query option (--query or -q) of the xtlusfoadmin command to display the fields of a database table. For example: xtlusfoamdin xtlusfoamdin xtlusfoamdin xtlusfoamdin -q -q -q -q o s f a # # # # display display display display lustre_failover table lustre_service table filesystem table all three tables Use the following commands to modify fields in the database and impact failover operation. • To either enable or disable the failover process for the whole file system, use the activate (--activate_fs) or deactivate (--deactivate_fs) options with the xtlusfoadmin command. 
These options set the value of the fs_active field in the filesystem table to either fs_active or fs_inactive. For example: xtlusfoadmin -a fs_index xtlusfoadmin -d fs_index # activate # deactivate This needs to be set before the xt-lustre-proxy process starts. If set while the proxy is running, xt-lustre-proxy needs to be restarted in order to pick up the change. You should ensure that you gracefully shutdown xt-lustre-proxy before restart. You can potentially trigger a failover if it there is not a graceful shutdown. A graceful shutdown is a successful completion of the lustre_control.sh script with the stop action. • To enable or disable the failover process for a Lustre service on a specific node, use the --enable_fo_by_nid (-e) or --disable_fo_by_nid (-f) options. For example: xtlusfoadmin -e nid xtlusfoadmin -f nid • # enable # disable To enable or disable the failover process for a Lustre service by name, use the enable (--enable_fo) or disable (--disable_fo) options. These options set the value of the lst_failover field in the lustre_service table to either lus_fo_enable or lus_fo_disable. For example: xtlusfoadmin -j fs_index service_name xtlusfoadmin -k fs_index service_name S–0010–40 # enable # disable 55 Managing Lustre for the Cray Linux Environment™ (CLE) • To change the initial state of a service node, use the init_state option. This option sets the value of the lft_init_state field the in the lustre_failover table to either lus_state_down or lus_state_dead. For example: xtlusfoadmin -i nid n xtlusfoadmin -i nid d # down # dead By setting a node as dead, Lustre services should not be started on that node after a reboot. • To reinitialize the current state of a service node, use the set_state option. This option would most commonly be used during failback, following a failover. Use the set_state option to change the state of a primary node from dead to down in order to failback to the primary node. This option sets the value of the lft_state field in the lustre_failover table to either lus_state_down or lus_state_dead. xtlusfoadmin -s nid n xtlusfoadmin -s nid d # down # dead 4.3.4 System Startup and Shutdown when Using Automatic Lustre Failover Use the lustre_control.sh script to start Lustre services. lustre_control.sh starts both Lustre services and launches xt-lustre-proxy. The following failover database information will impact startup operations as indicated. • Service failover enable/disable. In the event that there is a failure, the failover disabled service does not trigger a failover process. If any services for the node have failover enabled, the failure of the service triggers the failover process. To prevent a failover process from occurring for an MDS or OSS, all the services on that node need to be disabled. Use the xtlusfoadmin command to disable a service. For example: xtlusfoadmin --disable_fo fs_index service_name To disable all services on a node, type the following command: xtlusfoadmin --disable_fo_by_nid nid • 56 Initial state. At system startup, the current state (lft_state) of each primary MDS and OSS node is changed to the initial state (lft_init_state), which is usually lft_state_down. S–0010–40 Lustre Failover [4] • Current state following an automatic failover. When you fail back the primary services from the secondary node after automatic failover, the primary node's state will be lft_state_dead and needs to be re-initialized. The xt-lustre-proxy process will need the node to be in the 'down' state to start. 
Use the xtlusfoadmin command to change the current state of a node to lft_state_down. For example: xtlusfoadmin --set_state nid n Procedure 15. Lustre startup procedures for automatic failover 1. Log on to the boot node. 2. Start Lustre services and xt-lustre-proxy. Type the following commands for each Lustre file system you have configured. boot:~ # cd /etc/opt/cray/lustre-utils boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs start boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs mount_clients 3. (Optional) To mount the compute node Lustre clients at this time, enter the following command: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs mount_clients -c 4. Exit from the boot node. boot:/etc/opt/cray/lustre-utils # exit Procedure 16. Lustre shutdown procedures for automatic failover If the steps in this procedure are incorporated into local scripts, ensure that these steps are always performed together. ! Caution: The lustre_control.sh stop command gracefully shuts down the xt-lustre-proxy process, while issuing SIGTERM will also work for a graceful shutdown. Any other method of termination, such as sending a SIGKILL signal, triggers the failover process and results in a failure event delivered to the secondary node. The secondary node then issues a node reset command to shut down the primary node and starts Lustre services. 1. Log on to the boot node. 2. Unmount Lustre from the compute nodes. boot:~ # cd /etc/opt/cray/lustre-utils boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs umount_clients -c 3. Unmount Lustre from the login nodes. boot:~ # cd /etc/opt/cray/lustre-utils boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs umount_clients S–0010–40 57 Managing Lustre for the Cray Linux Environment™ (CLE) 4. Stop xt-lustre-proxy Lustre services. Type the following command for each Lustre file system you have configured. boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh filesystem.fs_defs stop 5. Exit from the boot node. boot:/etc/opt/cray/lustre-utils # exit 4.3.5 Configuring Lustre Failover for Multiple File Systems In order to support automatic Lustre failover for multiple file systems, you must resolve the following limitations: • The lustre_control.sh utility loads the failover tables for the last file system and overwrites the previous tables when executing a reformat or write_conf. Follow Procedure 17 on page 58 to resolve this issue. • The lustre_control.sh stop option terminates the xt-lustre-proxy process for the specified file system. When multiple file systems share Lustre servers, the xt-lustre-proxy process will be terminated even though other file systems are still active. If you are only shutting down a single file system, you must restart xt-lustre-proxy on the shared servers for file systems that are still active. Follow Procedure 18 on page 59 to resolve this issue. Procedure 17. Combining failover tables for multiple file systems You must follow this procedure each time .csv files change and are regenerated (by using generate_config.sh) and before you invoke lustre_control.sh with reformat or write_conf options. 1. As root on the boot node, combine the failover *.csv files. 
boot:~ # cd /etc/opt/cray/lustre-utils boot:/etc/opt/cray/lustre-utils # cat myfs1.filesys.csv \ myfs2.filesys.csv > all.filesys.csv boot:/etc/opt/cray/lustre-utils # cat myfs1.lustre_serv.csv \ myfs2.lustre_serv.csv > all.lustre_serv.csv boot:/etc/opt/cray/lustre-utils # cat myfs1.lustre_failover.csv \ myfs2.lustre_failover.csv > all.lustre_failover.csv Note: You do not need to combine *.config.csv files. 2. Load the combined *.csv files to the SDB. boot:/etc/opt/cray/lustre-utils boot:/etc/opt/cray/lustre-utils boot:/etc/opt/cray/lustre-utils boot:/etc/opt/cray/lustre-utils # # # # xtfilesys2db -f all.filesys.csv xtlustreserv2db -f all.lustre_serv.csv xtlustrefailover2db -f all.lustre_failover.csv xtlusfoadmin -q a 3. Restart all Lustre file systems so that the xt-lustre-proxy daemon uses the new combined tables. Follow Starting and Stopping Lustre on page 27 for each file system. 58 S–0010–40 Lustre Failover [4] Procedure 18. Shutting down a single file system in a multiple file system configuration You do not need to follow this procedure to shut down all of your Lustre file systems; it is only needed to shut down a single file system while leaving other file systems active. After stopping Lustre on one file system, restart xt-lustre-proxy on the shared Lustre servers. Lustre services are still active for the file systems you did not stop, however, the xt-lustre-proxy daemon on the shared servers is terminated when you shut down a file system. In this example we will shut down myfs2 1. Unmount Lustre from the compute node clients: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh myfs2.fs_defs umount_clients -c 2. Unmount Lustre from the service node clients: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh myfs2.fs_defs umount_clients 3. Stop Lustre services: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh myfs2.fs_defs stop 4. Restart xt-lustre-proxy on the shared Lustre servers by using lustre_control.sh. The remaining active Lustre services are not affected when xt-lustre-proxy is started. boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh myfs1.fs_defs start shared_server1,shared_server_2,.. Or boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh myfs1.fs_defs start Note: Because the MGS is still active, this script generates errors similar to the following. You may safely ignore these messages. nid00024: Failed to mount /dev/disk/by-id/scsi-360001ff020021 101061ad7c811170300-part2 on host nid00024: 17 ./lustre_config: ERROR: failed to start MGS 4.3.6 Imperative Recovery Lustre imperative recovery is a Cray Lustre feature provided with CLE that works automatically in the background to facilitate faster recovery in the event of a Lustre failover. During Lustre failover and recovery, imperative recovery utilities (if enabled) notify Lustre clients to switch server connections immediately rather than waiting for connections to the primary service to timeout. In the event of a primary server failure, the xt-lustre-proxy daemon mounts the backup or secondary services and invokes the xtlusfoevntsndr utility to notify clients that the secondary services are available. S–0010–40 59 Managing Lustre for the Cray Linux Environment™ (CLE) Imperative recovery uses Version Based Recovery (VBR) to replay transactions on the secondary servers during the failover recovery process. By using VBR along with client notification, imperative recovery can significantly improve recovery time for Lustre failover. Imperative recovery functionality is disabled by default. 
You enable it by setting the ENABLE_IMP_REC parameter to yes in the filesystem.fs_defs file system definition file. You can configure the window for recovery timeout by using the RECOVERY_TIME_HARD and RECOVERY_TIME_SOFT parameters in the filesystem.fs_defs file. With imperative recovery enabled, clients are allowed RECOVERY_TIME_SOFT seconds to connect to the secondary server during a recovery; by default, 300 seconds. If the server is still handling new connections from recoverable clients, this timeout incrementally extends as it is about to expire. The server extends its timeout up to a hard maximum of RECOVERY_TIME_HARD seconds; by default, 900 seconds. RECOVERY_TIME_SOFT must be less than or equal to RECOVERY_TIME_HARD. For additional information, see the xtlusfoevntsndr(8) and xt-lustre-proxy(8) man pages. 60 S–0010–40 Lustre Failback [5] Failback is the process that gracefully shuts down Lustre secondary services and restarts the primary services for a failover pair of either Metadata Servers (MDS) or Object Storage Servers (OSS). In failover, the system determines that it cannot connect to a primary MDS or Object Storage Targets (OSTs) served by an OSS. The system (or the user in the case of manual failover) then starts the secondary services for the targets residing on the affected server. The failover node continues to provide primary services for its own targets. When the primary node is functioning again, you must complete the failback process manually. The steps are slightly different for automatic failover and manual failover. The failback process for automatic failover involves the extra step of updating the Service Database (SDB). You must also re-instantiate the xt-lustre-proxy process on the recovered node. 5.1 Lustre Failback In order to use Lustre failback you should do the following: 1. Shut down failover services on the secondary node. 2. Reset the failover state in the lustre_failover table using the xtlusfoadmin command. This command changes the value from lus_state_dead to lus_state_down for the primary node. This is only necessary when you configure Lustre for automatic failover. 3. Reboot the primary service node. 4. Start Lustre services on the primary node with recovery using the lustre_control.sh script. 5.1.1 Failback in Manual and Automatic Failover Procedure 19. Performing failback In this procedure, nid00018 (ost0 - /dev/disk/by-id/IDa, ost2 - /dev/disk/by-id/IDc) and nid00026 (ost1 /dev/disk/by-id/IDb, ost3 - /dev/disk/by-id/IDd) are failover pairs. nid00018 failed and nid00026 is serving both the primary and backup OSTs. After these steps are completed, ost0 and ost2 should failback to nid00018. S–0010–40 61 Managing Lustre for the Cray Linux Environment™ (CLE) 1. Unmount the secondary OSTs from the remaining live OSS in the failover pair. In this case, ost0 and ost2 are the secondary OSTs and nid00026 is the remaining live OSS. nid00026:~ # umount /tmp/lustre/filesystem/ost0 nid00026:~ # umount /tmp/lustre/filesystem/ost2 It is acceptable if there are some messages indicating the node is unable to unload some Lustre modules. This is because they are still in use by the primary OSTs belonging to nid00026. The umount commands have to finish successfully in order to proceed. 2. Verify that ost0 and ost2 are no longer showing up in the device list. 
When you enter this command, the following message indicates the OSTs used: nid00026:~ # lctl dl 0 UP mgc MGC12@gni 59f5af70-8926-62b7-3c3e-180ef1a6d48e 5 1 UP ost OSS OSS_uuid 3 2 UP obdfilter mds1-OST0001 mds1-OST0001_UUID 9 5 UP obdfilter mds1-OST0003 mds1-OST0003_UUID 3 3. For an automatic failover, reset the primary node's state in the SDB. There is no need to do this in manual failover since the SDB was not involved. During a failover, the failed node was set to the lus_state_dead state in the lustre_failover table which prevents xt-lustre-proxy from executing upon reboot of the failed node. You must reset the failed node's state to the initial lus_state_down state. The following displays the current and initial states for the primary node. In this example nid00018 has failed and nid00026 now provides services for its targets: nid00026:~ # xtlusfoadmin sdb lustre_failover table PRNID BKNID STATE 12 134 lus_state_up 18 26 lus_state_dead 26 18 lus_state_up INIT_STATE lus_state_down lus_state_down lus_state_down TIME 2008-01-16 14:32:46 2008-01-16 14:37:17 2008-01-16 14:31:32 4. Reset the state using the following command: nid00026:~ # xtlusfoadmin -s 18 n lft_state in lustre_failover table was updated to lus_state_down for nid 18 Here the command option -s 18 n sets the state for the node with nid 18 to n (lus_state_down). For more information, see the xtlusfoadmin(8) man page. 5. Run xtlusfoadmin to verify that you have changed the state: nid00026:~ # xtlusfoadmin sdb lustre_failover table PRNID BKNID STATE 12 134 lus_state_up 18 26 lus_state_down 26 18 lus_state_up 62 INIT_STATE lus_state_down lus_state_down lus_state_down TIME 2008-01-16 14:32:46 2008-01-16 14:59:39 2008-01-16 14:31:32 S–0010–40 Lustre Failback [5] 6. Boot the failed node. 7. Start Lustre using lustre_control.sh from the boot node using the following command: boot:/etc/opt/cray/lustre-utils # ./lustre_control.sh \ filesystem.fs_defs start_recovery nid00018 Note: Use start_recovery instead of start to allow the Lustre recovery process to take place. 8. Check the recovery_status to see if it has completed: nid00018:~ # cat /proc/fs/lustre/obdfilter/fsname-OSTxxxx/recovery_status status: COMPLETE recovery_start: 1200517442 recovery_end: 1200517493 recovered_clients: 19 unrecovered_clients: 0 last_transno: 121155231 replayed_requests: 0 status: COMPLETE recovery_start: 1200517443 recovery_end: 1200517493 recovered_clients: 19 unrecovered_clients: 0 last_transno: 121151624 replayed_requests: 0 Use the following to check the status of MDS recovery: cat /proc/fs/lustre/mds/fsname-MDTxxxx/recovery_status 9. The failback process is finished and primary services on the failed OSS are successfully restored. Verify that by using the lctl command. nid00018:~ # lctl dl 0 UP mgc MGC12@gni 5f9c38d6-7466-216f-88c3-0b85b4e9be53 5 1 UP ost OSS OSS_uuid 3 2 UP obdfilter mds1-OST0000 mds1-OST0000_UUID 3 3 UP obdfilter mds1-OST0002 mds1-OST0002_UUID 9 S–0010–40 63 Managing Lustre for the Cray Linux Environment™ (CLE) 64 S–0010–40 Glossary CLE The operating system for Cray XE and Cray XT systems. CNL The CLE compute node kernel. CNL provides a set of supported system calls. CNL provides many of the operating system functions available through the service nodes, although some functionality has been removed to improve performance and reduce memory usage by the system. 
Cray DVS The Cray Data Virtualization Service (Cray DVS) is a distributed network service that provides compute nodes with transparent access to file systems on the service partition using the Cray high-speed network. Hardware Supervisory System (HSS) Hardware and software that monitors the hardware components of the system and proactively manages the health of the system. It communicates with nodes and with the management processors over the private Ethernet network. See also system interconnection network. login node The service node that provides a user interface and services for compiling and running applications. metadata server (MDS) The component of the Lustre file system that manages Metadata Targets (MDT) and handles requests for access to file system metadata residing on those targets. Modules A package on a Cray system that enables you to modify the user environment dynamically by using module files. (This term is not related to the module statement of the Fortran language; it is related to setting up the Cray system environment.) The S–0010–40 65 Managing Lustre for the Cray Linux Environment™ (CLE) user interface to this package is the module command, which provides a number of capabilities to the user including loading a module file, unloading a module file, listing which module files are loaded, determining which module files are available for use, and others. For example, the module command can be used to load a specific compiler and its associated libraries, or even a particular version of a specific compiler. node For CLE systems, the logical group of processor(s), memory, and network components that acts as a network end point on the system interconnection network. object storage server (OSS) The component of the Lustre file system that manages Object Storage Targets and handles I/O requests for access to file objects residing on those targets. object storage target (OST) The Lustre system component that represents an I/O device containing file data as file system objects. This can be any LUN, RAID array, disk, disk partition, etc. recovery window The time period in which Lustre servers wait for previously connected clients to reconnect. After the recovery window, the service will be available on either the restarted primary server or the backup server. service node A node that performs support functions for applications and system services. Service nodes run a version of SLES and perform specialized functions. There are six types of predefined service nodes: login, IO, network, boot, database, and syslog. system interconnection network The high-speed network that handles all node-to-node data transfers. System Management Workstation (SMW) The workstation that is the single point of control for system administration. See also Hardware Supervisory System (HSS). system set A group of partitions on the BootRAID (boot root, boot node swap, shared root, boot image, SDB, syslog, UFS, etc.) that make a complete, bootable system. 66 S–0010–40