Transcript
Configuring a SAS® Data Campus Using an EMC® E-Infostructure
Paul Padley, Program Manager, SAS – UK
Freda Cameron, Strategist, SAS – US
Martin Stiegler, EMC Applied Technologist, EMC - Europe
Copyright © 2001, SAS Institute Inc. All rights reserved.
Just As SAS Helps You Make Sense of Your Data…
SAS Has Teamed With EMC to Help You Organize Your Data Storage for Performance, Scalability and Reliability.
Presentation Topics
1. SAS/Direct™ for EMC Symmetrix®
2. Review of the SAS File System
3. EMC Enhances the SAS Environment
4. SAS-EMC Synergies
Infrastructure -- The IT Challenge
• Delivering
  • The right information
  • To the right people
  • At the right time
• Maintaining an information delivery system
• Managing growth -- especially with the e-business explosion
SAS -- The Power of a Proven Architecture
• Deliver a consistent view of information in the right format to the right people at the right time.
• Port data and applications to any platform and then scale upward as your business needs grow.
• Stay on the cutting edge of technology through the shared expertise of SAS, leading hardware manufacturers and other strategic SAS alliances.
EMC Symmetrix® -- Proven e-Infostructure
• EMC Symmetrix provides
  • Information protection
  • Information sharing
  • Information management
• Supports mainframes and open systems
• Powerful software supporting data sharing
...Introducing… SAS/DIRECT™ for EMC Symmetrix®
• Integrated solution for sharing SAS data between mainframe and open system environments.
• Advantages:
  • Keep data synchronized
  • Avoid duplicate storage
  • Eliminate local network traffic for data transfer
SAS/DIRECT™ for EMC Symmetrix®
[Diagram: mainframe system (read/write data access) and open system (read data access) attached to shared Symmetrix storage]
• Maintain a single centralized copy of the data
• Authenticate access via mainframe security
• Use data channel connection to access data
SAS/DIRECT™ for EMC Symmetrix® Requirements
[Diagram: mainframe system (read/write data access) and open system (read data access) attached to shared Symmetrix storage]
• Have SAS datasets created and maintained on MVS
• Use EMC Symmetrix as datastore for MVS SAS data
• Install EMC ESP and InfoMover to support shared access from multiple systems
• Open systems: HP/UX, Solaris, AIX (future), NT (future)
• Requires SAS v8.2
Implementation Notes
• Uses InfoMover technology.
• Requires mainframe and open system host to be directly attached to a shared Symmetrix.
• Accesses data directly using Symmetrix cycles.
• Initiated via standard SAS “LIBNAME” statement by a SAS user application.
SAS Solutions Deliver…
• Open and scalable architecture
• Rapid development of end-to-end systems
• Easy to adapt as needs change
• Integration with existing “deployment” systems and operational applications
Review of the SAS File System
Typical SAS Files and File Systems:
• Temporary work files in /saswork or other designated location.
• Permanent libraries in user-specified catalogs.
• Multidimensional database (MDDB) files.
• SAS log and list files.
• SAS executable files.
• Customization or profile data in /sasuser directory.
• Other system files.
[Figure labels: Data Sets, Programs, SAS Log]
The /saswork File System:
• The “workhorse” of a SAS environment is the /saswork file system.
[Figure labels: data, /saswork, “Permanent”, “Scratch”]
The SAS File System:
SAS libraries are permanent storage areas that contain data sets and catalogs that have valid two-level SAS names. SAS stores information in these libraries when a user issues a specific LIBNAME statement.
[Figure label: Data Sets]
Access SAS Libraries Directly, or Use SAS/SHARE Software.
SAS/SHARE is a data server that allows multiple users to gain simultaneous access to SAS files.
[Figure labels: Clients, Server, Storage]
MDDB Cubes
An MDDB cube is a multidimensional database that stores summarized data for fast-and-easy access, saved in a permanent SAS library.
[Figure labels: Request, Reach through]
SAS Log and SAS List Files
• During program execution, SAS writes information that is peripheral to specific data files or reports to the SAS log and list files.
[Figure label: SAS Log]
SAS Executable Files
• The SAS executable files and data contain the core SAS system.
[Figure label: Programs]
Configuring System Resources
• Place both swap and OS files on an I/O path independent from the path that /saswork uses.
[Figure labels: data, /saswork, OS files]
Configuring System Resources
• The swap space should be related to SAS memsize requirements and the expected number of concurrent SAS processes.
[Figure labels: memory, swap space, processes]
Configuring System Resources
• The amount of memory dedicated to file caching can dramatically affect read-ahead performance.
• Available memory directly affects the overall performance of a system.
  • Insufficient memory might cause execution errors or hard aborts in SAS programs.
Configuring System Resources
• System administrators usually impose constraints on swap space in the following order:
  • Available swap space memory.
  • Process ulimit.
  • Memsize (UNIX default 64MB)*.
  • Sortsize (UNIX default 32MB)*.
  • Sumsize.
* Other defaults apply to NT systems and to OS/390.
Configuring System Resources
• MVS specific:
  • The MVS JCL parameter REGION specifies the upper limit on memory that an application can use.
  • MEMSIZE can’t be larger than REGION -- REGION RULES!
  • The MEMLEAVE option can be used to specify the amount of REGION that SAS should not use.
Configuring System Resources
• MVS specific, for sequential access:
  • Consider a library block size that corresponds to half-track blocking (that is, 2 blocks per track).
  • Consider a page size for the SAS library members that is twice the library block size.
  • Consider a larger BUFNO than the default.
  • Consider using the "In-Memory File" (IMF) feature for a SAS file that will be accessed across many SAS steps (data / procedure), if the file is small enough to fit into the available region size.
Configuring System Resources
• MVS specific, for random access:
  • Consider choosing a library block size of 6K.
  • Explicitly set the member page size equal to the library block size.
  • Consider using the SASFILE statement to load a repetitively accessed file (such as a master file) into memory.
SAS Review
• Sufficient memory for processes and data.
• Swap file size related to memory needs and number of concurrent users.
• Increase I/O bandwidth by striping or aggregating across I/O devices.
• Separate I/O paths for OS, data, /saswork.
EMC Enhances the SAS Environment
EMC Symmetrix Systems
The Problem: Conflicting Goals
[Diagram: Production competing with Testing, Reports, Backup, Upgrades, Extractions]
• Negative production impact
• Contention for resources
• Negative availability impact
[Diagram: Production alongside Testing, Reports, Backup, Upgrades, Extractions]
• Parallel processing
• High availability
• Improved efficiency
Symmetrix Architecture
[Figure labels: ESCON, SCSI, FIBRE; Top-High, Top-Low, Bottom-High, Bottom-Low; Cache; DD]
• Handle I/O requests from the host to access the directory in cache and to determine if the request can be satisfied within cache.
EMC Symmetrix Systems
• Detect dynamically the sequential data access patterns to the disk devices by using a pre-fetch algorithm that anticipates data read requests and increases the possibility of satisfying I/O requests with data already in cache.
• Support connection to open UNIX, Windows 2000 and AS/400 systems with connectivity to SCSI and Fibre channels.
EMC PowerPath: AUTOMATIC PATH MANAGEMENT FOR THE SERVER
• The problem: How do I ensure optimal performance from the server?
• The traditional solution: Purchase additional connections for the server to handle peak load
• The E-Infostructure solution: PowerPath
  • Automatically balances load across all available connections from server
  • Makes the server run as fast as it can!
[Diagrams: applications on NT/UNIX servers and MVS sending requests to Symmetrix, without and with PowerPath]
EMC TimeFinder
• No host or network resources
• New copy can be used by another application or system (repurposed)
• Provides multiple copies of a single application volume
• Non-disruptive incremental resynchronization
• Protected restore
• Concurrent BCVs
[Figure: standard volumes (Sales, Billing) on Symmetrix with BCV 1, BCV 2 and BCV 3; example uses: backups, Web content refresh, data warehousing, application testing, third-party software updates, decision support]
TimeFinder Technology
• Performs an instant split across a group of devices using a single Consistent Split command
• With a business continuance volume (BCV), data that you store on the Symmetrix is continuously available, even from a remote site
[Figure: source volumes (Production 1 to Production n) and target volumes (BCV 1 to BCV n) connected by SRDF links]
InfoMover and the InfoMover File System (IFS)
• InfoMover provides UNIX users with a transparent native file system interface to the IBM OS/390 catalogs and data sets.
• Used with SAS/Direct, InfoMover provides SAS users who use UNIX operating systems with the ability to directly read and interpret SAS data sets created and stored by OS/390.
[Figure labels: UNIX Servers, MVS, Connectrix, SOURCE FILE, Symmetrix]
ControlCenter
• Manage both mainframe and open systems storage.
• View both physical and logical storage configurations to facilitate data placement and maximize volume availability.
• Define performance thresholds and monitor Symmetrix systems performance in real time against those thresholds.
[Figure label: Symmetrix Manager]
Options for Data Protection and Availability
[Figure labels: ESCON, SCSI, FIBRE; Top-High, Top-Low, Bottom-High, Bottom-Low; Cache; DD]
• Disk mirroring.
• RAID storage protection and redundancy.
• Data caching.
• Hot spares and replacements of individual components.
Considerations for Optimizing Performance in the Storage Pathway
• Distributing workloads by ranking them from the busiest to the least busy.
• The data storage capacity required for each host connected to Symmetrix.
• The number of channels available from each host.
• The nature of the applications executed on the host connected to Symmetrix.
• The availability of a logical volume manager (LVM) on the host and the use of data striping.
Considerations for Optimizing Performance in the Storage Pathway
• The use of logical volumes on Symmetrix, logical-volume size, and the allocation of logical volumes among different hosts, different channels and different applications.
• The maximum drive and file system sizes supported by each host connected to Symmetrix.
• Any requirements that might be needed for device sharing.
Considerations for Optimizing Performance in the Storage Pathway
• The number and type of channel directors and the number of ports on each director.
• Host-level mirroring: special considerations for device distribution in Symmetrix.
• The possibility of upgrading Symmetrix with additional drives in the future, and its effect on the configuration if the model installed is not at maximum capacity.
SAS-EMC Synergies
Synergies of SAS/Direct for EMC
• Keep data synchronized.
• Avoid duplicate storage.
• Eliminate local network traffic for data transfer.
[Figure label: SAN]
Synergies of EMC and SAS
• 7x24 data warehouse architectures.
• Increase I/O performance of servers connected to an EMC e-Infostructure.
• Operate on data with advanced analytics, OLAP, and data mining tools with SAS.
• Fast information flow and delivery to the end user.
A synergy working for you!
Resources
White papers, SAS & EMC partnership:
http://www.sas.com/partners/directory/emc/whitepaper.html
http://www.emc.com/techlib/abstracts/sas_emc_wp.jsp
SeUGI Proceedings CD
Contacts:
[email protected] (SAS - UK)
[email protected] (SAS - US)
[email protected] (SAS - US, SAS/EMC partnership)
[email protected] (EMC - EMEA)
Configuring a Data Campus
Guidelines for Building a SAS® Data Campus Using an EMC® E-Infostructure™
Table of Contents
Introduction
The EMC and SAS Solution
What This Paper Can Tell You
Understanding SAS File System Characteristics
  I/O Characteristics in a SAS Environment
  Basic File Types in a SAS Environment
    Temporary Work Files in /saswork
    Permanent Libraries
    MDDB Cubes
    SAS Log and SAS List Files
    SAS Executable Files
    Customized or Profile Data Files
    Other System Files
Adding Value with an EMC Enterprise Storage Solution
  Overview
  EMC Products that Enhance the SAS Environment
    PowerPath
    TimeFinder
    InfoMover and the InfoMover File System (IFS)
    EMC ControlCenter
Understanding EMC Enterprise Storage
  Options for Data Protection and Availability
  Configuring Physical Disks
  Optimizing Performance in the Storage Pathway
Guidelines for a SAS Data Campus on an EMC Symmetrix System
  Summary of Guidelines for Setting Up a SAS Data Campus on an EMC Symmetrix System
Conclusion
The Companies
  SAS
  EMC
Glossary
CONFIGURING A DATA CAMPUS
Introduction
In the current information explosion, businesses recognize the value of the information they can gather. The information that drives business decisions can have many forms, including historical data, transactional data, e-mail, Web-input, Web-log data, data warehouses, survey data, and data marts.
Figure 1: A “Data Campus”
Business analysis dictates that some data must be absolutely current. Other data captures historical or summary information and, therefore, has less time sensitivity. For some data you might need to create high-speed connections that give you real-time updates, while maintaining a well-marked path to other data. In addition, you want to protect the information you gather while ensuring that it is easily available, consistent, accurate and reliable.
You must also consider how the software environment is designed and what impact that design has on the performance, availability and maintainability of your configuration. Defining the basic layout of the storage environment is a critical step in creating an effective configuration. Building and managing an efficient data campus is a critical component of successful solution implementations when dealing with the ever-increasing volumes of data associated with customer relationship management (CRM), supplier relationship management (SRM), enterprise performance management (EPM), and risk analysis.
The EMC and SAS Solution
Today’s business computing environments must be robust enough to handle a range of processing tasks while also being scalable enough to meet increasing demands for more data, more computing power, more reliability and more complexity. Companies are amassing huge volumes of data about their businesses and customers. To meet the challenges of managing scalability and large data volumes, SAS has teamed with EMC. This partnership provides a powerful foundation for building a data campus that is robust and resilient and that provides the variety of access characteristics important in today’s information economy.
SAS helps you discover the knowledge that you need from the data you collect. SAS solutions run under a wide variety of hardware environments, enabling you to choose the computing resources that match the needs of your enterprise. The enterprise computing environment creates a base for business decision making by hosting powerful analysis tools and by organizing the information needed for that analysis.
EMC is an industry-recognized leader in managing information storage. EMC has a reputation for building a highly resilient information environment, an E-Infostructure, which protects valuable information while providing flexibility as your business requirements change. You can integrate EMC’s Symmetrix information storage systems with many different computer systems to manage, protect and share your information technology (IT) infrastructure. System administrators can implement a consistent data campus across a heterogeneous computing environment to simplify data management and sharing without compromising performance or data protection.
“EMC’s alliance with SAS gives our mutual customers integrated tools that allow them to capitalize on the vast amounts of business information they gather every day.”
Don Swatik, Vice President of Global Alliances, EMC
The collaboration between EMC and SAS provides an environment that can empower businesses with the knowledge they need to make critical decisions.
What This Paper Can Tell You
This paper details specific guidelines and considerations for setting SAS file system characteristics and for taking advantage of key EMC software products, such as InfoMover and TimeFinder, to build a robust data campus. EMC storage provides SAS users with an infrastructure that can grow incrementally to deliver hundreds of terabytes of fully protected data for analysis through SAS solutions.
The guidelines for creating a SAS environment apply across operating system (OS) environments, from Windows NT to UNIX to mainframes. The EMC Symmetrix information storage systems described in this paper are also supported across OS environments. Whether your host environment is a UNIX system on Sun, HP, IBM, Compaq or other hardware, an NT system, or an OS/390 environment, you can use the basic suggestions outlined in this paper.
By having an overall plan for implementing the storage environment, system administrators can enhance performance across the computing environment. Although tuning the storage and file system configuration cannot fix all performance issues, careful attention to the storage environment helps ensure that you take full advantage of the available computing resources.
Understanding SAS File System Characteristics
This section highlights the general use, interaction and access of SAS components to help you decide your overall file system layout. SAS applications perform a wide variety of tasks that might create data access patterns that differ from those described here. (The section Guidelines for a SAS Data Campus on an EMC Symmetrix System later in this paper summarizes the specific recommendations for assigning storage for SAS components in a typical SAS installation.)
The storage infrastructure adds value by transparently managing the availability, reliability and accessibility options in the application environment. A key point is that the storage environment houses more than just data. The programs, temporary work files, OS files and custom catalogs all share (and compete for) resources in the computing environment. Thus, you should distribute files and file systems across the storage environment to reduce resource competition as much as possible.
A single configuration running SAS can support more than one SAS application concurrently. In addition, SAS applications might have different workload characteristics. So, there is no single “right way” to configure an installation. You must consider all of the characteristics of your site when making configuration decisions.
I/O Characteristics in a SAS Environment
The I/O characteristics of a SAS environment are different from those usually seen in a transactional database environment. SAS typically accesses large files and often performs a single sequential scan of the data when reading data for analysis. Therefore, in a SAS environment, you should consider the following:
• Selected data is often copied from the source data file to a “scratch” file in the /saswork directory (or to a designated alternate location).
• The bandwidth available for I/O access is usually more important to SAS performance than raw I/O-per-second measurements. This distinction follows from the difference in data access patterns between SAS and an online transactional processing (OLTP) system.
• SAS data access patterns are predominantly sequential and benefit from OS read-ahead algorithms. Thus, the amount of memory dedicated to file caching can dramatically affect read-ahead performance. On a system dedicated to SAS applications, the effectiveness of file caching is a function of physical memory, the number of concurrently executing SAS processes and the memsize configuration within SAS, which controls process memory utilization as well as file cache settings.
• Read/write access patterns on /saswork typically approach a 50/50 mix.
• SAS does not use asynchronous I/O. Therefore, you might find it valuable to aggregate bandwidth or to stripe file systems across independent I/O devices (disks and controllers) to increase bandwidth for I/O.
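
The difference between bandwidth and raw I/O-per-second can be illustrated with simple arithmetic. The following Python sketch is illustrative only: the I/O rates, transfer sizes and function names are assumptions chosen for the example, not SAS or EMC measurements, and the striping estimate is an idealized upper bound.

```python
def throughput_mb_per_s(io_per_second: float, io_size_kb: float) -> float:
    """Sustained bandwidth in MB/s for a given I/O rate and transfer size."""
    return io_per_second * io_size_kb / 1024.0


def scan_time_seconds(file_size_mb: float, bandwidth_mb_per_s: float) -> float:
    """Time for one sequential pass over a file at a given bandwidth."""
    return file_size_mb / bandwidth_mb_per_s


def striped_bandwidth(per_device_mb_per_s: float, devices: int) -> float:
    """Idealized aggregate bandwidth; real striping rarely scales perfectly."""
    return per_device_mb_per_s * devices


if __name__ == "__main__":
    # Hypothetical workloads: many small random I/Os vs. fewer large sequential I/Os.
    oltp_style = throughput_mb_per_s(io_per_second=2000, io_size_kb=4)
    scan_style = throughput_mb_per_s(io_per_second=400, io_size_kb=64)
    print(f"OLTP-style: {oltp_style:.1f} MB/s, scan-style: {scan_style:.1f} MB/s")

    ten_gb = 10 * 1024  # file size in MB
    print(f"10 GB scan on one device: {scan_time_seconds(ten_gb, scan_style):.0f} s")
    striped = striped_bandwidth(scan_style, devices=4)
    print(f"10 GB scan on four striped devices: {scan_time_seconds(ten_gb, striped):.0f} s")
```

In this hypothetical comparison, the workload with the higher I/O-per-second figure moves less data per second, which is why a sequential scan of a large SAS file is better characterized by bandwidth.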
Basic File Types in a SAS Environment
Typical SAS installations have the following types of files and file systems:
• Temporary work files in /saswork or other designated location.
• Permanent libraries in user-specified catalogs.
• Multidimensional database (MDDB) files in specialized data subsets.
• SAS log and list files.
• SAS executable files.
• Customization or profile data in users’ home space.
• Other system files.
Temporary Work Files in /saswork
The “workhorse” of a SAS environment is the /saswork file system. The /saswork file system provides temporary “scratchpad” working space for SAS applications. /saswork is conceptually shared space; however, there might be times when you find it necessary to create more than one /saswork file system to reduce competition for resources.
The /saswork file system houses most of the interim work files that are created during execution of a SAS program or process. A subdirectory for each process is created dynamically within /saswork. Each /saswork/“process_subdirectory” can contain a range of data objects, including a temporary copy of permanent data sets, intermediate sort or extract files, and so on. SAS cleans up these interim files (deletes them) only at process completion, although the files are closed at step boundaries. Thus, each /saswork/“process_subdirectory” can grow to as much as four to five times the size of the base data set.
Multiple SAS users can access the same SAS data set(s) concurrently, which creates multiple paths into the data and can produce access conflicts.
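
The four-to-five-times growth figure above can be turned into a rough capacity estimate. The Python sketch below is a planning aid under stated assumptions: it assumes every concurrent process works from a base data set of the same size and reaches the worst-case growth at the same time. The function name and the example figures are hypothetical.

```python
def saswork_estimate_gb(base_dataset_gb: float,
                        concurrent_processes: int,
                        growth_factor: float = 5.0) -> float:
    """Worst-case /saswork space if each process subdirectory grows to
    growth_factor times the size of its base data set."""
    if base_dataset_gb < 0 or concurrent_processes < 0 or growth_factor < 0:
        raise ValueError("sizes, counts and factors must be non-negative")
    return base_dataset_gb * growth_factor * concurrent_processes


if __name__ == "__main__":
    # Hypothetical site: ten concurrent processes, each starting from a 2 GB data set.
    low = saswork_estimate_gb(2, 10, growth_factor=4.0)
    high = saswork_estimate_gb(2, 10, growth_factor=5.0)
    print(f"Plan for roughly {low:.0f} to {high:.0f} GB of /saswork space")
```

Because interim files are deleted only at process completion, long-running processes are the ones most likely to approach the upper end of this range.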
Permanent Libraries
SAS libraries are permanent storage areas that contain data sets and catalogs that have valid two-level SAS names. SAS stores information in these libraries when a user issues a specific LIBNAME statement. Users can access SAS libraries directly, or they can use SAS/SHARE software. SAS/SHARE is a data server that allows multiple users to gain simultaneous access to SAS files. SAS/SHARE anticipates the many combinations of hardware that you might need to access your data, then locates and delivers data to meet these multiple requests. Accessing a file by using SAS/SHARE generally produces smaller, non-sequential data requests.
MDDB Cubes
SAS programs can create aggregate data cubes, sometimes called an “MDDB cube” or simply a “cube.” An MDDB cube is a multidimensional database that stores summarized data for fast-and-easy access. When a SAS program creates a cube for re-use (for example, as a data mart that SAS reads and analyzes repeatedly), you must save the cube in a permanent SAS library. When you are analyzing a cube, SAS often reads the data into temporary files in /saswork. To reduce contention between accessing the source data and performing an analysis in /saswork, you should create the SAS library and /saswork on separate I/O paths, if possible.
By default, SAS creates the MDDB cube in /saswork if you do not specify a library. A user can use the LIBNAME statement to specify an MDDB cube location; however, the system administrator cannot re-direct these files or control their placement. Because an MDDB cube can be large, we recommend that you use the library specification for placement.
An MDDB cube is different from a SAS data set because the cube contains embedded information, which includes sub-tables and navigational information, such as hierarchies. The reading access pattern for an MDDB cube is pseudo-random.
SAS Log and SAS List Files
During program execution, SAS writes information that is peripheral to specific data files or reports to the SAS log and list files. The user can control the amount of information SAS writes to these files by using options, such as SOURCE, NOTES, SOURCE2, MPRINT and other debugging and informational options. The SAS log and SAS list files are information logs. They are not recovery logs.
SAS Executable Files
The SAS executable files and data contain the core SAS System. You should follow the recommendations distributed with each SAS software release when installing SAS.
SAS components typically include shared libraries, Java applets, utility programs, test and sample SAS code, product catalogs, message files, script files, and maps (data sets that contain coordinates for maps). SAS executable files exhibit the same characteristics as other executable files and are usually memory resident when given sufficient memory allocation.
Customized or Profile Data Files
By default, SAS stores configuration files that define a user’s preferences and session settings in the location specified by /sasuser: each user’s home directory or, on a multiple-user storage system (OS/390), a bound library under the user’s userid. Customization and profile data files contain information such as color settings, fonts, window placement, and transient files that support graphics and interactive sessions. These files are typically small and, therefore, have a negligible impact on performance evaluation.
Other System Resources
System resources important to SAS include swap space and the location of OS executables. Consider the following when configuring your SAS installation:
• Place both swap and OS files on an I/O path independent from the path that /saswork uses.
• If SAS shares a computer system with a relational database management system (RDBMS) or an online transactional processing (OLTP) system, use a separate I/O path from SAS data, applications and workspaces.
• The swap space should be related to SAS memsize requirements and the expected number of concurrent SAS processes. If memsize is large, you should consider increasing the size of the system’s swap space.
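
One way to relate swap space to memsize and the number of concurrent SAS processes is a simple product. The Python sketch below uses that rule of thumb as an assumption; this paper does not give a formula, and the function name, the OS reserve parameter and the process count are hypothetical.

```python
def estimated_swap_mb(memsize_mb: int,
                      concurrent_processes: int,
                      os_reserve_mb: int = 0) -> int:
    """Assumed rule of thumb: allow every concurrent SAS process to reach
    its full memsize, plus an optional reserve for the operating system."""
    if memsize_mb < 0 or concurrent_processes < 0 or os_reserve_mb < 0:
        raise ValueError("all inputs must be non-negative")
    return memsize_mb * concurrent_processes + os_reserve_mb


if __name__ == "__main__":
    # 64 MB is the UNIX memsize default cited in this paper;
    # 20 concurrent processes and a 512 MB reserve are hypothetical.
    print(estimated_swap_mb(64, 20, os_reserve_mb=512), "MB")
```

Raising memsize for large jobs scales this estimate linearly, which is consistent with the advice to increase swap space when memsize is large.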
Available memory directly affects the overall performance of a system. Insufficient memory might cause execution errors or hard aborts in SAS programs. System Administrators usually impose memory constraints in the following order (although no internal checks require SAS parameters to be lower than ulimit):
• Available swap space.
• Process ulimit.
• Memsize (UNIX default 64MB)*.
• Sortsize (UNIX default 32MB)*.
• Sumsize.
* Other defaults apply to NT systems and to OS/390.
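The ordering above can be checked mechanically before a system goes into service. The following is a minimal sketch, not a SAS utility: the function name check_memory_limits and its parameters are hypothetical, the values are entered by hand rather than read from SAS or the operating system, and the rule that memsize multiplied by the number of concurrent sessions should fit in swap is an assumed rule of thumb derived from the guidance above.

```python
def check_memory_limits(swap_mb, ulimit_mb, memsize_mb=64, sortsize_mb=32,
                        sumsize_mb=None, concurrent_sessions=1):
    """Return warnings where a limit exceeds the one listed before it.

    Defaults for memsize and sortsize are the UNIX defaults quoted in
    the text; every other value must be supplied by the administrator.
    """
    ordered = [
        ("swap space", swap_mb),
        ("process ulimit", ulimit_mb),
        ("memsize", memsize_mb),
        ("sortsize", sortsize_mb),
        ("sumsize", sumsize_mb),
    ]
    # Skip limits that were not supplied (sumsize has no default here).
    ordered = [(name, value) for name, value in ordered if value is not None]

    warnings = []
    for (outer_name, outer), (inner_name, inner) in zip(ordered, ordered[1:]):
        if inner > outer:
            warnings.append(
                f"{inner_name} ({inner} MB) exceeds {outer_name} ({outer} MB)")

    # Assumed rule of thumb: all concurrent sessions should fit in swap.
    demand = memsize_mb * concurrent_sessions
    if demand > swap_mb:
        warnings.append(
            f"{concurrent_sessions} sessions x memsize = {demand} MB "
            f"exceeds swap space ({swap_mb} MB)")
    return warnings


if __name__ == "__main__":
    for line in check_memory_limits(2048, 512, memsize_mb=1024,
                                    concurrent_sessions=4):
        print("WARNING:", line)
```

Because SAS itself does not enforce this ordering, a check of this kind only reports inconsistencies; deciding which limit to change remains a system administration decision.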
Adding Value with an EMC Enterprise Storage Solution
This section provides product descriptions and a summary of key features in the EMC Enterprise Storage solution. It also describes EMC’s underlying storage system architectural philosophy, which is based on the complementary MOSAIC:2000 hardware architecture and Enginuity Operating Environment, and summarizes the technical highlights of Symmetrix Enterprise Storage systems capabilities that benefit mutual EMC and SAS customers.
Overview
A SAS application analyzing large volumes of data can stress a storage environment. Thus, managing the storage layout to optimize performance and to avoid potential problems for mutual EMC and SAS customers is critical. Symmetrix systems greatly exceed the performance and response time of conventional disk storage systems because the majority of data is transferred to and from cache at electronic memory speeds, not at the dramatically slower speeds of physical disk devices. Director boards (those connecting to a host and those connecting to the disks) are the means by which the data interfaces with the cache. EMC designs the director boards to work in pairs so that each director connects to two buses. This dual connection ensures access to data in the unlikely event of failure of either bus. Symmetrix systems:
• Accommodate up to 32GB of cache, with all read/write operations transferring data to or from cache.
• Handle I/O requests from the host by accessing the directory in cache to determine if the request can be satisfied within cache.
• Dynamically detect sequential data access patterns to the disk devices by using a prefetch algorithm that anticipates data read requests and increases the possibility of satisfying I/O requests with data already in cache. Because SAS typically reads and analyzes data in a sequential pattern, SAS queries can benefit from this read-ahead processing.
• Support connection to open UNIX, Windows 2000 and AS/400 systems over SCSI and Fibre Channel connections.
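The benefit of read-ahead for sequential workloads can be illustrated with a toy model. The following is a minimal sketch, not EMC’s prefetch algorithm: the class name PrefetchCache, the run threshold of three consecutive blocks, the prefetch depth and the cache capacity are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class PrefetchCache:
    """Toy LRU block cache that reads ahead after a sequential run."""

    def __init__(self, capacity=64, run_threshold=3, prefetch_depth=4):
        self.capacity = capacity
        self.run_threshold = run_threshold    # sequential reads before prefetch
        self.prefetch_depth = prefetch_depth  # blocks fetched ahead
        self.blocks = OrderedDict()           # block number -> present
        self.last_block = None
        self.run_length = 0
        self.hits = 0
        self.misses = 0

    def _load(self, block):
        # Insert (or refresh) a block and evict the least recently used.
        self.blocks[block] = True
        self.blocks.move_to_end(block)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)

    def read(self, block):
        if block in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block)
        else:
            self.misses += 1
            self._load(block)

        # Track how many consecutive block numbers have been requested.
        if self.last_block is not None and block == self.last_block + 1:
            self.run_length += 1
        else:
            self.run_length = 1
        self.last_block = block

        # Once the pattern looks sequential, stage the next blocks in cache.
        if self.run_length >= self.run_threshold:
            for ahead in range(block + 1, block + 1 + self.prefetch_depth):
                self._load(ahead)


if __name__ == "__main__":
    scan = PrefetchCache()
    for b in range(100):          # sequential scan, as in a SAS data step
        scan.read(b)
    print("sequential hits/misses:", scan.hits, scan.misses)
```

In this model a sequential scan misses only until the run is detected, whereas a strided or random pattern never triggers the read-ahead, which is the qualitative behavior the text attributes to sequential SAS reads.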
EMC’s Symmetrix systems that are licensed for Enterprise Storage Platform (ESP) functionality can simultaneously support mainframe and open system connections, a capability that EMC describes as unmatched in the industry today. This connectivity allows for a single repository of data, reduces management costs and simplifies system administration. This level of Symmetrix connectivity enables real-time support of multiple hosts and host types for greater configuration flexibility and the fulfillment of EMC’s Symmetrix systems philosophy. SAS users who need to share data between OS/390 and open systems can achieve enhanced sharing by using SAS/Direct for EMC Symmetrix. With this software, users on open systems can access and process SAS data sets that are created and maintained on OS/390 systems. Moreover, the open systems users can access data over channel connections, which provides fast access with minimal impact on the local network.
EMC Products that Enhance the SAS Environment
EMC storage solutions provide an infrastructure that can manage the data used in a SAS environment. EMC supports its Symmetrix customers’ needs with software solutions that offer unique business-enhancing capabilities in a SAS environment. The sections that follow provide a technical overview of five of these EMC products: PowerPath, TimeFinder, InfoMover, EMC ControlCenter Suite and EMC ControlCenter Workload Analyzer. See www.emc.com/products/software for further technical details on the capabilities of EMC’s Symmetrix products.
PowerPath
Intelligent path management is critical to achieving optimal performance when running applications that process high volumes of data. EMC PowerPath is host-based software that works with the Symmetrix system to deliver intelligent I/O path management. PowerPath enhances Symmetrix capabilities by dynamically load balancing I/O requests and by automatically detecting and recovering from host-to-Symmetrix connection failures. PowerPath ensures that all paths are optimally used and fully protected by providing the following:
• Dynamic load balancing.
  - Dynamically configures multiple paths and optimizes path selection as workloads change. You do not need to pre-assign devices to specific channels; they are dynamically distributed across multiple channels based on the I/O workload.
  - Supports multiple channel access to a Symmetrix volume.
  - Improves the host’s ability to manage heavy storage loads through continuous, intelligent I/O balancing.
• Automatic path failover.
  - Adds to the high-availability capabilities of Symmetrix systems by automatically re-directing data to an alternate data path if an active service path fails. This eliminates application down time.
  - Makes failovers transparent and non-disruptive to applications.
• Automatic on-line recovery.
  - Periodically tests the failed I/O paths to determine if a path has been repaired. If a path passes the test, PowerPath returns that path to service and resumes sending I/O to it.
  - Periodically issues I/O to paths considered active to ensure that they are still working.
• Consistency groups.
  - Maintains the integrity of a database distributed across multiple Symmetrix Remote Data Facility (SRDF) units by exploiting the Symmetrix awareness functionality.
See www.emc.com/products/software/powerpath.jsp?openfolder=storage_software for additional information on EMC PowerPath.
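The load balancing, failover and on-line recovery behaviors listed for PowerPath can be pictured with a small model. The following is an illustrative sketch only and does not represent PowerPath’s implementation or interface: the class PathManager, its methods, the path names and the least-outstanding-I/O selection policy are all assumptions.

```python
class PathManager:
    """Toy model of multipath I/O: balance, fail over, return to service."""

    def __init__(self, paths):
        self.pending = {p: 0 for p in paths}  # outstanding I/Os per path
        self.failed = set()

    def active_paths(self):
        return [p for p in self.pending if p not in self.failed]

    def submit(self):
        """Send one I/O down the active path with the fewest pending I/Os."""
        candidates = self.active_paths()
        if not candidates:
            raise RuntimeError("no active path to the storage system")
        path = min(candidates, key=lambda p: self.pending[p])
        self.pending[path] += 1
        return path

    def complete(self, path):
        """Record that one I/O on the given path has finished."""
        self.pending[path] -= 1

    def mark_failed(self, path):
        """Take a path out of service; later I/O uses the remaining paths."""
        self.failed.add(path)
        self.pending[path] = 0  # assume pending I/O is reissued elsewhere

    def retest(self, probe):
        """Probe failed paths and return repaired ones to service.

        probe is a caller-supplied function: probe(path) -> bool.
        """
        for path in list(self.failed):
            if probe(path):
                self.failed.discard(path)


if __name__ == "__main__":
    pm = PathManager(["hba0", "hba1"])       # hypothetical path names
    print(pm.submit(), pm.submit())          # I/O spreads over both paths
    pm.mark_failed("hba0")
    print(pm.submit())                       # fails over to hba1
    pm.retest(lambda path: True)             # pretend the path was repaired
    print(pm.submit())                       # hba0 is back in service
```

A real multipath driver would also weigh I/O size and device queue depth; the point of the sketch is only that path selection, failover and recovery are decisions made on the host, without devices being pre-assigned to channels.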
TimeFinder
Business users can access data when they need it only if the data is continuously available. The reality of IT, however, is that there are conflicting demands for data. For example, a backup and an update cannot occur against the same data at the same time. EMC’s TimeFinder technology enables you to create multiple, synchronized copies of data. You can use these “hot mirrors” or “point-in-time copies” for independent activities and later re-synchronize the independent copies of the data. A special data storage definition called Business Continuance Volume (BCV) keeps data that you store on the Symmetrix continuously available. BCVs provide a standard data environment to the application and the operating systems, and TimeFinder manages the synchronization and separation of the volumes. Data that is copied on BCVs can be used for backup, restoration, decision support and applications testing. You can create separate images of data for each of these purposes without impacting production copies of data. One potential application of TimeFinder is taking images of the MDDB from the BCV device for data analysis. Separate groups performing this task might each impose heavy processing demands on a single source of data. Using TimeFinder, however, allows you to create a mirror image of a single instance of data and make that data available to multiple users.
See www.emc.com/products/software/timefinder.jsp?openfolder=storage_software for additional information on EMC TimeFinder.
InfoMover and the InfoMover File System (IFS)
As data sources expand and additional processing power is added to support a growing IT infrastructure, the ability to share data among different operating environments becomes critical. InfoMover meets this need by using enhanced connectivity options that make data sharing possible. InfoMover and the IFS environment enable UNIX operating systems to read OS/390 files directly, eliminating the need to move data and saving valuable file transfer time. InfoMover provides UNIX users with a transparent native file system interface to the IBM OS/390 catalogs and data sets. Used with SAS/Direct, InfoMover provides SAS users who use UNIX operating systems with the ability to directly read and interpret SAS data sets created and stored on OS/390. This access method reads the data directly over the data channel and, after the UNIX users’ authority to access the data is established, does not require processing cycles on OS/390. This shared access makes it possible to have a single, centrally located repository of data accessible from both UNIX and OS/390 hosts.
See www.emc.com/products/software/infomover.jsp?openfolder=storage_software for additional information on EMC InfoMover.
EMC ControlCenter
EMC ControlCenter enables you to monitor, configure, control, tune and plan storage across an entire EMC Symmetrix storage network from a single management console. Using EMC ControlCenter, you can:
• Manage both mainframe and open systems storage.
• View both physical and logical storage configurations to facilitate data placement and maximize volume availability.
• Define performance thresholds and monitor Symmetrix systems performance in real time against those thresholds.
See www.emc.com/products/storage_management/controlcenter.jsp?openfolder=storage_management for additional information on EMC ControlCenter.
Understanding EMC Enterprise Storage
The ability to consistently access reliable data in a timely manner is an achievement built on many components. Although disk drives, as a mechanical medium, might fail in any configuration, tools and processes exist to minimize the likelihood of a program error or data loss due to failure of any single disk drive. These tools include redundancy techniques, such as a redundant array of inexpensive disks (RAID), that duplicate some data and that can continue to access data even if a disk drive fails. Careful organization of the storage environment is a key step in optimizing performance of the environment. The following sections provide background on the technology and outline considerations for configuring the physical storage environment.
Options for Data Protection and Availability
Because disk arrays use disks as a collective device, they can provide higher reliability than individual units. By adding front-end caches and some data access intelligence, disk arrays also help improve access speeds while maintaining reliability. When you combine operational procedures with storage options, such as disk mirroring, you can create processes that minimize the disruption caused by necessary and repetitive tasks, such as backups and extracts from operational systems. EMC Symmetrix storage systems implement a broad range of storage protection and acceleration techniques, which include:
• Disk mirroring.
• RAID storage protection and redundancy.
• Data caching.
• Hot spares and replacements of individual components.
Before laying out a file system or implementing SAS on an EMC Symmetrix system, it is important that the system administrator understands the options that EMC has available regarding the levels of data protection and availability. EMC has enhanced the RAID level definitions in each of the implementations of data protection offered for Symmetrix systems. You can implement RAID techniques in hardware or software (or both). Most sites with high data volumes choose to implement RAID storage options in hardware using disk arrays. Disk arrays typically offer additional performance and availability options beyond basic RAID techniques.
Using RAID in conjunction with the EMC software that directly manages the storage environment (TimeFinder and BCV, InfoMover, and PowerPath), you can create a dynamic storage system that enables powerful options, such as “breaking” a mirror to execute a backup while continuing operations or “breaking” a mirror to extract data without interrupting an operational system. The Symmetrix system implementations of data protection exploit functionality that differentiates the EMC offering from other RAID offerings and can add significant value to mutual EMC and SAS customers.
You can select the protection schemes that best serve your needs (mirroring, parity RAID, remote mirroring or SRDF, or dynamic sparing) to balance availability, performance and cost for individual data sets or file systems. The options you select are configurable at the physical volume level so that you can apply different levels of protection to different file systems within the same Symmetrix system.
• Mirroring and parity RAID techniques balance general performance and availability of data for all task-critical and business-critical applications. Mirroring maintains a duplicate copy of volumes on two disk devices; parity RAID instead stores parity information from which lost data can be rebuilt. EMC Symmetrix storage systems provide flexibility in selecting and combining appropriate protection options.
• SRDF is an enhanced version of real-time mirroring between multiple Symmetrix systems that can include remote and multiple sites.
• Dynamic sparing reserves disk drives as standby spares. This option increases data availability without impacting performance and can be used with other data protection techniques.
Configuring Physical Disks
In addition to the data protection and availability schemes that are defined in the previous section, it is important for a SAS system administrator to have a conceptual understanding of the configuration options within the Symmetrix system from a volume layout perspective. The Symmetrix enhances disk system functionality by supporting up to 32 logical (or hyper) volumes on one physical device. These logical volumes are the actual volumes with which a host communicates, and you can add logical volumes if you require more capacity. The Symmetrix system supports a maximum of 4,096 logical volumes. The system administrator creates file systems on these logical volumes. Knowing the location of the logical volumes in relation to the actual physical spindle can be very important when installing and using applications that can generate large amounts of I/O, such as a SAS application, and when placing data and creating file systems. Using tools such as EMC ControlCenter Suite and the EMC ControlCenter Workload Analyzer, you can gather information about the busy volumes and disks, and use this detailed information to identify performance and configuration issues that, otherwise, might have gone undetected.
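One way to reason about logical-volume placement is to record which physical spindle backs each logical volume and then look for file systems that share a spindle. The following is a minimal sketch under stated assumptions: the layout data, volume names and the function spindle_conflicts are hypothetical and entered by hand, whereas in practice this information would come from a tool such as EMC ControlCenter. Only the two limits (32 logical volumes per physical device, 4,096 logical volumes per system) come from the text.

```python
from collections import Counter, defaultdict

MAX_HYPERS_PER_DEVICE = 32   # logical volumes per physical device (from text)
MAX_LOGICAL_VOLUMES = 4096   # logical volumes per Symmetrix (from text)


def validate_layout(volume_to_spindle):
    """Check a hand-entered layout against the limits quoted in the text."""
    if len(volume_to_spindle) > MAX_LOGICAL_VOLUMES:
        raise ValueError("more logical volumes than the system supports")
    per_spindle = Counter(volume_to_spindle.values())
    for spindle, count in per_spindle.items():
        if count > MAX_HYPERS_PER_DEVICE:
            raise ValueError(f"{spindle} holds {count} logical volumes")


def spindle_conflicts(volume_to_spindle, fs_to_volumes):
    """Return {spindle: [file systems]} for spindles shared by file systems."""
    users = defaultdict(set)
    for fs, volumes in fs_to_volumes.items():
        for vol in volumes:
            users[volume_to_spindle[vol]].add(fs)
    return {sp: sorted(fss) for sp, fss in users.items() if len(fss) > 1}


if __name__ == "__main__":
    # Hypothetical layout: four logical volumes on three physical disks.
    layout = {"lv00": "disk_a", "lv01": "disk_a",
              "lv02": "disk_b", "lv03": "disk_c"}
    file_systems = {"/saswork": ["lv00", "lv02"],
                    "/saslib": ["lv01", "lv03"]}
    validate_layout(layout)
    print(spindle_conflicts(layout, file_systems))
```

In the hypothetical layout, /saswork and /saslib both land on disk_a, which is the kind of hidden contention the text warns about when file systems are created without regard to the underlying spindles.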
Optimizing Performance in the Storage Pathway
Achieving optimal performance requires careful and detailed planning of the Symmetrix configuration according to the requirements of the host(s) you are connecting to and your performance needs. Review the following issues carefully with your EMC representative before the EMC Customer Engineer installs the Symmetrix system:
• The distribution of workloads, ranked from the busiest to the least busy.
• The data storage capacity required for each host connected to Symmetrix.
• The number of channels available from each host.
• The nature of the applications executed on the host connected to Symmetrix.
• The availability of a Logical Volume Manager (LVM) on the host and the use of data striping.
• The use of logical volumes on Symmetrix, logical-volume size, and the allocation of logical volumes among different hosts, different channels and different applications.
• The maximum drive and file system sizes supported by each host connected to Symmetrix.
• Any device-sharing requirements.
• The number and type of channel directors and the number of ports on each director.
• Special considerations that host-level mirroring imposes on device distribution in Symmetrix.
• The possibility of upgrading Symmetrix with additional drives in the future, and its effect on the configuration if the model installed is not at maximum capacity.
See www.emc.com for additional information about the performance characteristics, architecture and software available for the EMC Symmetrix Enterprise Storage system.
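The first planning item, ranking workloads from busiest to least busy before distributing them, can be sketched as a simple greedy assignment. This is an illustration under stated assumptions rather than an EMC planning procedure: the function distribute, the workload names, the I/O rates and the channel names are all hypothetical.

```python
import heapq


def distribute(workloads, channels):
    """Assign workloads to channels, busiest first, least-loaded channel next.

    workloads: {name: estimated I/O rate}; channels: list of channel names.
    Returns {channel: [workload names]}.
    """
    # Min-heap of (accumulated load, channel) so the lightest channel pops first.
    heap = [(0, ch) for ch in channels]
    heapq.heapify(heap)
    plan = {ch: [] for ch in channels}

    ranked = sorted(workloads.items(), key=lambda kv: kv[1], reverse=True)
    for name, load in ranked:
        total, ch = heapq.heappop(heap)
        plan[ch].append(name)
        heapq.heappush(heap, (total + load, ch))
    return plan


if __name__ == "__main__":
    # Hypothetical I/O rates, for illustration only.
    rates = {"etl": 900, "reports": 400, "adhoc": 300, "olap": 200}
    print(distribute(rates, ["ch0", "ch1"]))
```

Placing the busiest workload first keeps it from sharing a channel with other heavy work; the remaining, lighter workloads then fill in around it.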
Guidelines for a SAS Data Campus on an EMC Symmetrix System
We based these general guidelines for designing a SAS data campus on observation and testing at various SAS installations. Because each user’s interaction with SAS is unique, you might need to adjust these guidelines to meet the specific requirements at your site. Each SAS product includes installation instructions. The following guidelines complement the notes for specific products and provide some general information for creating the infrastructure of your SAS environment. System Administrators need to monitor any installation and adjust these guidelines to meet the needs of users while maintaining optimal performance and evaluating cost tradeoffs.
1. Create a /saswork file system. All SAS users at your site will share this file system unless you use multiple /saswork file systems. The EMC ControlCenter Resource View software provides a view into capacity and resource management of this and all other file systems on the Symmetrix.
1.1. Stripe across multiple host bus adapters (HBAs) and disks. Use PowerPath to ensure optimal path utilization.
1.2. Use a default stripe size (extent size) of 64K blocks, 8K fragments in the UNIX file system environment. Because of the read-ahead impact, reduce the block size as you add more spindles within a RAID stripe.
1.3. Each SAS process can create demands on temporary space within /saswork. Total size of the /saswork file system should be four to five times as large as the largest data set. If you are running multiple processes concurrently, you should use the total number of concurrently open data sets as the base for this calculation.
2. Consider creating multiple /saswork file systems.
2.1. If you observe I/O wait states without corresponding “device busy” flags, then you have potentially saturated the defined I/O channels. You might discover this problem through routine monitoring or through an investigation prompted by performance complaints. If the I/O wait states are associated with “device busy” flags, adding more resources (that is, disk spindles) to the stripe might help improve overall performance. The EMC ControlCenter Workload Analyzer can identify detailed performance statistics on the Symmetrix disk array, such as busy devices or channel utilization, as well as host-specific performance data.
2.2. If you have preferred workloads and want to guarantee that specific SAS programs do not compete for resources, you might want to partition the SAS work areas by creating multiple /sasworkn file systems. This is strictly a system administration issue. If you create multiple /sasworkn file systems, you should allocate disk resources to accommodate four to five times the size of the largest data set that will be accessed in that work area. Simply dividing the space allocated to /saswork among the multiple /sasworkn file systems is not an appropriate way to allocate disk resources. Using EMC ControlCenter Resource View, administrators can configure file systems across different physical spindles and avoid disk contention. EMC ControlCenter Resource View provides a front-to-back view of logical volumes and file systems all the way to the actual physical disks.
2.3. Continue to stripe each /saswork work area across adapters and controllers. Don’t try to isolate a file system to a single controller. PowerPath ensures optimal path or channel utilization. If you increase the number of devices within a stripe, you might also consider reducing the stripe size. Reducing the stripe size might increase the I/O rate, but it might also reduce throughput. Therefore, you should consider the balance of these two objectives before reducing the stripe size. You can monitor I/O rate and throughput through the EMC ControlCenter Workload Analyzer.
3. Avoid using the default /var/tmp for /saswork. If you do not specify a file system volume for /saswork, its location defaults to /var/tmp. We do not recommend /var/tmp as a location for the /saswork file system because System Administrators usually allocate a small work size to /var/tmp when creating it. Also, some OS environments default /var/tmp to be a memory mapped (mmap) file system. Because the /saswork file system can be very large, it is usually inappropriate to use the mmap attribute for /saswork.
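The sizing rule in guideline 1.3 and the diagnostic reasoning in guideline 2.1 can be written down as two small helper functions. The following is a minimal sketch, not a SAS or EMC tool: the function names are hypothetical, the sizes are supplied by hand, and treating the combined size of the concurrently open data sets as the base for concurrent workloads is one reading of guideline 1.3, labeled as an assumption in the code.

```python
def saswork_size_gb(open_dataset_sizes_gb, factor=5):
    """Estimate /saswork capacity from the data sets open at the same time.

    factor is the four-to-five multiplier from guideline 1.3.
    """
    if not 4 <= factor <= 5:
        raise ValueError("the guideline gives a factor between 4 and 5")
    if not open_dataset_sizes_gb:
        raise ValueError("at least one data set size is required")
    # Assumption: with several concurrent processes, the base is the combined
    # size of the concurrently open data sets; with one, it is that data set.
    base = sum(open_dataset_sizes_gb)
    return factor * base


def diagnose_io_wait(io_wait_observed, device_busy_flagged):
    """Map the observations in guideline 2.1 to the suggested next step."""
    if not io_wait_observed:
        return "no I/O wait states observed"
    if device_busy_flagged:
        return "devices busy: adding disk spindles to the stripe might help"
    return "I/O channels potentially saturated: review channel utilization"


if __name__ == "__main__":
    # Hypothetical example: two jobs with 10 GB and 4 GB data sets open.
    print(saswork_size_gb([10, 4], factor=4), "GB minimum")
    print(diagnose_io_wait(True, False))
```

When multiple /sasworkn file systems are used, the same sizing function would be applied to each work area separately, in line with guideline 2.2, rather than dividing one total among them.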
Summary of Guidelines for Setting Up a SAS Data Campus on an EMC Symmetrix System

File Type: /sasuser
General Info: Defaults to ~HOME/sasuser.
Comments:
• Transient files that support graphics and interactive sessions are primary objects.
• NAS storage is OK.

File Type: Permanent SAS libraries (catalogs), without SAS/SHARE
General Info: No default. Must specify the location. Might want to set up one or more library areas.
Comments:
• Contains data sets and catalogs that have valid two-level SAS names.
• Access is often “write-once, read-many.”
• Should be on a separate I/O path from /saswork, if possible.
• Consider RAID S.

File Type: Permanent SAS libraries (catalogs), with SAS/SHARE
General Info: Files are opened through a proxy (to allow shared read/write access).
Comments:
• Contains data sets and catalogs that have valid two-level (qualified) names.
• I/O access is typically NOT large sequential blocks, because shared requests are interleaved.
• Should be on a separate I/O path from /saswork.
• If the primary pattern is that files are written one time (that is, at creation), consider RAID S.
• If there are frequent (that is, transactional) updates, consider RAID 0/1.

File Type: MDDB files
General Info: An aggregate data set. Usually created in a permanent library (for re-use). Defaults to the default library directory, which is /saswork.
Comments:
• Consider RAID S.

File Type: SAS log and list files
General Info: Defaults to the directory from which the user invokes SAS.
Comments:
• These are NOT recovery logs.
• User can control the amount of information written by setting options, such as SOURCE, NOTES, SOURCE2, MPRINT, and other debugging and informational options.
• Consider RAID S.

File Type: SAS executable/installed files
General Info: Defaults to c:\program files for Windows. No default for UNIX or OS/390.
Comments:
• Program files can be reloaded from disk.
• Back up SAS configuration files separately as part of system administration and setup.
• Follow OS guidelines for executable files when establishing file system characteristics.

File Type: OS swap file
General Info: Default swap space.
Comments:
• Keep physically separate from /saswork and SAS executables whenever possible.
• Ensure sufficient memory is allocated to swap space.
• RAID options not recommended.
Conclusion
Today’s business environments create an increasing demand for a reliable, accessible data campus. SAS solutions help you harness your data to create a foundation for effective business decisions, and EMC Symmetrix information storage systems make the data available when you need it. EMC and SAS: a synergy working for you!
The Companies
SAS
SAS is the leader in decision support and data warehousing, providing integrated enterprise information-delivery solutions and e-business solutions. The company markets packaged business solutions for vertical industry and departmental applications, as well as an integrated suite of software tools and consulting services. These allow companies to transform the wide variety of data within their organizations into information that business users and researchers need to make better decisions. See www.sas.com for more information.
EMC
EMC Corporation (NYSE: EMC) is the world leader in information storage systems, software, networks and services, providing the information infrastructure for a connected world. See www.EMC.com for more information about EMC’s products and services.
Glossary
BCV (Business Continuance Volume): Business Continuance Volumes are copies of active production volumes that you can use to run simultaneous tasks in parallel with one another. BCV gives you the ability to do concurrent operations, such as data warehouse loads and refreshes or point-in-time backups, without affecting production systems.
Business Continuance: The technique of ensuring that a business is able to withstand a natural or man-made catastrophe through the deployment of fault-tolerant and redundant hardware and software systems.
Customer Relationship Management (CRM): An enterprise-wide strategy enabling organizations to optimize customer satisfaction, revenue, and profits while increasing shareholder value through better understanding of customers’ needs.
Enterprise Storage Networking (ESN): The combination of EMC hardware, software and services offering choices of open connectivity standards (including traditional channel, SAN and NAS connectivity) under a single information management framework.
InfoMover File System (IFS): Enables UNIX operating systems to read OS/390 files directly, eliminating the need to move data and saving valuable file transfer time.
Redundant Array of Inexpensive Disks (RAID), software: Software RAID uses the server processor to perform RAID calculations. Host CPU cycles that read and write data from and to disk are taken away from applications. Software RAID is less costly than dedicated hardware RAID storage processors, but its data protection is less efficient and reliable.
SAS/Direct for EMC Symmetrix: EMC and SAS teamed to create software that allows SAS users of UNIX to directly access OS/390-based SAS data sets stored on Symmetrix storage systems. SAS/Direct reduces the cycle demand on the mainframe, reduces potential file transfer traffic on the network, and reduces the potential for referencing “out of synch” data. Contact EMC or SAS for more information about SAS/Direct.
Sharing Arrays across Multiple Operating Environments: EMC Symmetrix information storage systems can be shared between operating environments. When multiple architectures are involved, you might need ESP software from EMC.
Symmetrix Remote Data Facility (SRDF): An enhanced version of real-time mirroring between multiple Symmetrix systems that can include remote and multiple sites.
EMC Corporation 35 Parkwood Drive Hopkinton, MA 01748-9103 USA Tel: (508) 435 1000 www.EMC.com
41468US.0301