Evaluating Storage Performance: Activities to Simulate Storage Configurations

Adam H. Villa
Providence College, Providence, RI 02918, USA

Abstract - Storage systems are important topics of study for any computer organization or operating system course. Due to their complex nature, storage concepts are typically discussed theoretically, without experimentation or hands-on activities. This paper details lab activities that can easily be used to reinforce key storage concepts and allow students to examine the impact that storage settings can have on performance metrics. These activities enable students to utilize complex storage simulation software using preconfigured settings. Students can examine the impact of scheduling algorithms, storage hardware enhancements, and RAID arrays. Successful completion of these activities also prepares students to conduct research projects on storage performance evaluation.

Keywords: Storage Simulation, Computer Science Education, Lab Activities, Performance Evaluation

1 Introduction

A key component of computer architecture and operating system courses is the study of storage components and the storage subsystem. Students learn these concepts through lectures, readings, and examples. Due to their complexity, students may find it difficult to closely examine how the configuration of these devices can impact system performance. Students are even less likely to be able to work hands-on with advanced storage configurations, such as RAID arrays. A simulation of storage devices allows students to witness firsthand how configuration choices can affect the performance of the storage system and the entire computer. Simulation environments also allow them to construct many different types of arrays with a whole host of configurations.

Over the years, many different storage simulation tools and applications have been created. Most notable for its accuracy and widespread adoption, both in academia and in research, is DiskSim, a well-known, extensive storage system simulator. DiskSim was developed at the University of Michigan and enhanced at Carnegie Mellon University's Parallel Data Lab [1]. The developers describe DiskSim as "an efficient, accurate, highly-configurable disk system simulator that has been validated both as part of a more comprehensive system-level model and as a standalone subsystem". It has been updated several times, and other teams have contributed to its development. Most notably, researchers from Microsoft [2] and Western Digital [3] have developed additional functionality for the simulator.

Other storage simulation tools are also available, including the University of Rochester's instructional disk simulation tool, Vesper [4]. This tool provides students with the opportunity to run a simplified version of storage simulations. The authors state that Vesper "retains simplicity while providing timing statistics close to that of real disk drives". They also provide possible assignments for their system. Storage vendors have likewise introduced simulators to aid in the education of advanced storage systems. The EMC Corporation implemented the Academic Alliance Program [5], which allows students and educators to access courses and course material pertaining to storage infrastructure [6].
The educational concepts covered by these resources are applicable to higher-level elective courses that cover advanced storage system infrastructures [7]. ACM Inroads documented the usage of EMC's Academic Alliance Program in 2013 and illustrated its member benefits and possible course offerings [8]. Other courses and technologies have built upon EMC's resources [9]. In [10], the educators develop tools to explore network storage technologies.

After reviewing the variety of storage simulators available, DiskSim was selected as the ideal candidate for building the customizable lab activities presented in this paper. DiskSim is extremely robust and offers users a plethora of configuration options and a tremendous amount of simulation output. However, creating the configuration files required by the simulator is not trivial and is beyond the skill set of most lower-level undergraduate students; the hurdle of compiling the simulator and developing setup files is often too great for introductory students. This project therefore provides students with preconfigured files so that they can immediately run both simple and complex simulations. The lab activities described in this paper are suitable for any undergraduate student studying storage system components.

2 System Environment

The lab activities and the simulation software for emulating the storage subsystem run on a Linux (Ubuntu 32-bit) operating system. The entire operating system, including the pre-configured DiskSim software and the activity configuration files, is distributed as a virtual machine image in the Open Virtualization Format (OVF). The image can be downloaded via this hyperlink: http://bit.ly/disksimPC. The username for the root user is "providence" and the root user's password is "harkins". The OVF file can be imported into any virtualization software; VirtualBox works well with the file and has no license fee.

After booting the virtual machine and logging into the system, the student will find a directory called "activities" within the root user's home directory. Each of the lab activities has a separate sub-directory that contains the required parameter and disk setup files for that particular experiment. DiskSim requires two user-configurable setup files: a parameter file (.parv) and a disk specification file (.diskspec). The remaining required files are generally not modified by a standard user.

The parameter file contains the settings that control the simulated workloads, called synthetic generators. It also includes the settings that organize disks into storage arrays. The user has complete control over the disk, the I/O driver, and the I/O subsystem in this parameter file, and can also select all of the output values that should be generated by the simulation. This file is very complex and should therefore be left largely intact by novice users. During different steps of the lab activities, the students are asked to make slight modifications to specific lines of the parameter file; only these settings should be modified. Once users are comfortable with the parameter file, they can be encouraged to make additional changes for experimentation, using the DiskSim user manual for reference.
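To give a sense of what students encounter when they open a parameter file, the sketch below shows the overall shape of a .parv file, loosely modeled on the sample files that ship with DiskSim 4.0. The block names follow DiskSim's conventions, but the values are placeholders and the elided sections (marked "...") hide many required parameters, so this is an illustration rather than a verbatim excerpt from the activity files.

    disksim_global Global {
       Init Seed = 42,
       Real Seed = 42,
       Stat definition file = statdefs
    }

    disksim_stats Stats { ... }

    source cheetah9LP.diskspecs

    disksim_synthio Synthio {
       Number of I/O requests to generate = 20000,
       Maximum time of trace generated = 1000.0,
       Generators = [ disksim_synthgen { ... } ]
    }

For the lab activities, students only touch a handful of lines inside blocks like these; everything else arrives pre-configured.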
The disk specification file contains the settings that control the configuration and operation of the disk drives used in the simulation. There are a variety of configurable options, most notably the scheduling policy for the request queue at the disk. Settings that control caching and prefetching can also be found in this file. Again, novice users are advised to make only limited modifications to this file until they are experienced with the simulation system. If changes are required, all disk configuration settings are well documented in the DiskSim user manual.

The lab activities have the student use a simple text editor (nano) to edit the setup files and create intermediate result files. The students also use the standard grep utility to search the output generated by the simulator for specific performance results. To aid students in searching for results, a simple script is also provided. This script automatically searches the output file for specific values, including response time, disk service time, seek distance, and queue lengths.
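A minimal sketch of what such a helper script might look like is shown below, assuming a POSIX shell. The quoted metric labels are illustrative; the strings in the actual DiskSim output, and the script shipped in the virtual machine image, may differ.

    #!/bin/sh
    # analyze (sketch): pull commonly used metrics out of a DiskSim output file.
    # Usage: ./analyze output.txt
    for metric in "Response time average" \
                  "Response time maximum" \
                  "Seek distance average" \
                  "Queue time average"
    do
        # -i: the capitalization of labels varies across output sections
        grep -i "$metric" "$1"
    done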
The hard disk drive used in all of the lab activities is the Seagate Cheetah 9LP (ST39102LW); the following specifications are given in the technical documentation of the drive [11]. This drive is part of the DiskSim disk specifications provided with the installation files. The Cheetah 9LP has 9 GB of formatted storage and a rotational speed of 10,025 RPM, which yields an average rotational latency of 2.99 msec. The average read time is 5.4 msec and the average write time is 6.2 msec. There are 6,962 cylinders with a total of 12 read/write heads. The disk also includes a 1,024 kbyte multi-segmented cache. DiskSim has been validated for this drive and produces performance metrics nearly identical to those of the real drive.

3 Activities

The lab activities presented in this paper take the students on a journey of exploration through storage systems. Students begin by simulating a single hard disk drive in activities 1 through 4. They then explore RAID storage arrays in activities 5 and 6. Each activity has a synthetic workload that is customizable by the student, and a sub-directory that contains all necessary files. The DiskSim application is accessible from any directory. In addition to the DiskSim files, a script called "analyze" is provided for each activity; it automatically extracts key performance metrics from the simulation output file. Different activities require the student to examine different metrics, so the script is customized for each activity. Once students are comfortable with the simulation and analysis process, they can examine the complete output file and even make changes to the script.

3.1 Activity #1 - Single Request - Sequential Workload

This introductory activity allows the students to get accustomed to the simulation environment. In this experiment, the storage system is configured as a single disk, with the requests sequentially ordered. The simulation is organized as a closed system in which the disk services only one request at a time: no requests wait in the queue, and as soon as one request is serviced, another is immediately submitted to the storage system.

In the DiskSim simulation environment, the Synthetic Generators portion of the .parv file controls the workload being generated for the storage system. Figure 1 illustrates the configuration for this activity. The settings "Probability of Sequential Access", "Probability of Read Access", and "Probability of Time-Critical Request" are all set to 100%. The "Probability of Time-Critical Request" setting ensures that the system is closed, and since there is only one synthetic generator, there will only ever be one request being processed by the disk at any given time.

[Figure 1: Synthetic workload settings from the .parv parameter file for Activity #1.]

To begin the simulation, the student navigates to the Activity 1 directory and executes this command: "disksim exp01.parv output.txt ascii cheetah9LP.trace 1". This initiates the simulation; when it completes, the results are stored in output.txt. With all of the output options selected in the configuration files, the output is over 8,000 lines long. To help the students find values of interest, they are directed to use grep for faster searching. A script is also provided to them as a starting point to automatically gather a set of commonly desired disk performance metrics. The metrics gathered by the script are shown in Figure 2.

[Figure 2: Output metrics for Activity #1.]

The simulation involved 20,000 disk requests with a request size of 100 blocks. (These settings are also shown in Figure 1.) The Disk Completely Idle time was 9.55 ms (almost 0% of the simulation), which indicates that the disk was constantly in use. The average response time was 15.19 ms and the maximum was 17.55 ms. (The output file provides standard deviation and distribution values for all of these metrics as well.) Most importantly, the percentage of disk seeks of zero distance is 99.975%, which clearly shows that all of the requests were sequential. This has a significant impact on the Disk Seek Time average, which is essentially 0 ms; as a result, the positioning time is comprised solely of the rotational latency. This simple experiment provides the students with baseline values to compare against in future activities. It also gives them practice with using the system and reading the output performance metrics.

The next three activities allow the students to experiment with different types of synthetic workloads and different disk configurations for a single hard disk drive.
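Figure 1 itself is not reproduced here, but based on the synthetic generator parameters documented in the DiskSim 4.0 manual, the generator block for this activity presumably resembles the following sketch. The probability lines match the settings described above; the device name and the elided parameters (request sizes, inter-arrival distributions, and so on) are placeholders.

    disksim_synthgen {
       devices = [ disk0 ],
       Probability of sequential access = 1.0,
       Probability of local access = 0.0,
       Probability of read access = 1.0,
       Probability of time-critical request = 1.0,
       Probability of time-limited request = 0.0,
       ...
    }

Activity 2 changes only the first of these probabilities to 0.0, which is what makes the before/after comparison between the two activities so clean.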
3.2 Activity #2 - Single Request - Random Workload

The configuration for Activity 2 is a slightly altered version of the first activity. The key difference in this setup is that the requests are no longer sequential: the Probability of Sequential Access setting is set to 0%, which forces all of the requests to be random. The remaining settings are unchanged from the first activity, so there will still only ever be one request present in the system at any given time.

The students are instructed to run the simulation, gather performance metrics using the analyze script, and compare the results to the previous activity (a possible workflow is sketched at the end of this subsection). The most remarkable differences they should notice are in the metrics related to disk positioning, including the seek time and seek distances. Since the requests are now for random locations, the read/write heads of the disk travel a great distance for each request. The students can clearly identify this fact if they are instructed to examine the Average Seek Distance metric, which is 2,263 blocks. The addition of this seek time to each request impacts the response time as well: the average response time is now 19.4 ms and the maximum is 28.8 ms. The students should also be guided to examine the complete output.txt file and focus on the Disk Seek distance metrics, where they can analyze the distribution of seek distances across all 20,000 requests.
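The complete edit-run-analyze loop for this activity might look like the following command transcript, assuming the Activity 2 directory mirrors the Activity 1 layout (the file name exp02.parv and the analyze invocation are assumptions, not names taken from the activity files).

    # Hypothetical workflow for Activity 2
    cd ~/activities/activity2
    nano exp02.parv       # set "Probability of sequential access" to 0.0
    disksim exp02.parv output.txt ascii cheetah9LP.trace 1
    ./analyze output.txt  # compare against the Activity 1 baseline
    grep -i "seek distance" output.txt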
3.3 Activity #3 - Multiple Random Requests - Disk Scheduling Policy

The third activity introduces the concept of multiple concurrent requests. In this experiment, the simulation is modelled as a closed system with 10 synthetic generators. This setup ensures that there will be exactly 10 requests present in the system at any given time. To create this setup, the synthetic generator section of the parameter file is populated with 10 separate generator blocks, similar to this: Generators = [ disksim_synthgen { ... }, disksim_synthgen { ... }, ... disksim_synthgen { ... } ]. Each of the synthetic generators has the same configuration as Activity 2, which means that each "user" in the system will be asking for random blocks from the disk.

Since there are now multiple requests present at the disk, there will be a queue in the disk's storage controller. Once the queue is enabled, a queue service policy must be used. The default service policy is First Come First Serve (FCFS). The students are instructed to first run the experiment exactly as configured. After running the analyze script and obtaining the desired performance metrics, the students are then directed to change the service policy of the queue by modifying a setting (Scheduling Policy) in the cheetah9LP.diskspecs file. They are given a list of possible scheduling policy values: 2 (Elevator LBN), 3 (Cycle LBN), and 4 (SSTF LBN). (The DiskSim manual lists 27 different scheduling policies implemented in the system.) The students are asked to run the simulation with each of the other three policies and compare their output metrics; one way to automate these runs is sketched at the end of this subsection.

The screenshots in Figure 3 illustrate the performance metric differences between the four scheduling policies. The students should compare and contrast these key metrics: response time (avg/max), access time (avg/max), seek distance (avg/max), and queue time (avg/max). They will be able to observe how the scheduling policy impacts the performance metrics. Interested students can run additional experiments using the other scheduling policies listed in the manual; this is a great research project for students to undertake outside of class.
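The sed substitution below, the parameter file name exp03.parv, and the assumption that policy 1 denotes the default FCFS policy are all illustrative; students should check the actual "Scheduling policy" line in cheetah9LP.diskspecs before scripting against it.

    #!/bin/sh
    # Sketch: run Activity 3 once per scheduling policy and extract
    # the average response time from each run's output.
    # 1 = FCFS (assumed default), 2 = Elevator LBN, 3 = Cycle LBN, 4 = SSTF LBN
    for p in 1 2 3 4
    do
        sed -i "s/Scheduling policy = .*/Scheduling policy = $p,/" cheetah9LP.diskspecs
        disksim exp03.parv out_policy$p.txt ascii cheetah9LP.trace 1
        grep -i "response time average" out_policy$p.txt
    done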
3.4 Activity #4 - Caching and Pre-Fetching

This activity allows the students to analyze the impact of caching and pre-fetching at the storage controller. The hard disk has a limited amount of fast storage that it can use for servicing read and write requests; accessing data in the cache saves a significant amount of time over accessing data on the disk. Prefetching allows the disk to read additional blocks beyond the requested blocks; ideally, the prefetched data is placed in the cache to service a future request.

For this experiment, the .diskspecs file has already been modified for the student. The configuration creates 11 segments in the cache, each with a size of 561 blocks. Caching and prefetching are disabled to start, as shown in the box on the left in Figure 4. The parameter file for the experiment is configured similarly to Activity 3: there are 10 generators issuing requests in a closed system, and each generator has only one outstanding request. All of the generators are initially configured to request sequential block addresses, which essentially creates 10 sequential workloads. Requests at the disk queue are serviced using the FCFS scheduling policy.

After running the experiment, the students are directed to examine the performance metrics gathered by the analyze script. The students are then directed to turn on caching and prefetching by changing the settings shown in the right box of Figure 4. They then re-run the experiment and gather performance metrics. Figure 5 shows the output gathered by the students for both experiments in this activity.

The students should be directed to compare the disk access times for the two experiments. They should identify that the disk access time decreased due to the use of caching and prefetching, and that the overall response time decreased as well. The Disk Buffer Hit Ratio went from 0% to 100% when caching and prefetching were enabled. Students could then investigate the advanced caching and prefetching features supported by DiskSim using the software reference manual; this self-directed study would be beneficial to the students.
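In terms of the disk specification file, the cache settings being toggled presumably resemble the sketch below, which uses parameter names from the DiskSim disk specifications; the exact lines and values students change are the ones shown in Figure 4 and may differ from this illustration.

    Number of buffer segments = 11,
    Segment size (in blks) = 561,
    Enable caching in buffer = 0,
    Minimum read-ahead (blks) = 0,
    Maximum read-ahead (blks) = 0,

Turning the experiment on then amounts to setting the caching flag to 1 and raising the read-ahead limits so that the controller may prefetch blocks beyond each request.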
3.5 Activity #5 - RAID-5

The fifth activity in the sequence introduces the simulation of a RAID array. Prior to this activity and the subsequent one, students would ideally have studied RAID arrays and their configurations in readings or lectures; this allows them to better interpret the performance metrics generated by the experiments. The simulator is configured to let the students examine the performance metrics of the entire storage array as well as of each individual disk in the array.

The configuration implemented in this activity is designed to mimic a RAID-5 setup containing 10 identical hard disks. The disk drive used in the array is the same Cheetah 9LP disk from the previous experiments. The array controller allows outstanding requests to be queued and services them using the CYCLE scheduling policy. No requests are queued at the individual disks in this activity.

The synthetic workload for the experiment is configured as a single open-system workload: one synthetic generator issuing read requests for random addresses. This is the first time the students are exposed to an open system, in which a varying number of requests may be present at any given time. Requests are submitted to the system with inter-arrival times drawn from a uniform distribution with a value of 9.55 ms. (DiskSim supports a variety of probability distributions, which are documented in the user guide.)

The students run the experiment exactly as initially configured. They are then directed to modify the inter-arrival times to 7.55 ms, which produces a heavier workload for the storage system. Finally, the students are asked to re-run both experiments (9.55 and 7.55) with all write requests instead of read requests. A modified version of analyze is given to the students, which includes new metrics applicable to the RAID storage system. Students are also encouraged to explore the output.txt file to see all of the outputted metrics. Figures 6 and 7 illustrate the output metrics obtained for these experiments.

The students should be able to identify differences in the performance metrics across all four experiments. They can first examine the number of requests handled by the entire array (Overall I/O System total requests handled) and by each of the disks in the array (IOdriver #0 device #x Total Requests handled). For the read workload, each disk handled 25% of the number of requests submitted to the entire array; the students should be asked to explain why this is happening. They should also examine the same metrics for the write workload, where each disk serviced 67.5% of the number of requests submitted to the entire array, and investigate this difference as well.

Another concept to point out to the students in this activity is that the inter-arrival time has an impact on the queue length at the RAID array controller. As more requests join the system, the queue length grows. This is extremely noticeable for an all-write workload, which takes longer to service requests.

A supplemental activity for the students to explore on their own is the study of random vs. sequential requests. The students can re-run the set of experiments using a closed system of sequential generators, similar to the setup in Activity 4. This requires the students to modify the .parv file. They can then compare their results to the same workload in the next activity.

3.6 Activity #6 - RAID-1

The sixth activity continues the exploration of RAID arrays. The configuration of the array is identical to the previous experiment, except that the 10 disks use RAID-1 (mirroring): all 10 disks contain identical copies of the same data set. The configuration of the synthetic generator is the same as in the previous activity. The students are asked to re-run the exact same configurations: inter-arrival times of 9.55 and 7.55 ms, with 100% reads and with 100% writes. The analyze script remains the same as in the previous activity.

The students are asked to examine the output of the four experiments for the RAID-1 array. They should notice the stark difference in performance between reads and writes, due to the nature of mirroring. They should also notice the minuscule difference in performance between the two all-read experiments. The students are then tasked with comparing the RAID-5 and RAID-1 experiments, focusing their evaluation on the number of requests and the response times.

If the students undertook the advanced portion of Activity 5, they can also edit the .parv file for this configuration to simulate 10 sequential streams in a closed system. After re-running all four experiments with this setup, they can compare the performance metrics for both the RAID-1 and RAID-5 arrays.
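Both arrays are expressed in DiskSim as logical organizations (logorg blocks) in the .parv file. The sketch below shows roughly what a 10-disk RAID-5 organization looks like, using parameter names from the DiskSim manual; the stripe unit and other values are illustrative rather than copied from the activity files.

    disksim_logorg org0 {
       Addressing mode = Array,
       Distribution scheme = Striped,
       Redundancy scheme = Parity_rotated,
       devices = [ disk0, disk1, disk2, disk3, disk4,
                   disk5, disk6, disk7, disk8, disk9 ],
       Stripe unit = 64,
       ...
    }

Under the same assumptions, the Activity 6 array would change "Redundancy scheme" to Shadowed, DiskSim's mirrored organization, so that all 10 disks hold identical copies of the data.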
4 Summary and Future Work

The lab activities described in this paper provide hands-on experiments with storage systems and storage arrays. These activities are designed to supplement a student's examination of hard disk drives and RAID arrays. Utilizing a full-fledged, validated storage simulator also gives the students experience in conducting experiments and reinforces core concepts of performance evaluation.

The primary outcome of this paper is the development of storage lab activities and their pre-configured simulation software. The goal of the project was to create an easy-to-use system that can be quickly installed and utilized by undergraduate students. By making the process simple and straightforward, the students are able to concentrate on the output of the simulations instead of on creating the required parameter and specification files.

Once students have a firm grasp on how the simulator works and how to properly configure a simulation, they can run any number of additional experiments. For an advanced activity or research project, students could use the .parv files from Activity 5 and Activity 6 to re-create any RAID configuration. The DiskSim user guide will be extremely helpful here: students can focus on specific parameters of interest to see how those parameters impact various performance metrics. Advanced students can even modify the actual DiskSim code to create additional functionality. DiskSim is written in C, and all source code is provided in the root user's home directory. Many research studies have successfully augmented the DiskSim application, so this very advanced project could be used for independent research or even an honors or master's thesis.

Surveys of students over the past five semesters indicate that these activities are critical for successful acquisition of storage system concepts. Students found them to be fun and interesting, and several have expressed interest in continuing to work with the storage simulator for independent research projects. The activities also offered students the chance to improve their Linux skills.

For future work, additional disk models and solid-state drive simulations will be added to the lab activities to provide a more comprehensive view of storage system components. To assist both traditional and distance-learning students, an online version of the activities is currently under development. This web application will allow students to run simulations using a web browser on their computer or mobile device.

5 References

[1] J. Bucy, J. Schindler and S. Schlosser, "The DiskSim simulation environment (Technical Report CMU-PDL-08-101)," Carnegie Mellon University, 2008.

[2] N. Agrawal, V. Prabhakaran, T. Wobber, J. Davis, M. Manasse and R. Panigrahy, "Design Tradeoffs for SSD Performance," in USENIX Annual Technical Conference, 2008.

[3] Western Digital, "DiskSim 4.0 (64-bit version)," [Online]. Available: https://github.com/westerndigitalcorporation/DiskSim. [Accessed March 2017].

[4] P. DeRosa, K. Shen, C. Stewart and J. Pearson, "Realism and Simplicity: Disk Simulation for Instructional OS Performance Evaluation," in SIGCSE, Houston, TX, 2006.

[5] "EMC Academic Alliance," March 2017. [Online]. Available: https://education.emc.com/academicalliance.

[6] E. Van Sickle, E. Mallach, B. Cameron, D. Dunn, D. Rook, F. Groom and R. Rollins, "Storage Technologies: An Education Opportunity," in Proceedings of the 8th ACM SIGITE Conference on Information Technology Education, Destin, FL, 2007.

[7] E. Van Sickle, E. Mallach, B. Cameron, D. Dunn, D. Rook, F. Groom and R. Rollins, "Storage Technologies: An Education Opportunity," in ACM SIGITE Conference on Information Technology Education, Destin, FL, 2007.

[8] K. Yohannan, "Transforming Students' Views of Data with EMC Academic Alliance," ACM Inroads, pp. 48-50, December 2013.

[9] V. Jovanovic, T. Mirzoev, L. Toderick, R. Homkes and M. Stockman, "Engaging Students in Information Storage Management Courses," in ACM SIGITE Conference on Information Technology Education, New York, NY, 2010.

[10] V. Jovanovic and T. Mirzoev, "Teaching Network Storage Technology - Assessment Outcomes and Directions," in ACM SIGITE Conference on Information Technology Education, New York, NY, 2008.

[11] Seagate Technology LLC, "Disk Specifications for Seagate Cheetah 9LP," 29 March 2017. [Online]. Available: http://www.seagate.com/staticfiles/support/disc/iguides/scsi/29230b.pdf.