Transcript
Indian Journal of Science and Technology, Vol 9(26), DOI: 10.17485/ijst/2016/v9i26/97260, July 2016
ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645
Performance Evaluation and Scalability of IP-based and Heuristic-based Job Scheduling Algorithm Backup Systems Mahmoudreza Tahmassebpour School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran;
[email protected]
Abstract Background/Objectives: One of the most important factors in the design of enterprise computer networks is data storage method. Increase the volume, recovery and security of stored information is the biggest challenge. Using backup systems is a perfect solution to solve these challenges. Traditional methods can be used to perform backups, but new methods and new backup systems are more efficient. Methods/Statistical Analysis: To analyze and compare overall backup session time, backup session processing time, overhead, fragmentation of large data and efficiency of the IP-based schedule and the heuristic-based job scheduling algorithm. Findings: My analysis show that the simple heuristic-based job scheduling algorithm with adaptive number of active DAs is close to optimal while having no additional computing overhead compared to the compute expensive IP-based solution. The simulator-based on heuristic-based job scheduling algorithm enables the analysis to minimize the number of actively used backup servers while meeting backup window constraints and data restore satisfying rates. Applications/Improvements: This algorithm is close to the results that reduce backup time by 60 percent without additional computing overhead and fragmentation of large data processing compared to IPbased methods. As well as, the algorithm can simulate the analysis of a backup system will help. The proposed approach significantly increases the performance, reliability and quality of implemented solutions.
Keywords: Algorithm, Backup, Performance, Scalability, Schedulin
1. Introduction
Business centers today, given the importance of information and data, cannot afford a risk of data loss. These activities and keeping documents to digital content, new requirements for data protection than traditional methods of data retention have created. As a result, information technology plays an important role in the design and implementation of maintenance and protection of data and information. Now the lack of data protection due to double-digit growth rate of the data is more important1. Existing processes and systems needed to handle the growing volume of data to be optimized automatically. Perform backup and recovery process more efficient and reliable, an important issue that has remained in storage systems. It is estimated that between 60 to 70 percent of the business of managing storage systems for backup and recovery2. * Author for correspondence
Symantec Backup Exec is a backup system for enterprise-wide environment. The goal of server backup and recovery operations data, covering various degrees of failure to reassure the business. Typically, backup time and power requirements of the various tasks do not expect that it tends to increase the backup time. To overcome this problem, each backup task and can job with two systems of duration and throughput. The evaluation of the collected history data in the backup jobs done in a number of previous backup. My aim is to reduce the time to perform backups with the design of a suitable system. In earlier work3, a formula on IP-based of the subject is prepared and available IP-addresses, used to optimize planning. Job of available IP-addresses, optimal scheduling backup sets, so that up to 60% of backup time reduction. However, the solution time for integer programming models is notoriously difficult to predict. I see that IP-addresses available for all tests are double
Performance Evaluation and Scalability of IP-based and Heuristic-based Job Scheduling Algorithm Backup Systems
runtime has the right solution found quickly or how many hours it took to make a good result. As an alternative solution to the classic optimization problem, I propose a heuristic-based job scheduling algorithm4. With this algorithm, the longest backup jobs are scheduled first. In addition, by using an adaptive number of concurrent disk agents over time, I dynamically vary the number of concurrent objects assigned for processing per tape drive. This allows optimizing the tape drive throughput and achieving a significantly lower backup time. In this paper, evaluated the performance and scalability of IP-based and heuristic-based job scheduling algorithm for backup systems. To assess the workload collected from four backup servers is used in the information technology department of Tejarat bank lab. It is interesting results: Performance the heuristicbased job scheduling algorithm is in conditions close to optimal, while performance IP-based solution without calculating the additional time required normally varies between 10 minutes to several hours that it is in poor conditions. Another issue is the limited scalability of IP-based solution. The volume collected from the backup server the heuristic-based job scheduling algorithm, up to 25% better than the scheduled time IP-based acts. In other words, IP-based solution is not good scalability.
2. Motivation for Designing New Backup Schedules Symantec Backup Exec system offers backup and restore functionality specifically for distributed environments and provides enterprise-wide. The system can be used in various environments from a single system to thousands of client machines used on multiple sites. It also supports backup in heterogeneous environments with clients running on the UNIX or Windows platforms. The ability of a backup tool to backup and session topics is supported at the time of the meeting. The traditional process of a backup system using a tape library in Figure 1 for review. The system is called Disk Agents (DAs), are associated with tape drives. Disk backup agent responsible for the object per unit time. Each disk agent is responsible for backing up a single object at a time. Each tape drive has a configuration parameter that can be a concurrency level, for example, the number of simultaneous disk agents, which can backup of different several objects in parallel 2
Vol 9 (26) | July 2016 | www.indjst.org
to the tape drive. Typically, a single data stream cannot be completely backup tape drive capacity and bandwidth due to client machines use low speed (throughput of each client system 10 to 20 MB per second). An administrator can configure a large number of DAs in each tape drive for simultaneous backup of multiple objects at the same time. The disadvantage of this system is the flow of data from various objects on the tape when the data needs to be modified for a particular object, using the correct time later to retrieve data compared with a continuous stream by a disk agent.
Figure 1. The traditional process of a backup system using a tape library.
Normally, there are a large variety of clients and servers as well as data storage exist in the organization. This diversity has an impact on the duration of the backup and throughput, which often leads to differences in the criteria range, can be seen. However, traditionally, no information on the expected duration and throughput requirements of different backup clients is provided. The data protection means only the name, path and type of the client machine. The lack of information about the size and speed of the backup process may be inefficient backup process. In this work, I revisit a traditional backup scheduling and configuration issues in order to automate a design of a tailored backup job schedule by using available historical information.
3. IP-based Schedule Solution In this section, I provide a formula for multiple resource
Indian Journal of Science and Technology
Mahmoudreza Tahmassebpour
planning; scheduling issue with the aim to minimize the total length of time in a backup job is limited. The challenge is a compact formula that IP-addresses available with traditional for example CPLEX5 nin doing proper computation time is important. The following is a very brief and appropriate IP-based formula for this problem is available. The following basic notations: • m - defines for the backup tool configuration in the number of tape drives; • n - defines for the backup set in the number of jobs; • mnsDA - the most number of synchronous disk agents configured per tape drive; • sumT - the sum throughput of the tape drive. Each job Ji, 1 ≤ i ≤ n in a granted backup set is denoted by a pair of properties (Di, Ti): • Ti is the throughput of job Ji and Di is the duration of job Ji. Always, every tape drive can be a mns DA thing to do in parallel, but in general these things cannot process a most capacity of sum T tape drive. It is proper function of processing time span S to a minimum, for instance, the time of completion of S to set backup jobs to a minimum. Inform span S cannot be smaller than the longest backup job in the set (defined as d1): (1) •
Inform span S cannot be smaller than the shortest feasible time that would be required for processing the all set of deferred backup jobs at sum throughput sum T by m tape drives (defined as d2):
•
(5) Mlow is actually a lower limit on inform span S since it cannot be smaller than d1, d2 or d3. But, Mup is a feasible estimate of the upper limit on informs span S, and my current estimate might be incorrect. The optimal out does not appertain to Mup in a direct: as long as S
M low optimize the number of incongruences:
•
(10)
Binding Rijt to binary variables Pij and Qit: (11)
•
Requirements of non-negativity:
(12)
One traditional solution to integer programming, for example CPLEX5, which can be used to find a practical solution. When a scheduled task is optimized by solver, Backup Exec perform tasks in planning is recommended. I created this app as I used a bin-packing plan. My goal is to compare the efficiency of the planning system is IPbased applications bin-packing and the heuristic-based job scheduling algorithm describe in the next section.
5. Heuristic-based Job Scheduling Algorithm A backup tool with m is the number of tape drives: Tp1, …, Tpm. Under a traditional architecture, there is a configuration parameter K that defines a fixed number of simultaneous Disk Agents (DAs) that backup various objects in parallel to tape drives. In this work, check that the tape drive backup system with a variable number DAs by the following parameters are used at the same time indicate: • mnsDA - the bound on the most number of simultaneous disk agents which can be defined per tape; • sum T - the sum throughput of the tape drive. • I apperceive running counters per tape drive: • ActDAj- the number of busy disk agents of tape drive Tpj and initialized as ActDAj= 0; • TpATj- the sum throughput of currently defined jobs to tape drive Tpj and initialized as TpATj= 0. Every job Ji in the next backup session is showed by a multiple (Di,Ti.Oi): • Di defines the backup period of object Oi beholden from the last full backup; • Ti defines the throughput of object Oi computed as 4
Vol 9 (26) | July 2016 | www.indjst.org
•
the last l throughput; Oi is the object.
When the history information on all the ordered objects list sorted created OdOL category: (13) The heuristic-based job scheduling algorithm operates: • Ji = be the highest object in OdOL. The tape drive Tpj have an accessible disk agent: (14) • Tpj is between the tape drives with an accessible disk agent, and Tpj has the smallest sum throughput. • The job Ji is defined to Tpj if its transfer does not disaffirm the most sum throughput determined per tape drive, if this condition is true: (15) If the status keeps then object Oi is defined to Tpj, and the tape drive running counters are updated: (16) Otherwise, the job Ji cannot at this stage planning and allocation process is blocked to some things earlier than planned, completed and additional resources to be released. Directly, under the heuristic-based job scheduling algorithm, the longest work, and the first to be processed. The next object to assign to a tape drive with the largest capacity available to be considered. When work is planned Jk the tape drive Tpi, occupation sources reported are complete and counters run mode, the tape drive can be updated: (17)
5. Comparing Bin-Packing and Heuristic-based Job Scheduling Algorithm Workload for the Evaluation Study To compare performance of the IP-based schedule and heuristic-based job scheduling algorithm, I use four Indian Journal of Science and Technology
Mahmoudreza Tahmassebpour
data backup servers in Information Technology Tejarat bank lab, While Information Technology Tejarat bank lab represent the research organization, its computing infrastructure is a typical instance of a medium size enterprise environment. The distributed server and client machines, including Windows and Linux desktops. There is computing servers and clusters for different research projects, prototyping and advanced. In addition, there is a series of high-end servers with significant amount of stored data. There were 443 objects in the backup sets under study. First of all, there is considerable diversity in durations: some backups take only 1 minute whiles other object stake 10 to 17 hours. Second, there are backup jobs long. There are variations in throughput of 0.1 MB/s to 40 MB/s. The information technology of Tejarat bank lab backup servers have 4 tape drives with a most data transfer rate of 80 MB/s, and configured with 4 DAs simultaneously. There is with throughput up to 20 MB/s representative backup jobs. This explains why the information technology Tejarat bank lab configuration backups are using 4 concurrent DAs but, part of the backup job with much lower throughput done. So used a fixed number of four DAs of traditional backup systems that do not inform the best use of available resources is used in the tape drive. The seeing presents a perfect opportunity for optimizing with new IP-based schedule and heuristic-based job scheduling algorithm that take the job throughput into account.
5.1 Performance Comparison
For comparison study, I run a set of experiments where I process workloads (collected from each of the four backup servers under study) with heuristic-based job scheduling algorithm and IP-based schedules using a tape drive configured with the following parameters: • mnsDA = 10, for example the maximum is 10 simultaneous DAs can be used per tape drive; • sum T = 80 MB/s, for example the sum throughput of the defined simultaneous objects per tape drive that maximum is 80 MB/s. Table 1 shows the overall processing time backup session for four workloads (over three weeks) using different schedules: IP-based and heuristic-based job scheduling algorithm. While significant time savings are achieved across all the four backup servers under IPaddresses available optimal scheduling and heuristic-based
Vol 9 (26) | July 2016 | www.indjst.org
job scheduling algorithm compared to the traditional methods DP (not shown here, the savings are 20% to 60% of the session time reduction), the interesting outcome is that the proposed on-line heuristic-based job scheduling algorithm produces nearly optimal performance results compared to the formed off-line IP-based schedule. The overall duration of backup session for Server 1, Server 2 and Server 3 are either identical or less than 3% apart under the two methods has been considered. These results are consistent for three consecutive weeks used in the study, are shown in Table 1. The only exception is the result for Server 4. The backup sessions for week 1 and week 2 are 10% longer with heuristic-based job scheduling algorithm compared to the IP-based schedule. However, the heuristic-based job scheduling algorithm results for week 3 are again less than 3% away from the performance of the IP-based schedule. Table 1. The comparison of backup processing times under IP-based scheduling (A) with heuristic job scheduling algorithm (B) Backup Server Server 1 Server 2 Server 3 Server 4
Overall Backup Session Time (Min) Week 1 Week 2 Week 3 A B A B A B 1260 1282 1270 1286 1280 1318 688 688 679 702 691 691 840 847 850 857 855 862 665 732 695 771 695 712
Overall, these results speak strongly in favor of the proposed heuristic-based job scheduling algorithm. Note that heuristic-based job scheduling algorithm does not require any special pre-computations: it uses an ordered (by duration) job list. The main disadvantage of the IPbased solution is that it is compute and time expensive. It is variety to predict the compute time required for generating a good, close to optimal solution. While the optimal solutions for Server 2 is less than 1.5 minutes, the solution times for Server 1, Server 3, and Server 4 are for more than 2 hours.
5.2 Scalability Study
To evaluate the scalability of IP-based schedule and heuristic-based job scheduling algorithm, I have created the aggregate workload using workloads from all four backup servers. Workloads of each backup server, 70 to 130 objects are included. The aggregate workload has 443 objects. I have formed the IP-based application optimized
Indian Journal of Science and Technology
5
Performance Evaluation and Scalability of IP-based and Heuristic-based Job Scheduling Algorithm Backup Systems
for the aggregate workload using a backup server configured with 4 tape drives (each drive with the same parameters a considered above). Table 2 presents the overall backup session time for the aggregate workload processed with different timings offers. As I can see, the session time with heuristic-based job scheduling algorithm is up to 25% shorter than with IP-based schedule. When I used a workload of large-scale with 443 objects, the IP-based solution after 2 hours of compute time has produced the job schedule that is still far away from the optimal solution. The IP-based approach does not scale well for the large datasets.
Large-scale distributed systems such as Grid exploit job scheduling as the main component of resource management. Grid-related job scheduling aims to empirically evaluate a variety of scheduling heuristics. Here are a few policies used to improve system utilization and throughput: backfilling12, Adaptive scheduling13 and task grouping14. There are growing number of services and systems that provide efficient backups of file system over the Internet15.
Table 2. The comparison of backup processing time under IP-based scheduling (A) with heuristic job scheduling algorithm (B)
This paper revisited traditional backup job scheduling and discusses inefficiencies due to random job scheduling broadly used in the backup tools. The main reason why random job scheduling is typically used by backup tools is that, generally, information on waiting time and cannot provide backup tasks are required. It clearly shows that each backup job throughput and job duration two criteria are evaluated. These criteria are designed to automate a backup schedule to minimize the total completion time are a set of backup tasks. I compare the efficiency of the IP-based schedule and the heuristic-based job scheduling algorithm. My analysis shows that the simple heuristicbased job scheduling algorithm with adaptive number of active DAs is close to optimal while having no additional computing overhead compared to the compute expensive IP-based solution. The simulator-based on heuristicbased job scheduling algorithm enables the analysis to minimize the number of actively used backup servers while meeting backup window constraints and data restore satisfying rates. As a result, the proposed approach significantly increases the performance, reliability and quality of implemented solutions.
Workload
Sum
Backup Session Processing Time (Min) Week 1 Week 2 Week 3 A B A B A B 1620 1292 1540 1465 1520 1269
In summary, the proposed heuristic-based job scheduling algorithm is very efficient, products nearly optimal results, and performs well even for large datasets. Scheduling input tasks and allocation of processor they are always an important factor to optimize the performance of parallel and distributed systems. To minimize job time, for example, a suitable method for Scheduling the longest job as the first job program6,7 I choose. In6, has proved to be the longest for the first schedule creates optimal schedules with the right strategy. Another example in this regard shows that efficient heuristic may be a good alternative approach to the integer programming approach for building the efficient backup scheduling in practice. Traditionally, Magnetic tape for backup data used in enterprise environments. Unix utility known as dump, CPIO, and tar8 can be a full backup file as a single stream of data sent to tape. Different enterprise environments to implement backup policies defined, such as full or incremental backups, and how long these backups are kept8. Tools like AMANDA9 (built on the dump and tar) manages full and incremental backups of scheduling the process a network of computer and writes these backups to tape as a group. Existing commercial backup tools10,11 scheduling tool allows administrators to client devices using a specific timeline answer. However, the backup system using a random job schedule that can be processed to increased backup and the backup time is inefficient. 6
Vol 9 (26) | July 2016 | www.indjst.org
6. Conclusion
7. References 1. Eizadpanah E, Koroupi F. timing of resources in cloud computing by using multi-purpose particles congestion algorithm. Indian Journal of Science and Technology. 2015; 8(Suppl 8):474-83. 2. Four tips for optimizing continuous data protection, [Online]. Available: http://www.networkworld.com/article/2278838. 19/05/2008. 3. Cherkasova L, Zhang A, Li X. DP+IP=Design of efficient backup scheduling. Proceedings of 6th International Conference on Network and Service Management; Niagara Falls, ON. 2010. p. 118-25. 4. Cherkasova L, Lau R, Burose H, Kalambur SV, Kappler B,
Indian Journal of Science and Technology
Mahmoudreza Tahmassebpour
Veeranan K. Run-time performance optimization and job management in a data protection solution. Proceeding of 11th IFIP/IEEE Symposium on Integrated Management (IM); Dublin. 2011. p. 65-72. 5. CPLEX, [Online]. Available from: https://en.wikipedia.org/ wiki/CPLEX 6. Graham RL. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics.1969; 17(2):416-29. 7. Yoo SM, Youn HY. Largest job first scan all scheduling policy for 2D mesh-connected systems. Proceedings of 6th Symposium on the Frontiers of Massively Parallel Computation; Annapolis, MD.1996. p. 118-25. 8. Preston W. Backup and Recovery. USA: O’Reily; 2006. 9. The Automated Maryland Automatic Network Disk Archiver, [Online]. Available: http://www.amanda.org. 20/01/2016. 10. EMC Backup Advisor [Online]. Available from: http://
Vol 9 (26) | July 2016 | www.indjst.org
www.emc.com/products/detail/software/backup-advisor. htm 11. IBM Tivoli Continuous Data Protection for Files [Online]. Available from: http://www-01.ibm.com/support/docview. wss?uid=swg24031940 12. Lifka DA. The ANL/IBM SP scheduling system. Proceedings of Job Scheduling Strategies for Parallel Processing; Heidelberg: Springer Berlin; 1995. p. 295-303. 13. Heymann E, Senar MA, Luque E, Livny M. Adaptive scheduling for master-worker applications on the computational grid. Proceedings of 1st IEEE/ACM International Workshop on Grid Computing, LNCS 1971; 2000. p. 214-27. 14. Tan L, Tari Z. Dynamic task assignment in server farms: Better performance by task grouping. Proceedings of the ISCC; 2002. p. 175. 15. Vrable M, Savage S, Voelker G. Cumulus: Filesystem backup to the cloud. Proceedings of the FAST; 2009.
Indian Journal of Science and Technology
7