Technical white paper
Archiving Big Data using tNAS with QStar ASM and HP StoreEver tape libraries

Table of contents
Abstract
Big Data
Customer example
Goals
Challenges
Meeting the challenges
Lab testing
   Equipment
   Investigation process
   Data flow
   Results and metrics
Increasing performance
Extra features
Conclusions
Abstract
This paper investigates how the HP Tape-as-NAS (tNAS) solution, using QStar Archive Storage Manager (ASM) software and HP StoreEver tape libraries, works with Big Data workflows. This is demonstrated by replicating a real customer example in the lab. The focus is on how the tNAS solution copes with huge amounts of data, large and complex directory structures, obscure file names, and symbolic links. The goal is to determine how Big Data users can use this solution and take advantage of the cost efficiencies of tape storage with minimal impact on current processes. Based on our findings, we explore performance options and whether the solution can simplify the workflows involved.
Big Data
Big Data is cumbersome due to its large scale, making even simple operations extremely lengthy. It can take days or weeks to copy bulky data sets from one place to another. Simple search operations or data reads can take significant time. Advanced data management and Media Asset Management (MAM) solutions use metadata to accelerate the search and retrieval process; however, this can also take substantial time. Additionally, to archive Big Data, you need enormous amounts of storage and patience. Depending on your archive application, you may be tied to that application for future data retrieval, so considerable planning is needed because Big Data is not easy to move from one application to another. You will also need to ensure that you have adequate storage space for hosting the data once it's retrieved.
Customer example
This investigation used a real-world data set from a current HP customer who needed a cost-effective archiving solution for their Big Data. In this customer's case, one of their data sets comprises 100 TB of data made up of around 400 million inodes or files. These files are spread across 12–15 file servers, which are made available to users through 40 NFS mount points. The file servers are divided between two sites.

Figure 1. The customer's Big Data storage file system
While the data is being used, it is stored on disk arrays provided by the file servers. This gives high-speed, random access for the many processes that are involved during development. However, once development is complete, only minimal access is needed and it becomes increasingly expensive to store such large amounts of data on disk long term. HP Data Protector software is used to move the data from disk to tape for archive before it is removed from the disk arrays, but due to the complexity of the process and the need to make sure the data is no longer needed, this step is often delayed, negatively impacting overall storage utilization in the data center. If access to the tape archive is needed, the whole archive, or parts of it, has to be restored. For Data Protector administrators to perform the retrieval operation, pre-planned coordination with the company's IT support is required to source disk space. Any subsequent changes made to the retrieved data set result in additional manual support.
Goals
The goal of the case study is to use QStar ASM with an HP StoreEver tape library to provide a large-scale tNAS file system for archiving Big Data. The aim is to use a simple, standard copying technique, such as "rsync", and then be able to access the files later, either directly from the tNAS mount point or by copying a minimal amount of archived data back to disk. By making the archive available through a NAS mount point, the process of archive, access, and update becomes significantly easier. This allows data to be archived earlier, and accessed directly and more easily by team members, without the need for specific application training or IT support. The whole archive-retrieval-update workflow is thereby significantly simplified.
Challenges
There are a number of challenges in this customer's data set:
• The scale of the data. 100 TB is a great deal of data. Even at high transfer rates, it can take weeks just to move it from one storage location to another.
• The large number of files and inodes of varying size. Many files in the data set are very large, but many are also very small, including millions of symbolic links. Altogether, there are more than 400 million files and inodes, which would present a challenge to any file system.
• The file names. The characters in the file names may not be supported across all platforms. For instance:
– ! ATTENTION: KEY DATA SECTION
– data::day_GI::proj1_seq_A_seg_4
Both of these file names use ":", which is not supported by Microsoft® Windows® but is supported by Linux®. Because the Linear Tape File System (LTFS) shares the same naming constraints as Windows, the colon is not supported by LTFS either. Some LTFS solutions have put provisions in their code to replace these types of characters, but this comes at the expense of changing the file name on the tape cartridge.
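A quick way to gauge the scale of this problem is to scan a data set for names that Windows and LTFS cannot store natively. The following is a minimal sketch; the scan path and output file name are assumptions for illustration.

# Find files or directories whose names contain characters that Windows/LTFS reject.
find /work/group -name '*[<>:"|?*]*' -print > unsupported_names.txt
# Count how many entries are affected.
wc -l unsupported_names.txt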
Meeting the challenges
QStar ASM was developed on Linux and later supported on Windows. As such, the internal file system used by QStar ASM is actually a Linux file system. This means:
• There are no significant limits to the size of the file system. If it is currently supported by the file server, it will be supported by QStar ASM. This has been proven in the lab using about 50 million inodes and files in a single file system without issue.
• Symbolic links and file names with colons will work without issue.
• In this example, QStar's TDO (Tape/Disk Object) tape format is required due to the use of colons in the file names.
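As a simple illustration of the second point, names containing colons and symbolic links behave as ordinary Linux file-system objects on the tNAS mount. The mount point and link name below are assumptions based on the lab setup described later.

# Create a file whose name contains colons directly on the tNAS mount...
touch "/mnt/archive/data::day_GI::proj1_seq_A_seg_4"
# ...and a symbolic link pointing to it, then list both.
ln -s "data::day_GI::proj1_seq_A_seg_4" /mnt/archive/latest_segment
ls -l /mnt/archive/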
Lab testing

Equipment
To recreate the Big Data archive workflow, the following was set up in the lab with a sub-set of the data—around 12 TB:
• HP ProLiant DL380 G7 Server (DL380) client running Red Hat® Enterprise Linux 6.4 with Fibre Channel (FC)
• A second DL380 server running Red Hat Enterprise Linux 6.4 and QStar ASM 6.0 with FC
• HP P2000 G3 MSA Array System (P2000) with three RAID arrays over FC
– 6.5 TB for one mount point—for client data
– 6.5 TB for a second mount point—for client data
– 2 TB for the QStar cache—for the server
• Windows laptop running the QStar ASM remote administration GUI, which provides the same GUI as on Windows even though the QStar server is running on Linux (which doesn't provide a GUI)
• HP StoreEver MSL6480 Tape Library (MSL6480) with four HP LTO-6 tape drives connected over FC
• Two QStar ASM volumes set up using the remote administration GUI, mounted on the server, and exported to the client:
– Archive: For the main archive
– Import: For the import of the example Big Data using LTFS tapes from the customer

Figure 2. Lab test set up
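On the client, the exported volumes appear as ordinary NFS mounts. The sketch below shows how they might be mounted; the server hostname and export paths are assumptions for illustration.

# Mount the two exported QStar ASM volumes on the client over NFS.
mkdir -p /mnt/archive /mnt/import
mount -t nfs qstar-server:/archive /mnt/archive
mount -t nfs qstar-server:/import /mnt/import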
Investigation process
The customer saved the test data as tarballs and used HP StoreOpen to write them to three LTO-6 tapes in the LTFS format. With the files wrapped in tarballs, the file names were not exposed to LTFS, so they presented no problems. LTFS provided the portability needed between the two different environments, making the data transfer and copy process quite easy. These tapes held about 12 TB of test data between them. The LTFS tapes were imported into the import volume of QStar ASM, untarred directly (from the import mount), and spread across the two client mounts. This represented about 10 percent of the customer's data on two client data mount points.
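The unpacking step could be done with standard tar commands such as those below; the tarball names are assumptions, while the mount points match the lab setup.

# Unpack the customer tarballs from the QStar ASM import volume onto the two client data mounts.
tar -xf /mnt/import/dataset_part1.tar -C /media/P2000_vdisk1
tar -xf /mnt/import/dataset_part2.tar -C /media/P2000_vdisk2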
The data was then copied from the client data mount points to the QStar ASM archive volume mount point using rsync. No special options or processes were needed. For instance:

rsync -av /media/P2000_vdisk1/* /mnt/archive/
rsync -av /media/P2000_vdisk2/* /mnt/archive/

A portion of the data was then read back to a new area and compared with the original. Particular attention was paid to the data's file ownership, symbolic links, and file names to check for consistency.
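Such a comparison can be done with standard tools; the project directory and scratch area below are assumptions for illustration.

# Read a sample directory back from the archive into a scratch area.
rsync -av /mnt/archive/proj1/ /media/P2000_vdisk1/restore_check/
# Compare it against the original using checksums; with -n (dry run) and -i (itemize),
# any output line indicates a difference between the two trees.
rsync -avnci /media/P2000_vdisk1/proj1/ /media/P2000_vdisk1/restore_check/
# Spot-check ownership and symbolic links directly.
ls -l /media/P2000_vdisk1/restore_check/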
Data flow
The data flow for this test is outlined below. Refer to Figure 2.
1. The user uses rsync to copy data from the client data mount point located on one of the 6.5 TB arrays in the P2000 MSA Array System.
2. This data is copied to the QStar ASM volume mount point, which is virtualized by the QStar ASM service.
3. The QStar ASM service copies the data immediately to the QStar ASM volume cache on a different array in the P2000.
4. Based on policy, the QStar ASM service formats the data (using its TDO format) and writes it to tape in the tape library.
5. QStar ASM manages all the tape movement and drive selection.

Data retrieval uses the opposite flow and is managed by the QStar ASM service:
1. The user is able to see the entire archived file system through the QStar ASM volume mount point, provided and virtualized by the QStar ASM service.
2. The user uses rsync to copy the desired files from the QStar ASM volume mount point to the desired mount point on the client server. The destination mount point can physically be anywhere. In the test rig, it was on the P2000.
3. It is unlikely that the required data is still in the QStar ASM volume cache, therefore the QStar ASM service loads the appropriate tape into one of the tape drives in the library and reads the required data into the cache. If the data is still in the cache, this step is not needed.
4. The QStar ASM service supplies the data to satisfy the rsync copy, which is transferred from the cache to the desired destination. This is a mount point-to-mount point transfer, with the actual data moving from one P2000 array, via the QStar ASM server, to the client server and back to another P2000 array.
5. The data remains in the cache until the space is needed for further archiving or retrieval.
6. The tape remains in the drive for approximately 10 minutes in case it is needed again. This enables the system to be primed if subsequent retrievals from the same tape are needed.
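A typical partial retrieval is therefore just another rsync from the archive mount point; the directory names below are assumptions for illustration. Only the tapes holding the requested files are loaded and read.

# Retrieve one project directory from the archive back to a client working area.
rsync -av /mnt/archive/proj1_seq_A/ /media/P2000_vdisk2/working_set/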
Results and metrics
The process was very straightforward, therefore it was just a matter of watching the data being copied and measuring the performance, which averaged 50 MB/s. This is a reasonable data rate considering the variation in file size and the large number of small files. Previous testing has shown that a maximum rate of 140 MB/s is possible into a single QStar ASM volume with consistently large file sizes of 4 GB using a fast disk array. Although this test only represented about 10 percent of the customer's full Big Data set, the copy function took a long time—12 TB at 50 MB/s required over two full days. The read function also transferred at around 50 MB/s, which was expected for the same reasons. Analysis of the restored data showed that the file names, symbolic links, ownership, and permissions were all consistent with the source data.
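As a rough check on that elapsed time (treating 1 TB as 10^12 bytes):

\[ \frac{12 \times 10^{12}\ \text{bytes}}{50 \times 10^{6}\ \text{bytes/s}} = 2.4 \times 10^{5}\ \text{s} \approx 2.8\ \text{days} \]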
Increasing performance
Although 50 MB/s is a reasonable rate given the file sizes involved, it still takes a long time to transfer 100 TB of data, around 23 days, in fact! This is not untypical for the customer, as they currently spend multiple weeks copying and archiving their data post development. The transfer rate achieved above is to and from a single QStar ASM volume, for which the limiter is the overhead of small files and constructing the file system in the volume. Performance can be increased significantly by using more than one volume in parallel, as long as it is possible to divide the data between them. In this case study, the data had a well-defined file structure at the top level, therefore significant portions of it could be allocated to different volumes. For instance:
– /work/group/pre: archive 1
– /work/group/mstr: archive 2
– /work/group/post: archive 3
And so on. The volume management overheads on the server are very low, enabling even a single server to support 10 or more volumes. In this case study, the top-level structure had 5 sub-directories, allowing a straightforward use of 5 volumes and providing an opportunity to scale the performance to 250 MB/s. This could be increased further by dividing at lower points in the file system. Note that each volume requires at least one tape drive for writing, so a five-volume solution would need the tape library to be configured with:
• 5 tape drives for writing and reading back later
• 2 or more tape drives for reading (for any concurrent read access)
• 2 or more tape drives for maintenance (erasing tapes etc.) and spare
A five-volume solution would reduce the archive time from 23 to around 4 or 5 days, which is a significant reduction!

Figure 3. Example lab set up—twice the performance with two QStar ASM volumes
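In practice, the parallel approach can be as simple as running one rsync per volume; the archive mount-point names below are assumptions for illustration. At roughly 50 MB/s per volume, five such streams give the 250 MB/s aggregate rate described above.

# One rsync per QStar ASM archive volume, run in parallel.
rsync -av /work/group/pre/ /mnt/archive1/pre/ &
rsync -av /work/group/mstr/ /mnt/archive2/mstr/ &
rsync -av /work/group/post/ /mnt/archive3/post/ &
# Wait for all the copies to complete.
wait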
Extra features
QStar ASM can also write-protect the archive for security and supports automatic tape copies, which can be made to an offsite library, for added confidence.
Conclusions
QStar ASM with HP StoreEver Tape Libraries (such as the ESL G3 or MSL6480) provides an excellent tNAS solution for archiving Big Data. It handles large volumes of data and a variety of file sizes and file types exactly the same way as Linux, thereby providing a reliable file system for this type of data. It allows simple copying techniques, which greatly simplifies the process of subsequent access, retrieval, and update.

With the NFS interface, both archive and retrieval can be done using rsync or any other file system operation—without any independent software vendor (ISV) expertise—which allows anyone to access the data at any time. Making updates to the archive is also simplified, as files can be updated directly on the mount point of the archived data. Old versions of modified or deleted files are retained on tape and are still available using the "mount on date" feature of QStar ASM, which can be extremely useful.

It is possible to scale the performance by using multiple QStar ASM volumes as long as the data is readily divisible. This can drastically reduce the archive and retrieval time from weeks to days. Setting write protection and creating tape copies for offsite storage will help to ensure that your Big Data is safe and secure. If you have Big Data to archive, then the HP Tape-as-NAS (tNAS) solution using HP StoreEver Tape Libraries with QStar ASM can provide you with a cost-effective way to store and access all your information quickly, easily, and reliably.
Learn more at hp.com/go/StoreEver
© Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. Microsoft and Windows are trademarks of the Microsoft Group of companies. Red Hat is a registered trademark of Red Hat, Inc. in the United States and other countries. 4AA5-5144ENW, September 2014