Fact Finders with Aaron Newcomb
Episode 5 Tale of the Tape
I'm Aaron Newcomb and on this episode of Fact Finders
we are going to find out what it takes
to store massive amounts of data,
especially when you can't throw that data away.
To do that, I'm going to go talk to Jason Hick
at the Lawrence Berkeley National Laboratory
and find out how they store their data.
Jason Hick, thanks for having us out to your facility today. -You bet.
What do you do here at Lawrence Berkeley National Laboratory?
I'm the Storage Systems Group Lead, so we manage a large tape library
and disk file system for science users
of the Office of Science within DOE.
And what kind of activity happens here at the laboratory?
Basically here we are a high performance computing facility
for all of the Department of Energy's Office of Science Users.
So we have roughly 4,000 users using our facility,
a broad range of science: astrophysics, biology,
genomics, and materials science.
And what kind of data do those experiments produce?
There are two kinds of data that are generated.
Generally they break down into experimental data
or simulation data.
Experimental data comes off of large scientific instruments:
telescopes, accelerators, light sources,
whereas simulation data comes from the simulations
that the scientists run to learn more about the science.
They have input and output decks, and they analyze the results.
What kind of data does that produce?
Is it small files, or large files or a mix?
The experimental data can be broad ranging;
it really depends on what kind of instrument it is,
but generally it is about the volume and throughput you can achieve,
ingest if you will, whereas the simulation data tends to be larger,
larger single files, and it is really more about the bandwidth
to process the simulation and keep up with it.
And how do you store all that data?
We have a center-wide file system and an archival storage system.
The center-wide file system is intended for random access
or for data processing in general, whereas the archive system
is meant primarily for long-term data storage.
What is interesting to us is that our archive system has a 30% read rate,
so it is a very active archive.
The archive system is a hierarchical storage system:
it has a very small disk cache but is predominantly tape.
All of the data ends up on tape; at the moment of ingest,
it may live on disk temporarily.
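The hierarchy Jason describes, a small disk cache in front of a large tape tier where new data lands on disk and migrates to tape, can be sketched in a few lines of Python. Every class and method name below is invented for illustration; this is not the HPSS API, just the general shape of a two-tier hierarchical store:

```python
# Illustrative sketch of a two-tier hierarchical store: newly ingested files
# land in a small disk cache and are migrated to tape as the cache fills.
# All names here are invented for illustration; this is not the HPSS API.

class HierarchicalStore:
    def __init__(self, disk_cache_capacity):
        self.disk_cache_capacity = disk_cache_capacity  # bytes of disk cache
        self.disk_cache = {}   # name -> size, recently ingested files
        self.tape = {}         # name -> size, long-term copies

    def cache_used(self):
        return sum(self.disk_cache.values())

    def ingest(self, name, size):
        """New data lives on disk first ("momentary ingest")."""
        # Migrate the oldest cached files to tape until the new file fits.
        while self.disk_cache and self.cache_used() + size > self.disk_cache_capacity:
            oldest = next(iter(self.disk_cache))
            self.tape[oldest] = self.disk_cache.pop(oldest)
        self.disk_cache[name] = size

    def location(self, name):
        """Report which tier currently holds a file."""
        if name in self.disk_cache:
            return "disk"
        if name in self.tape:
            return "tape"
        return None

store = HierarchicalStore(disk_cache_capacity=100)
store.ingest("run1.dat", 60)
store.ingest("run2.dat", 60)       # run1.dat migrates to tape to make room
print(store.location("run1.dat"))  # tape
print(store.location("run2.dat"))  # disk
```

In a real system like HPSS the migration policy is far richer (classes of service, purge policies, multiple hierarchy levels), but the core idea is the same: disk absorbs the ingest burst, tape holds everything long term.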
What does your tape environment look like?
It is right downstairs, why don't we go take a look?
Oh yes, that would be great.
Jason thanks for taking us down to the data center,
and this is really impressive.
Can you describe what we are looking at behind us here?
Yes, we have four SL8500 libraries and they are full of tapes,
so they have close to 30,000 tapes.
They've got the T10KC five terabyte cartridges,
so they comprise about 40 petabytes of scientific data.
We have redundant robots,
so there are actually eight robots in each of the libraries,
and they are constantly serving data to users.
And what kind of software do you use to manage this type of environment?
So we use the High Performance Storage System, or HPSS, software.
It is essentially a hierarchical storage manager,
and one of its main functions is to figure out,
as a user requests data, where that data resides,
get that tape from where it is into a tape drive,
and ultimately deliver that data to the user.
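The recall path Jason describes can be sketched as: look up which cartridge holds the requested file, mount that cartridge in a drive, then serve the data. Again, every name and cartridge label below is invented for illustration and is not the actual HPSS interface:

```python
# Illustrative sketch of a hierarchical storage manager's recall path:
# find which tape cartridge holds a file, mount it in a free drive, serve it.
# The catalog entries, labels, and class names are invented for illustration.

CATALOG = {                      # file path -> cartridge label
    "cosmology/run42.h5": "T10K-0173",
    "genomics/sample7.bam": "T10K-2851",
}

class TapeDrive:
    def __init__(self, drive_id):
        self.drive_id = drive_id
        self.mounted = None      # label of the cartridge currently loaded

    def mount(self, label):
        self.mounted = label     # the robot moves the cartridge into this drive

def recall(path, drives):
    """Return (cartridge, drive_id) used to serve the file, or None."""
    label = CATALOG.get(path)
    if label is None:
        return None              # file is not in the archive
    # Reuse a drive that already has the right cartridge mounted, if any.
    for drive in drives:
        if drive.mounted == label:
            return label, drive.drive_id
    # Otherwise mount the cartridge in the first idle drive.
    for drive in drives:
        if drive.mounted is None:
            drive.mount(label)
            return label, drive.drive_id
    return None                  # all drives busy; a real system would queue

drives = [TapeDrive(0), TapeDrive(1)]
print(recall("cosmology/run42.h5", drives))  # ('T10K-0173', 0)
print(recall("cosmology/run42.h5", drives))  # same drive, tape already mounted
```

The interesting part operationally is the reuse check: mounting a cartridge takes tens of seconds of robot and seek time, so a real HSM works hard to batch requests for the same tape before dismounting it.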
Have you also considered other software like LTFS, for example?
Yes, we have. LTFS is exciting for us
because it provides a mechanism that,
regardless of what file system or storage software a facility uses,
would allow us to exchange data between facilities.
So, what benefits do you get from your tape environment specifically?
There are really two main factors for us,
the first being reliability. Tape is ideal for long-term data storage.
It doesn't require power while data sits on it,
and we have data going back to 1979,
so data that is decades old does very well on tape.
And the second reason is economics. We find that at our facility
tape is ten times cheaper than the next storage mechanism
that we have available to us.
So, would it be possible to replicate this environment
using other storage technologies like disk or flash?
No, we have looked at that: it would require
about twice the floor space we are using now,
the power requirements would go through the roof,
and the cost factor is significant,
as I said, about ten times more than tape.
Well Jason thanks for having us out here today,
this is really interesting.
You bet, it was nice to meet you.
So thanks to Jason at the Lawrence Berkeley National Laboratory
we now know how to store massive amounts of data
using the SL8500 from Oracle. Also, he told me after the interview
that they are actually growing at a petabyte a month
and that they are going to have to expand their tape library
very, very soon. And the cost savings
associated with archiving large amounts of data
extend beyond scientific applications
to commercial and financial applications as well
and virtually anyone with long term data storage needs.
If you want to learn how to store your data on a tape library
head over to oracle.com/storage.
For now, I'm Aaron Newcomb and thanks for watching.