Fact Finders with Aaron Newcomb
Episode 5 Tale of the Tape
I'm Aaron Newcomb and on this episode of Fact Finders
we are going to find out what it takes
to store massive amounts of data,
especially when you can't throw that data away.
To do that, I'm going to go talk to Jason Hick
at the Lawrence Berkeley National Laboratory
and find out how they store their data.
Jason Hick, thanks for having us out to your facility today. -You bet.
What do you do here at Lawrence Berkeley National Laboratory?
I'm the Storage Systems Group Lead, so we manage a large tape library
and disk file system for science users
of the Office of Science within DOE.
And what kind of activity happens here at the laboratory?
Basically here we are a high performance computing facility
for all of the Department of Energy's Office of Science Users.
So we have roughly 4,000 users using our facility,
a broad range of science: astrophysics, biology,
genomics, and materials science.
And what kind of data do those experiments produce?
There are two kinds of data that are generated.
Generally they break down into experimental data
or simulation data.
Experimental data comes off of large scientific instruments:
telescopes, accelerators, light sources,
whereas simulation data comes from the simulations
that the scientists run to learn more about the science.
They have input and output decks, and they analyze the results.
What kind of data does that produce?
Is it small files, or large files or a mix?
The experimental data can be broad ranging;
it really depends on what kind of instrument it is,
but generally it is about the volume and throughput you can achieve,
ingest if you will, whereas the simulation data tends to be larger,
larger single files, and it is really more about the bandwidth
to process the simulation and keep up with it.
And how do you store all that data?
We have a center-wide file system and an archival storage system.
The center-wide file system is intended for random access
or for data processing in general, whereas the archive system
is meant primarily for long-term data storage.
What is interesting to us is that our archive system has a 30% read rate,
so it is a very active archive.
The archive system is a hierarchical storage system:
it has a very small disk cache but is predominantly tape.
All of the data ends up on tape; at the moment of ingest,
it may live on disk temporarily.
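The hierarchy Jason describes, a small disk cache in front of a large tape tier where new data lands on disk and migrates to tape, can be sketched in a few lines of Python. Every class and method name below is invented for illustration; this is not the HPSS API, just the general shape of a two-tier hierarchical store:

```python
# Illustrative sketch of a two-tier hierarchical store: newly ingested files
# land in a small disk cache and are migrated to tape as the cache fills.
# All names here are invented for illustration; this is not the HPSS API.

class HierarchicalStore:
    def __init__(self, disk_cache_capacity):
        self.disk_cache_capacity = disk_cache_capacity  # bytes of disk cache
        self.disk_cache = {}   # name -> size, recently ingested files
        self.tape = {}         # name -> size, long-term copies

    def cache_used(self):
        return sum(self.disk_cache.values())

    def ingest(self, name, size):
        """New data lives on disk first ("momentary ingest")."""
        # Migrate the oldest cached files to tape until the new file fits.
        while self.disk_cache and self.cache_used() + size > self.disk_cache_capacity:
            oldest = next(iter(self.disk_cache))
            self.tape[oldest] = self.disk_cache.pop(oldest)
        self.disk_cache[name] = size

    def location(self, name):
        """Report which tier currently holds a file."""
        if name in self.disk_cache:
            return "disk"
        if name in self.tape:
            return "tape"
        return None

store = HierarchicalStore(disk_cache_capacity=100)
store.ingest("run1.dat", 60)
store.ingest("run2.dat", 60)       # run1.dat migrates to tape to make room
print(store.location("run1.dat"))  # tape
print(store.location("run2.dat"))  # disk
```

In a real system like HPSS the migration policy is far richer (classes of service, purge policies, multiple hierarchy levels), but the core idea is the same: disk absorbs the ingest burst, tape holds everything long term.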
What does your tape environment look like?
It is right downstairs, why don't we go take a look?
Oh yes, that would be great.
Jason thanks for taking us down to the data center,
and this is really impressive.
Can you describe what we are looking at behind us here?
Yes, we have four SL8500 libraries and they are full of tapes,
so they have close to 30,000 tapes.
They've got the T10KC five terabyte cartridges,
so they comprise about 40 petabytes of scientific data.
We have redundant robots,
so there are actually eight robots in each of the libraries,
and they are constantly serving data to users.
And what kind of software do you use to manage this type of environment?
So we use the High Performance Storage System, or HPSS, software.
It is essentially a hierarchical storage manager,
and one of its main functions is to figure out,
as a user requests data, where that data resides,
get that tape from where it is into a tape drive,
and ultimately deliver that data to the user.
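The recall path Jason describes can be sketched as: look up which cartridge holds the requested file, mount that cartridge in a drive, then serve the data. Again, every name and cartridge label below is invented for illustration and is not the actual HPSS interface:

```python
# Illustrative sketch of a hierarchical storage manager's recall path:
# find which tape cartridge holds a file, mount it in a free drive, serve it.
# The catalog entries, labels, and class names are invented for illustration.

CATALOG = {                      # file path -> cartridge label
    "cosmology/run42.h5": "T10K-0173",
    "genomics/sample7.bam": "T10K-2851",
}

class TapeDrive:
    def __init__(self, drive_id):
        self.drive_id = drive_id
        self.mounted = None      # label of the cartridge currently loaded

    def mount(self, label):
        self.mounted = label     # the robot moves the cartridge into this drive

def recall(path, drives):
    """Return (cartridge, drive_id) used to serve the file, or None."""
    label = CATALOG.get(path)
    if label is None:
        return None              # file is not in the archive
    # Reuse a drive that already has the right cartridge mounted, if any.
    for drive in drives:
        if drive.mounted == label:
            return label, drive.drive_id
    # Otherwise mount the cartridge in the first idle drive.
    for drive in drives:
        if drive.mounted is None:
            drive.mount(label)
            return label, drive.drive_id
    return None                  # all drives busy; a real system would queue

drives = [TapeDrive(0), TapeDrive(1)]
print(recall("cosmology/run42.h5", drives))  # ('T10K-0173', 0)
print(recall("cosmology/run42.h5", drives))  # same drive, tape already mounted
```

The interesting part operationally is the reuse check: mounting a cartridge takes tens of seconds of robot and seek time, so a real HSM works hard to batch requests for the same tape before dismounting it.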
Have you also considered other software like LTFS, for example?
Yes, we have. LTFS is exciting for us
because it provides a mechanism that,
regardless of what file system or storage software a facility uses,
would allow us to exchange data between facilities.
So, what benefits do you get from your tape environment specifically?
There are really two main factors for us,
the first being reliability. Tape is ideal for long-term data storage.
It doesn't require power while data sits on it,
and we have data going back to 1979,
so data that is decades old does very well on tape.
And the second reason is economics. We find that at our facility
tape is ten times cheaper than the next storage mechanism
that we have available to us.
So, would it be possible to replicate this environment
using other storage technologies like disk or flash?
No, we have looked at that: it would require
about twice the floor space we are using now,
the power requirements would go through the roof,
and the cost factor is significant,
as I said, about ten times more than tape.
Well Jason thanks for having us out here today,
this is really interesting.
You bet, it was nice to meet you.
So thanks to Jason at the Lawrence Berkeley National Laboratory
we now know how to store massive amounts of data
using the SL8500 from Oracle. Also, he told me after the interview
that they are actually growing at a petabyte a month
and that they are going to have to expand their tape library
very, very soon. And the cost savings
associated with archiving large amounts of data
extend beyond scientific applications
to commercial and financial applications as well
and virtually anyone with long term data storage needs.
If you want to learn how to store your data on a tape library
head over to oracle.com/storage.
For now, I'm Aaron Newcomb and thanks for watching.