One of the themes of 2020 concerned the ‘democratisation of data’, its growing importance and the changes it might bring to how data is stored and accessed. The key factor here is that with the continued advances in analytics tools, coupled with the arrival of Artificial/Augmented Intelligence and machine learning, the range of applications to which analytics can be applied has grown beyond just the management of real time events in the here and now, stretching that approach back into history.
Using all available data and making it far more widely available to staff within a business is becoming the order of the day, and that means bringing legacy data in from the cold of its deep dark siloes. Traditionally such data has been stored on tape and the expectation has been that the vast majority of it will never be read again. Accessing it has therefore not been a high priority, and has nearly always been driven by a policy of ‘needs must’.
Democratising data is changing this policy position significantly, for comparing all the legacy data with all the current, real time data can provide businesses with a much richer understanding of their management requirements out into the future.
That in turn has led vendors in the tape business to push ever harder on the R&D levers, as recently demonstrated by IBM and Fujifilm. The pair have combined new tape read/write technologies from IBM with new, thinner tape materials and strontium ferrite magnetic materials from Fujifilm. The result is the claim of achieving a new areal density record for tape storage systems. Using the current hand-sized tape cartridge format, this would offer the possibility of individual cartridges holding up to 580 Terabytes of native storage capacity and more than a Petabyte of compressed data.
Together with IBM’s tape library systems and LTFS (Linear Tape File System), this could provide nearline data availability that is said to be analogous to using USB stick memories of humongous capacity.
This work is still at an early stage for both partners and IBM does not expect to see it appearing in available products for at least a couple of years. This is because there is still a good deal of engineering development to go through to get from the current experimental rigs to production-ready systems. It does, however, see it being the core technology of highly scalable tape storage systems out to 2029 and beyond.
CERN, the famous particle physics research establishment that straddles the Swiss/French border, is already firmly committed to the extensive use of nearline data for much of its research activity, according to Alberto Pace, CERN’s Head of Data Storage. His primary job is providing huge volumes of data storage capacity to support and record the science undertaken there. But he also has a lesser-known role in distributing the data appropriately around some 12,000 physicists located throughout the world. In that role there is the need to preserve the data in the long term, and ensure that none of it is lost:
“We have more than 3,000 servers, with 220 Petabytes of data on 70,000 disks - and this is for the online part, the hot data. But then we also have 30,000 tape cartridges, which account for 360 Petabytes, in libraries, which are Nearline. It's not that they are cold and disconnected, but they are available on demand as needed. Delivering storage services for large scientific experiments is sometimes more than just providing this storage space, because of the complete set of features and requirements that we have to satisfy the reliability and access control, to ensure that the data is read and modified by the people that are allowed to do that, because we also have a very strong open access policy. And then, of course, the archives, the history, the long-term preservation of this data, this to empower the implementation of a specific workflow that every experiment or every scientific activity needs.”
Like just about every other heavy duty user of IT, Pace expects CERN to continue doubling the amount of data stored every year, with the assumption that this will reach the Exabyte scale within the next three or four years. Tape storage and delivery to end users is therefore now a key part of the overall storage strategy:
“Clearly it's energy efficient and, when you have this large amount of data to preserve, the lowest cost per Terabyte is essential as the lower the cost, the more you can afford to store. The main disadvantage that many people perceived as being a problem, the high latency, for us is a major advantage, because it's really a requirement that we need. When we want to do long term data preservation we really want to ensure that the data cannot get deleted accidentally. And when you have a large amount of data online on disks or flash, that is always possible in just a matter of minutes, if not seconds. With tapes it would require years of systematic work to delete the data.”
Another long-term strategic goal he sees tape storage hitting is that of scalability. Here, the high areal densities and fast data throughput that he sees coming with tape developments, such as those in play from IBM and Fuji, will play a major role.
According to Dr Mark Lantz, manager of the Advanced Tape Technologies group in the Cloud and Computing Infrastructure Department at the IBM Zurich Research Laboratory, the whole tape storage industry is experiencing a renaissance. This is why IBM assigns increasing importance to the work being carried out in Zurich. IBM’s current top-of-the-line systems here include the TS1160 tape drive, which works with hand-sized cartridges that each hold 20 Terabytes of data and fit into automated robotic library systems of up to 128 drives per library.
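To put the cartridge capacities quoted in this piece side by side, the following Python sketch estimates how many cartridges an archive of a given size would need. It is illustrative only: the function name is hypothetical, it assumes decimal units (1 Petabyte = 1,000 Terabytes), it counts native capacity only, and it ignores real-world overheads such as redundant copies and partially filled cartridges. The 580 Terabyte figure is the laboratory demonstration described above, not a shipping product.

```python
import math

TB_PER_PB = 1000  # assumption: decimal units, 1 PB = 1,000 TB


def cartridges_needed(archive_pb: float, cartridge_tb: float) -> int:
    """Return the whole number of cartridges needed to hold an archive.

    Native capacity only; no allowance for compression or redundancy.
    """
    return math.ceil(archive_pb * TB_PER_PB / cartridge_tb)


if __name__ == "__main__":
    archive_pb = 360  # nearline tape archive size quoted by CERN in this article
    capacities_tb = {
        "20 TB cartridge (current)": 20,
        "580 TB cartridge (demonstrated)": 580,
    }
    for label, capacity_tb in capacities_tb.items():
        count = cartridges_needed(archive_pb, capacity_tb)
        print(f"{label}: {count} cartridges")
```

On these assumptions, a 360 Petabyte archive needs 18,000 cartridges at 20 Terabytes each, against 621 at the demonstrated 580 Terabytes.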
The hard disk drive (HDD) sector, meanwhile, has experienced a dramatic slowing down in the last few years, particularly in the rate of growth in areal storage density. Lantz says:
“Currently, it's scaling at less than 8% compound annual growth rate. Areal density scaling of HDD is critical, because historically that's what's driven the Dollars-per-Gigabyte scaling. The net result is the data center is getting out of balance. We're currently creating data at a much faster rate than we can afford to store it, at least if we want to store all of that data on spinning disk. Fortunately a large fraction of the data that's out there is what we call cool. It hasn't been accessed in a long time or it's very infrequently accessed, and when it is it can tolerate much higher latency. [Tape] has by far the lowest total cost of ownership for storing data and if the data isn't actually being accessed, it doesn't consume any power, so it's an extremely green technology.”
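The imbalance Lantz describes is a matter of compound growth, and a short Python sketch makes the gap concrete. It is a back-of-the-envelope calculation, not a forecast: the function names are hypothetical, the 8% figure is treated as a constant upper bound on HDD areal density growth, and the 100% rate stands in for the annual doubling of stored data that Pace expects at CERN.

```python
import math


def doubling_time_years(cagr: float) -> float:
    """Years for a quantity to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + cagr)


def growth_factor(cagr: float, years: float) -> float:
    """Total multiplication of a quantity after compounding for `years` years."""
    return (1 + cagr) ** years


if __name__ == "__main__":
    hdd_cagr = 0.08   # upper bound on HDD areal density growth quoted by Lantz
    data_cagr = 1.00  # assumption: stored data doubles every year, as Pace expects
    horizon = 5       # illustrative horizon in years (an assumption)

    print(f"HDD density doubling time: {doubling_time_years(hdd_cagr):.1f} years")
    print(f"HDD density growth over {horizon} years: {growth_factor(hdd_cagr, horizon):.2f}x")
    print(f"Data growth over {horizon} years: {growth_factor(data_cagr, horizon):.0f}x")
```

At 8% a year, areal density takes roughly nine years to double and grows by about 1.47 times over five years, while data that doubles annually grows 32-fold over the same period.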
Lantz also observes that tape’s place in the storage panoply is being seriously re-assessed by leading hyperscale cloud service providers. He points out that while Microsoft was saying tape was dead back in 2016, it now argues that all cloud vendors will be using tape at a scale never seen before. The reasons are the low cost of ownership for storing large volumes of data, coupled with tape’s ability to scale much further at a time when disk technology appears to be reaching the end of the line.
Another important factor is greater security, concludes Lantz:
“Tape technology provides a natural air gap, an extra barrier against unintentional or malicious attacks against data. At the same time, we have built in on-the-fly encryption in the drive and we also continue to innovate in this space. In fact, last year, we announced a prototype of the world's first tape drive that implements quantum computing safe encryption technology.”
Essentially, the growth in the overall volume of data has growth in file and dataset sizes as a natural consequence. With everyone demanding HD quality graphics and videos as table stakes, that is inevitable. Even where applications require random access, the size of the ‘unit of work’, for want of a better term, starts to put tape in a much stronger position than might have been expected. It starts to make sense in many roles where lightning-fast access times and constant random access data swapping are not the highest priority and linear data reading or writing is the order of the day. While that won’t be in real time event analytics, as soon as comparison between current events and historical data is required it will have a role – and that will be even more the case as the democratisation of data continues and grows.