Scotland’s National Library catalogs value from object storage

By Gary Flood October 14, 2019
Summary:
Object storage tech is preserving digital assets for future generations.

National Library of Scotland

With over 31 million items in its collections, the National Library of Scotland is Scotland’s largest library. Its size is partly down to the fact it’s a legal deposit library - which means it has the right to claim a copy of everything published in the UK - but it also collects a lot more than just books. In fact, its archive contains all types of materials, from books and magazines to maps, music, photographs, postcards, newspapers, and ephemera such as theatre posters and election flyers.

As the Library’s Associate Director of Digital, Stuart Lewis, says: 

We use all these collections, through the expertise of our staff, to fulfil the Library’s mission to make a significant and lasting contribution to global knowledge and the memory of the world.

Some 78% of that collection is physical, and in itself is a massive storage task, taking up as it does approximately 120 miles of space – the same length as the London orbital M25 motorway. But as an ever-growing chunk - 21% and counting - of the Library’s collections are now in a digital format, the challenge of what to do about that side of the house was becoming more and more of an issue, Lewis points out.

We ingest significant amounts of born-digital materials - over a million items per year, including collections such as the annual .uk web domain crawl. We are also undertaking mass digitisation of books and maps (over 200,000 last year), as well as films and videos, which can generate very large files.

We currently have seven SANs, which takes significant time and resource to manage separately. As a result, we needed storage that could grow painlessly over the coming years, providing significant extra storage for unstructured data without needing to work around file system limitations, and which could help me and the team in terms of automating the lift-and-shift required every 5 to 8 years as SANs are replaced.

Finally, given the nature of the digital archive files, which don’t change, but which we want to read and validate them all periodically, tape backup was proving unwieldy. The Library had taken a decision to store three copies of all data in different geographic locations, and we wanted a system that could provide two of these copies, and to automate the replication, too.

Armed with this stiff set of requirements, Lewis and his team went to market to buy a solution for their needs and, based on what he calls “a combination of quality and value for money”, eventually awarded the contract to Scality. Scality is a specialist in software-defined native file and object storage for large-scale, on-premise storage of unstructured data, a product it calls ‘Ring’. Lewis says that the company advised him to configure the new storage solution as a dual-Ring topology, with two separate ‘Rings’, one in the Library’s Edinburgh datacentre and another in its Glasgow datacentre:

This ensures we have two full copies of the data, so not only is a single site highly fault-tolerant, with the ability to lose servers and disks, we are also able to lose a complete datacentre with no concerns.

Using S3 ‘buckets’ - S3 standing for Amazon Simple Storage Service, a service offered by Amazon Web Services that provides object storage through a web service interface - the Library now has no problems with managing the number of files in a given structure:

We make use of the S3 Cross Region Replication (CRR) to automatically replicate files, so when files are created in Edinburgh, they replicate to Glasgow, and where created in Glasgow, they replicate back to Edinburgh.
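The two-way replication Lewis describes can be sketched as a pair of replication rules, one per bucket, each pointing at the other site. The Python below only builds and prints the rule documents in the shape Amazon's S3 replication API uses. The bucket names, rule IDs, and role string are hypothetical placeholders, and whether the Library's storage accepts exactly this document is an assumption, not something the article confirms.

```python
import json


def replication_config(destination_bucket: str, role: str, rule_id: str) -> dict:
    """Build a replication configuration in the shape used by the S3 API."""
    return {
        "Role": role,
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # empty filter: the rule covers every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


# Hypothetical names: one bucket per site, plus a placeholder role.
EDINBURGH = "collections-edinburgh"
GLASGOW = "collections-glasgow"
ROLE = "arn:aws:iam::000000000000:role/replication"

# Each bucket replicates to its counterpart, giving two-way replication.
configs = {
    EDINBURGH: replication_config(GLASGOW, ROLE, "edinburgh-to-glasgow"),
    GLASGOW: replication_config(EDINBURGH, ROLE, "glasgow-to-edinburgh"),
}

print(json.dumps(configs, indent=2))
```

On Amazon's own service, a document like this would be applied per bucket with a call such as boto3's `put_bucket_replication`, and both buckets would need versioning enabled first. Treat both points as assumptions for any S3-compatible system.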

The Library’s initial setup provides about a petabyte (PB) of storage for each site, but Lewis is happy to say that the servers aren’t even half populated yet:

This means we can easily expand to more than 2PB per site without adding any further servers - and when we need to, we can get more than 4PB per rack due to the density of our HPE Apollo servers and their storage.

So a very convincing solution to that growing digital archive problem, it seems. In terms of next steps, he states, while the Library first purchased object storage to hold its digital collections, it is now looking at using it to completely replace tape and move to disk-to-disk backups, as well as to provide more traditional CIFS filestore:

A side benefit of all that is by consolidating most of our data storage onto this technology, we can then reduce our SAN requirement, potentially moving to flash-only SANs for the remaining VM workload - we are about to decommission three SANs, which will provide savings in terms of staff time, separate maintenance contracts, and infrastructure such as fibre channel.

The supplier also built the Library a ‘checksum checker’ which periodically verifies the integrity of all objects, by reading them and recomputing checksums, the results of which are delivered through a dashboard. This provides the team, he says, with an extra bit of reassurance that the files being preserved for future generations are still intact.
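The idea behind such a checker (read each object, recompute its checksum, compare with the value recorded at ingest) can be illustrated with a minimal Python sketch. This is not the supplier's tool. It assumes SHA-256 as the checksum, uses local files to stand in for stored objects, and the function names and manifest format are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest: dict, root: Path) -> dict:
    """Compare each object against its recorded checksum.

    `manifest` maps an object name to the hex digest recorded at ingest.
    Returns a status per object: 'ok', 'corrupt', or 'missing'.
    """
    report = {}
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file():
            report[name] = "missing"
        elif sha256_of(path) == expected:
            report[name] = "ok"
        else:
            report[name] = "corrupt"
    return report
```

Run on a schedule, the report returned by `verify` is the kind of per-object result a dashboard could summarise, for example as counts of intact, corrupt, and missing objects.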

Stable, robust infrastructure

What other benefits are accruing from an object storage approach? Some of the ongoing maintenance load has now been transferred from infrastructure staff to power-users, as they are able to set up their own buckets and replication, Lewis says, rather than having to configure LANs and file-servers. Even better: if the Library really can move away from local tape storage, that will also bring significant savings from fewer tape systems and hence, reduced overall maintenance costs.

So positive has the experience of working with object storage been, concludes Lewis, that he is encouraging partner Scottish cultural organisations to make the switch to S3 and multiple copies of data rather than SAN and disk, which he will be able to offer using the supplier’s multi-tenancy functionality:

As we continue to collect more and more digital collections and to make these freely available online, having a stable and robust infrastructure like this that can cope with the growing volume of data is of great benefit to us.