Today, of course, finding the answer to this question tends to mean a trip online, rather than to some dusty old family records archive – to ancestry.com, genesreunited.co.uk, or indeed, familysearch.org.
It’s through FamilySearch, for example, that the organization’s CTO Tom Creighton discovered that, way back in his family history, certain ancestors were cattle thieves on the Anglo-Scottish border.
(This jumped out for me, since my own family were cattle farmers around that time, in the same border country. The name ‘Twentyman’ has its origins in the occupation of keeping cows for ‘two winters’, or ‘twinters’ in the local dialect, at a time when fodder to feed them for that long was scarce.)
But regardless of Creighton’s origins, or my own, there’s a huge database effort that goes on, behind the scenes, at FamilySearch. And, for many years, the Family Tree application that sits at the heart of the site struggled to keep pace with that demand using conventional relational database technology, as Creighton explains:
We knew that, at some time, we’d have billions of records on FamilySearch - and we were right. Today, the primary canonical dataset that we work with is around 1.2 billion records. We also expected to have a very high read transaction rate and a reasonably high write transaction rate and we needed a way to cope with that.
This read/write ratio was important to establish, he adds, because while many people visit FamilySearch simply to search for records, others build their own family trees on the site and contribute notes on their findings, and a large crowdsourcing effort enlists volunteers to correct and clarify some of the records on offer. All of this work is carried out through the Family Tree application.
And as that application grew in popularity, Creighton and his team realized they had vertically scaled their RDBMS technology as far as was cost-effective. It appeared that 60 million transactions per hour was going to be the absolute limit for the Family Tree application – and that this wouldn’t be enough. The site experiences its highest traffic volumes on Sundays, and on the older technology it was flirting with its capacity limits every week. Says Creighton:
We did take outages sometimes, because we were taxing four big server machines beyond their capacity. It wasn’t happening often, but I’ll admit there were outages sometimes.
NoSQL + cloud = solution
This led FamilySearch to conduct an in-depth, head-to-head comparison of various relational and NoSQL databases, before finally settling on the Cassandra database, using the DataStax Enterprise distribution of the open source technology. Says Creighton:
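Part of Cassandra’s appeal for a workload like this is horizontal scaling: rows are distributed across nodes by hashing a partition key, so capacity grows by adding machines rather than by buying ever-bigger servers. A minimal Python sketch of that idea follows – it is illustrative only (Cassandra itself defaults to a Murmur3 token ring with virtual nodes, and the record names here are hypothetical, not FamilySearch’s code):

```python
import hashlib

def partition_for(person_id: str, num_nodes: int) -> int:
    """Map a record key to a node index by hashing it,
    mimicking how a partitioner spreads rows across a cluster."""
    digest = hashlib.md5(person_id.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Hypothetical person records spread across a 4-node cluster;
# the hash gives a roughly even split, so each node handles
# roughly a quarter of the reads and writes.
counts = {}
for i in range(1000):
    node = partition_for(f"person-{i}", 4)
    counts[node] = counts.get(node, 0) + 1
print(counts)
```

One design point the sketch glosses over: with simple modulo hashing, adding a node reshuffles most keys onto different machines, which is why production systems such as Cassandra use consistent hashing over a token ring instead – growing the cluster then moves only a small fraction of the data.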
That went into production a little over a year ago. The transition to production took less than two hours of being offline. We really were ready in minutes, but we were doing extensive testing before we opened up to the public.
Today, he adds, FamilySearch routinely serves 125 million transactions per hour during peak usage, with plenty of room for future growth. And customers, he claims, experience faster response times, high availability and no database downtime. It’s all located in the cloud, he adds:
We run entirely in Amazon Web Services and we know we’re not getting anywhere near that red line of danger – we’re so far below it. All of a sudden, we have ten times the headroom at significantly less cost than our previous set-up.
Resilience and scalability have also opened the doors to new features and functions – like Record Hints, which helps users make new research discoveries. This would not have been possible with the previous infrastructure, says Creighton:
It’s really great to be able to implement features more rapidly and to implement marketing campaigns that might otherwise stress our systems – but today, the database behind our application doesn’t even breathe hard under pressure.