
Optical character recognition takes root in the Royal Botanic Garden Edinburgh's digital archive

Jessica Twentyman, July 16, 2014
Summary:
Capturing digital images of plant specimens and presenting them online offers a new and better way to achieve broad accessibility for botanists worldwide.

Elspeth Haston

The preserved plant specimens stored in a herbarium are almost heartbreakingly beautiful, delicate objects. Dried between pieces of paper, mounted on stiff card and often dating back centuries, they are nevertheless expected to perform their role as part of a working reference collection, handled frequently by researchers and sent out on loan to other herbaria for scientists to study.

At the Royal Botanic Garden Edinburgh (RBGE), digitization offers a new and better way to share these specimens with botanists and plant enthusiasts worldwide and, in the process, make a vital contribution to global conservation efforts, according to assistant curator Dr Elspeth Haston.

Haston is rightfully proud of RBGE’s herbarium: much of the collection dates from the 19th century, but its oldest specimen was collected in 1697. Together, the almost three million specimens represent around half to two-thirds of the world’s flora, making the herbarium a world-leading botanical collection. Between 10,000 and 20,000 new specimens are added each year, she adds.

This is an internationally important resource, but for hundreds of years, the options for sharing our specimens, so that vital conservation work could be carried out, were limited to people visiting our herbarium in person or us sending specimens out on loan. There’s always a risk of damage, always a risk of loss.

So a key thing for us has been increasing online access to the collection, so we can ramp up research that is being carried out on biodiversity - while we’ve still got biodiversity to protect.

Ten years ago, with funding from the Andrew Mellon Foundation, RBGE was able to start imaging its specimens - but it was still left with the problem of how to capture the text, either handwritten or typed, on the labels of herbarium specimens, in a variety of fonts and languages.

While many of the characteristics of a plant will be immediately visible from the dried specimen, Haston explains, some are not and must be captured in text: where and when it was collected, by whom, the habitat in which it was growing, flower colour or scent, and the height of the tree or plant from which it was collected, for example.

Previously, she says, much of this work was done through manual entry by an RBGE employee - a process that was time-consuming and often resulted in incomplete records in the database:

It quickly became clear that digitising the whole collection was never going to be achievable with that kind of effort.

What was needed was a way to capture the text on labels of herbarium specimens, without losing any information, even from source materials of poor quality or high complexity. And, into the mix, RBGE needed technology that could be incorporated into an existing workflow system and integrated with its Image Management System, where it keeps its digital images in TIFF (tagged image file format) files.

Recognizing character

After attending a workshop run by the British Library, Haston and her team hit upon optical character recognition (OCR) as a way to automatically ‘read’ the text on labels and convert it into editable, usable digital information. After conferring with curators at other leading herbaria, the RBGE team selected ABBYY Recognition Server as the best fit.

Today, this technology is used to convert images to text documents for the purpose of classifying, searching and exporting information to RBGE’s internal system for document storage and management. Recognition Server accesses existing TIFFs in a folder on RBGE’s Image Management System.

After processing the high-resolution specimen images through Recognition Server, two output files are created: first, a searchable image PDF that RBGE uses as a back-up; second, a plain text file, which is saved in a specified folder on the server.

RBGE’s existing workflow picks up the plain text file from this location and enters it into a MySQL database, from where it is easily accessible to researchers worldwide through RBGE’s website, as well as other online resources such as the Global Biodiversity Information Facility, Europeana.eu (an online collection of millions of digitised items from European museums, libraries and archives), the Encyclopedia of Life and the Global Plants website.
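As a rough illustration of the pick-up step described above, the sketch below reads plain text files from a folder and stores each one as a database row. It is not RBGE's actual workflow: the table name `ocr_text`, the column names, the function `load_ocr_text` and the idea of keying each row on the file name are all assumptions, and Python's built-in `sqlite3` stands in for MySQL so that the example runs without a database server.

```python
import sqlite3
from pathlib import Path


def load_ocr_text(folder: Path, conn: sqlite3.Connection) -> int:
    """Store every .txt file in `folder` as one row; return the number of files read."""
    # Hypothetical schema: one row per specimen, keyed on the file name stem.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_text ("
        "barcode TEXT PRIMARY KEY, label_text TEXT NOT NULL)"
    )
    loaded = 0
    for path in sorted(folder.glob("*.txt")):
        # OCR output can contain odd bytes; replace them rather than fail.
        text = path.read_text(encoding="utf-8", errors="replace").strip()
        # INSERT OR REPLACE means a re-run overwrites rows instead of duplicating them.
        conn.execute(
            "INSERT OR REPLACE INTO ocr_text (barcode, label_text) VALUES (?, ?)",
            (path.stem, text),
        )
        loaded += 1
    conn.commit()
    return loaded
```

Keying on the file name is only one possible choice; it assumes each text file is named after the specimen image it came from.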

While new specimens often arrive with accompanying data that can immediately be loaded into the database, there’s still a massive backlog to work through, says Haston:

We’ve now databased around two-thirds, or 660,000, of our specimens to some level - but all the ones that are not yet databased need to be tackled. And we only have around 10 curation staff, along with temporary project staff when we have the funding, to help.

The scale of the job is huge. We’ve accepted we can’t capture all of the data straight away: one way is to image the specimens and run them through the OCR software. We can’t automatically parse all that complex data into the various fields - which would be nice - but instead we’re more pragmatic. We use the OCR to sort the specimens into different batches, by collector, or by country, or both. And that allows us to database specimens much faster, because a single person, concentrating on a single collector or a single country, can quickly build up expertise in that field.

But also, if we can batch them in that way, we’re now thinking that we can bring in ‘citizen scientists’ and create projects for them, where we ask them to help. So a data-entry volunteer could focus on a really attractive project on Charles Darwin or Tierra del Fuego, for example.
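The batching approach Haston describes, sorting specimens by collector, by country, or both without parsing every field, might look something like the sketch below. This is a guess at the general idea rather than RBGE's method: the keyword lists, the function names and the simple case-insensitive substring match are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical keyword lists; a real project would need far longer ones.
COLLECTORS = ("Darwin", "Hooker")
COUNTRIES = ("Chile", "Nepal")


def first_match(text: str, keywords: tuple[str, ...]) -> str:
    """Return the first keyword found in the text (case-insensitive), else 'unknown'."""
    lowered = text.lower()
    for keyword in keywords:
        if keyword.lower() in lowered:
            return keyword
    return "unknown"


def batch_specimens(ocr_texts: dict[str, str]) -> dict[tuple[str, str], list[str]]:
    """Group specimen IDs into batches keyed by (collector, country)."""
    batches = defaultdict(list)
    for specimen_id, text in ocr_texts.items():
        key = (first_match(text, COLLECTORS), first_match(text, COUNTRIES))
        batches[key].append(specimen_id)
    return dict(batches)
```

Labels that match nothing fall into an `("unknown", "unknown")` batch, so they are still queued for a person to look at rather than silently dropped.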

The final goal, of making the collection available to anyone, from anywhere, is money-dependent, notes Haston:

We’re aiming for the broadest availability possible. If we had the funding, we could do it in around five years. That would be great.

Our taxonomists based in Edinburgh are naming or describing, on average, about one new species a week. With flowering plants, we think around 10% to 20% have not yet been discovered and described. They’re still out there, but as many as half of them could already be sitting in herbaria worldwide with an ‘unknown’ label on them.

But what RBGE has proved is that it already has the willingness, the competency and the technology it needs to achieve its goal, she concludes:

The decision to use OCR software is a great argument on our behalf, when it comes to new funding. So is the amount of use we’ve seen of the specimens we’ve already made available online. The number of downloads clearly proves the need.
