Intelligent document capture takes the spotlight in Bolshoi Theatre project
Summary: Thousands of volunteers are pitching in to support the Bolshoi Theatre's digital archive project in Moscow. They're aided by document capture specialist ABBYY and its machine learning and analytics tools.
In July, the Bolshoi Theatre in Moscow shocked the world of ballet when it cancelled the hotly anticipated premiere of its show about the life of revered Soviet dancer Rudolf Nureyev with just three days to go.
The reasons behind the cancellation of Nureyev remain unclear and have been the subject of intense speculation. The official explanation from the theatre is that the ballet simply wasn’t ready in time – but it hasn’t been rescheduled and many have suggested that government interference played a role, perhaps due to the ballet’s frank portrayal of Nureyev’s homosexuality.
Whatever the facts of the matter, the Bolshoi Theatre is no stranger to drama, both on-stage and off. Its 240-year history is littered with stories of government crackdowns on artistic expression, along with fires, defections from behind the Iron Curtain, intense personal rivalries and, more recently, a shocking acid attack, perpetrated by a dancer on the Bolshoi Ballet’s artistic director in 2013.
But for those more interested in the cultural masterpieces that the theatre has presented over more than two centuries, an ongoing crowdsourcing project, Discover the History of the Bolshoi Theatre, is doing much to bring new information into the spotlight. In the first stage of this project, over 4,000 volunteers from 60 countries helped to digitize 48,000 posters, 120,000 programmes and 100,000 rare photographs from the archives of the Bolshoi Theatre Museum.
During stage two, these historic artifacts are being analyzed using artificial intelligence (AI) and will be published to a global audience on the theatre’s website, underpinned by a database created by KAMIS, an archiving system for museums and libraries.
Behind the scenes
The idea to create a digital archive was first proposed by opera historian Eugene Tsodokov, with the goal of preserving the theatre’s history and making it more accessible to the public, according to Lidia Kharina, director of the Bolshoi Theatre Museum. Theatre staff took to the idea with enthusiasm, she says, and began manually entering the information from posters, programmes and photos into the KAMIS database. But it quickly became clear that this process was far too slow:
Soon, experts realized that it would take about 40 years to enter all the data correctly, manually. It was at this stage that the Theatre looked to use a technology partner with experience in digitizing objects of cultural heritage around the world.
The search led it to document capture specialist ABBYY, which, as Kharina points out, has a long track record in this field, with recent projects focusing on the digitization of plant records at the Royal Botanic Garden Edinburgh, and of the entire works of Leo Tolstoy for a project called “All Tolstoy in One Click”.
During stage one of the project, she explains, the Bolshoi used ABBYY FineReader, an all-in-one PDF solution with advanced optical character recognition (OCR) capabilities, to convert the scanned documents into digital text. Volunteers then scrupulously verified the captured text to find and correct mistakes that can occur during digitization. According to Kharina:
FineReader recognizes 192 languages. It even successfully handled the Old Russian spellings of the heritage content, as well as fragments written in French and English, in intricate fonts.
Stage two of the project, meanwhile, uses text analytics to categorize the unstructured data and put relevant information into the correct database fields of the digital archive, she says.
ABBYY AI experts have written complex and comprehensive rules to teach the data extraction algorithm to take into account the varied structures of heritage documents and the similarity of certain data entities such as ‘role’ and ‘last name’, as well as the order and so on. Machine learning algorithms have then learned from this linguistic and structural information and put the information into the correct database fields. Now, the volunteers are starting the verification process to find and fix any mistakes that could have occurred, making sure that the names, titles and musical instruments used are all found correctly and put in the right field.
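To make the idea of rule-based field extraction concrete, here is a minimal sketch in Python. It is not ABBYY's method, which the theatre has not described in detail. It assumes a hypothetical cast-list layout in which a role and a performer are separated by dot leaders or a spaced dash, and that the performer's last name comes last. The names `CAST_LINE`, `CastEntry` and `parse_cast_line`, and the sample lines, are invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed layout (hypothetical): "Role ..... I. Lastname" or "Role - I. Lastname".
# Real playbills vary far more, which is why hand-written rules alone fall short.
CAST_LINE = re.compile(
    r"^\s*(?P<role>.+?)\s*(?:\.{2,}|\s[-–]\s)\s*(?P<performer>.+?)\s*$"
)


@dataclass
class CastEntry:
    role: str        # character name, e.g. "Odette"
    last_name: str   # performer's surname
    initials: str    # anything preceding the surname, e.g. "A."


def parse_cast_line(line: str) -> Optional[CastEntry]:
    """Split one cast-list line into database-style fields, or return None."""
    match = CAST_LINE.match(line)
    if match is None:
        return None  # no role/performer separator found; leave for a human
    parts = match.group("performer").split()
    # Assumption: the surname is the final token of the performer text.
    return CastEntry(
        role=match.group("role"),
        last_name=parts[-1],
        initials=" ".join(parts[:-1]),
    )


if __name__ == "__main__":
    for sample in ("Odette ..... A. Ivanova", "Ballet in four acts"):
        print(repr(sample), "->", parse_cast_line(sample))
```

Lines that do not fit the assumed pattern return `None` rather than a guess, which mirrors the project's approach of routing uncertain records to volunteers for verification.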
The volunteers, drawn from all walks of life and given training and support depending on their technical ability, are using the web interface of the data capture software, accessed via a purpose-built website. Once this process is complete, Bolshoi Museum experts will verify the database, which will then gradually be published on the Theatre’s main website for public access. Storage for the processed data is provided by another project technology partner, NetApp.
History takes centre stage
The project is helping to uncover previously overlooked facts, patterns and insight, according to Kharina. For example, capturing the Theatre’s World War II collection provides a powerful way to communicate the extraordinary stoicism of the Russian people and the importance of artistic expression as a solace at some of the darkest times in the country’s history. She says:
The historic text comes alive through digitization, and all the information concealed within it is revealed to enrich our understanding of the past and our knowledge of world history. Previously, it was impossible to find the full and consistent record of Galina Ulanova’s ballet roles or Elena Obraztsova’s opera arias, or when Svetlana Zakharova made her debut on the Bolshoi stage. Questions such as these will be easy to answer with the new digital archive.