How graph database tech helped expose international tax evasion
- Summary:
- Mar Cabra, editor of the data and research unit at the International Consortium of Investigative Journalists, explains how graph database technology helped reporters ‘follow the money’ in the HSBC Swiss Leaks case
In February this year, the International Consortium of Investigative Journalists published on its website information relating to a giant tax evasion scheme allegedly operated with the knowledge and encouragement of British multinational bank HSBC, via its Swiss private banking subsidiary.
The ICIJ’s findings, outlined in its report Swiss Leaks: Murky Cash Sheltered by Bank Secrecy, sent shockwaves around the world. The report provided details of how over 100,000 individual clients and 20,000 offshore companies channelled more than 180 billion euros through accounts held with HSBC in Geneva between November 2006 and March 2007, and it accused the bank of profiting from the custom of tax evaders, money launderers, fraudsters and other wrongdoers.
For Mar Cabra, editor of the data and research unit at the ICIJ, the report’s publication was the culmination of a huge effort - one of the largest journalistic collaborations of all time - to analyse information held in some 60,000 files, originally leaked by a former HSBC staffer, software engineer Herve Falciani.
The ‘Swiss Leaks’ story was by no means the organisation’s first experience with a complex dataset, she says, but in order to pick apart its details, she saw straight away that the 170 journalists working on the case would need a data analysis tool that allowed them to probe the connections that no doubt existed between individual HSBC account holders.
The need to make thousands of documents searchable was something that the ICIJ could address with tools it already had in place: the open-source search software Apache Solr with a Blacklight front end, as well as the Nuix e-discovery platform. But the need to ‘join the dots’ between individuals presented a trickier challenge - until Cabra hit upon the idea of using a graph database.
It’s very important when we’re dealing with offshore companies and bank accounts created in secretive places like Switzerland to be able to make connections, because the connections are what matters in a case like this. They show you who’s doing business with who. That’s where the real stories lie.
For this, the ICIJ used the Neo4j graph database from Neo Technology with the Linkurious front-end, user-interface tool to explore the 275,000 ‘nodes’ (names of individuals, companies and their countries) and 400,000 relationships that existed between them. Cabra’s hunch was correct: this graph visualisation approach allowed ICIJ journalists to identify the connections between people and bank accounts, helping them to ‘follow the money’ to identify thousands of instances of fraud, corruption and tax evasion. Cabra explains:
What this combination of Neo4j and Linkurious provides is a simple way to present graphs so that reporters can explore data without having to rely on data scientists or developers acting as intermediaries. Many of our reporters are very traditional - they’ve been investigative journalists for a long time but they are not all that technically savvy.
What Linkurious does is allow them to click on dots that represent our nodes to expand networks. They don’t need to know about data analysis or data processing. They just search for a name and, if it’s in the database, it appears on their dashboard as a dot, along with details of how many other nodes have a connection to that dot. They double-click, and they can see all those dots and the links between them. They keep expanding and expanding, finding new connections all the time.
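The search-then-expand workflow Cabra describes can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Neo4j or Linkurious work internally: the names `build_graph` and `expand`, and the sample people, companies and country, are all hypothetical.

```python
from collections import defaultdict


def build_graph(relationships):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in relationships:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def expand(graph, visible, node):
    """Return the nodes on screen after 'double-clicking' a node:
    everything already visible, plus the node and its direct neighbours."""
    return visible | {node} | graph.get(node, set())


# Hypothetical sample data, invented purely for illustration.
relationships = [
    ("Person A", "Company X"),
    ("Person B", "Company X"),
    ("Company X", "Country S"),
    ("Person B", "Company Y"),
]
graph = build_graph(relationships)

# A reporter searches for a name: one dot appears, with its connection count.
visible = {"Person A"}
connection_count = len(graph["Person A"])

# Each double-click reveals the neighbours of the chosen dot.
visible = expand(graph, visible, "Person A")
visible = expand(graph, visible, "Company X")
```

After the two expansions, the reporter who started from a single name can also see the company, a second person linked to it, and the country it is tied to.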
This proved invaluable to the Swiss Leaks story - not just for discovery purposes, but also for fact-checking. The ICIJ is continuing to use the Neo4j/Linkurious combination as it works with reporters in countries where details of the HSBC scandal haven’t yet been published, helping them to find a local angle in the case. And it’s using the tools in new investigations, too, she says - a process helped by recent updates to Linkurious:
In the latest version, there are some nice new features: you can assign colours to dots to categorise them, for example, so where a dot represents a company, it might be green in the case of companies that are still operating and red in the case of those that have been closed down. And if you know how to code, which some of our journalists do, there’s the ability to create more complex queries for exploring more multi-dimensional relationships between people and organisations.
It’s not very easy or natural for the human brain to think in terms of networks, she says. That’s certainly true if you consider a complex family tree: you might be able to identify a third cousin, but would you be able to work out their relationship to your maternal great-grandfather? Cabra says:
I suppose you could get a piece of paper and start drawing dots and try to sketch out the links between the dots, but it isn’t something that our brain does easily and it’s certainly not scalable. That’s what’s so powerful about Linkurious and Neo4j: all of a sudden, all the data that the journalist had been searching for in documents is presented in a very different way, showing connections that they probably missed first time around.
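To make the family-tree point concrete, here is a minimal sketch of how software can work out the chain of links between two people, using a breadth-first search over a small graph. The family members and the `shortest_path` helper are hypothetical, and this is a generic textbook technique rather than a description of what Neo4j does under the hood.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of nodes linking
    start to goal, or None if the two are not connected."""
    if start == goal:
        return [start]
    previous = {start: None}  # maps each reached node to the node before it
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = current
            if neighbour == goal:
                # Walk back from the goal to the start, then reverse.
                path = []
                node = goal
                while node is not None:
                    path.append(node)
                    node = previous[node]
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical family graph: each entry lists a person's direct relatives.
family = {
    "you": ["mother"],
    "mother": ["you", "grandmother"],
    "grandmother": ["mother", "great-grandfather"],
    "great-grandfather": ["grandmother", "great-uncle"],
    "great-uncle": ["great-grandfather", "cousin"],
    "cousin": ["great-uncle"],
}

path = shortest_path(family, "cousin", "you")
```

Here `path` lists every person on the route from the cousin back to you, which is the kind of chain that is hard to hold in your head but trivial to compute once the data is stored as nodes and relationships.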
These abilities were a huge boost to the Swiss Leaks investigation, she says. Following the publication of the ICIJ’s report, more than 50 news organisations worldwide were only too happy to report on how HSBC had helped shelter criminals, traffickers and tax evaders - information that the bank would much rather have stayed hidden.